Trustworthy Enhanced Multi-view Multi-modal Alzheimer’s Disease Prediction with Brain-wide Imaging Transcriptomics Data

Shan Cong
Qingdao Innov & Dev Center
Harbin Engineering University
Qingdao, China

Zhoujie Fan
Qingdao Innov & Dev Center
Harbin Engineering Univ.
Qingdao, China

Hongwei Liu
College of Intel Sys Sci & Eng
Harbin Engineering Univ.
Harbin, China

Yinghan Zhang
Department of Surgery
City University of Hong Kong
Hong Kong, China

Xin Wang
Department of Surgery
City University of Hong Kong
Hong Kong, China

Haoran Luo
College of Intel Sys Sci & Eng
Harbin Engineering University
Harbin, China

Xiaohui Yao
Qingdao Innovation & Development Center
Harbin Engineering University
Qingdao, China

Abstract

Brain transcriptomics provides insights into the molecular mechanisms by which the brain coordinates its functions and processes. However, existing multimodal methods for predicting Alzheimer’s disease (AD) primarily rely on imaging and sometimes genetic data, often neglecting the transcriptomic basis of the brain. Furthermore, while striving to integrate complementary information between modalities, most studies overlook the disparities in informativeness between modalities. Here, we propose TMM, a trusted multiview multimodal graph attention framework for AD diagnosis, using extensive brain-wide transcriptomics and imaging data. First, we construct view-specific brain regional co-function networks (RRIs) from transcriptomics and multimodal radiomics data to incorporate interaction information from both biomolecular and imaging perspectives. Next, we apply graph attention (GAT) processing to each RRI network to produce graph embeddings and employ cross-modal attention to fuse the transcriptomics-derived embedding with each imaging-derived embedding. Finally, a novel true-false-harmonized class probability (TFCP) strategy is designed to assess and adaptively adjust the prediction confidence of each modality for AD diagnosis. We evaluate TMM using the AHBA database with brain-wide transcriptomics data and the ADNI database with three imaging modalities (AV45-PET, FDG-PET, and VBM-MRI). The results demonstrate the superiority of our method in identifying AD, EMCI, and LMCI compared to state-of-the-art methods. Code and data are available at https://github.com/Yaolab-fantastic/TMM.

Index Terms:
Brain transcriptomics, multimodal imaging, trustworthy learning, cross-modal attention, Alzheimer’s disease

I Introduction

Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that predominantly affects the elderly, characterized by the gradual deterioration of cognitive functions, such as memory, reasoning, and decision-making [1, 2]. Studies have demonstrated that the pathological changes associated with AD begin to manifest years or even decades before the earliest clinical symptoms are observed [3, 4].

Recent advances in acquiring biomedical data, coupled with the rapid advancements in machine learning and deep learning techniques, have significantly enhanced the development of computer-aided diagnosis of AD. The multifactorial nature of AD highlights the need for multi-modal learning, where numerous algorithms have been developed to capture and integrate complementary information from various modalities. Existing multimodal methods for AD prediction primarily focus on using brain imaging data [5, 6, 7, 8, 9]. For example, Qiu et al. [7] developed a local-aware convolution to capture intra-modality associations and employed a self-adaptive Transformer (SAT) to learn inter-modality global relationships. Zhang et al. [9] proposed an attention-based fusion framework to selectively extract features from MRI and PET branches.

Although imaging modalities effectively quantify the structural and functional alterations in the brain, they do not account for the molecular mechanisms underlying these changes. A few studies have started to incorporate genotyping data to enhance model performance [10, 11, 12, 13]. For example, Zhou et al. [10] proposed an attentive deep canonical correlation analysis to integrate imaging modalities with genetic SNP data for diagnostic purposes. While genotypic data can characterize individuals, this upstream information does not directly explain cellular function and pathological states. Compared to genetics, downstream transcriptomics provides more direct insights by reflecting the actual activity and expression of genes within specific tissue contexts. This motivates us to incorporate brain-specific transcriptomics with imaging data into AD prediction.

Besides improving representation learning, there has been a growing emphasis on developing various fusion strategies to capture the complementary information present across modalities [14, 15, 16, 17]. For example, Song et al. [16] developed functional and structural graphs using fMRI and DTI images, and combined the multimodal information into edges through an efficient calibration mechanism. Bi et al. [17] developed a community graph convolutional neural network to model brain region-gene interactions, using an affinity aggregation model to enhance interpretability and classification performance in AD diagnosis. Although effective, existing multimodal algorithms often lack a dynamic perception of the informativeness of each modality for different samples, which could otherwise enhance the trustworthiness, stability, and explainability of these methods. Evidence has shown that the informativeness of a modality typically varies across different samples [18, 19], which motivates efforts to model modality informativeness to enhance fusion effectiveness [20].

To address the aforementioned issues, we propose a trustworthy enhanced multi-view multi-modal graph network (TMM) that integrates brain-wide transcriptomics knowledge and multi-modal radiomics data for predicting AD in an adaptive and trustworthy manner. In this context, multi-view refers to the incorporation of both transcriptomics and radiomics data, while multi-modal pertains to including different imaging modalities. As shown in Figure 1, we begin by constructing view-specific region of interest (ROI) co-function networks (RRIs) from brain-wide gene expression data and brain imaging data, to capture the molecular, structural, and functional relationships underlying different brain regions. Graph attention networks (GAT) are then applied to generate sample-wise embeddings, followed by cross-attention to fuse multi-view representations. In the fusion stage, we propose a novel true-false-harmonized class probability (TFCP) criterion to measure modality confidence, facilitating the adaptive perception of and response to variations in modality informativeness. Our main contributions can be summarized as follows:

  • We propose a multi-view, multi-modal learning framework for AD prediction that integrates brain-specific transcriptomics and brain imaging data. This approach models the co-functional relationships of ROIs by combining insights from upstream molecular activities and downstream anatomical and functional characteristics.

  • We propose a novel modality confidence learning strategy that estimates the harmonized true and false class probability to dynamically assess the modality informativeness. This design enables a more robust and trustworthy multimodal prediction.

  • Our proposed model significantly outperforms state-of-the-art methods in predicting AD, LMCI, and EMCI. A series of ablation studies robustly validate the effectiveness of the proposed TMM framework. Furthermore, we identify important ROIs and demonstrate their functional roles in brain cognition.

II Related work

Multimodal AD prediction. Existing multimodal methods for AD typically utilize brain imaging data (e.g., MRI, PET) because of their easy accessibility and early diagnosis potential. Common deep-learning architectures have been widely applied to learn embeddings of each individual modality, for example, convolutional neural network (CNN) [21], recurrent neural network (RNN) [22], autoencoder (AE) [23], generative adversarial network (GAN) [24] and transformer [25]. Most recently, graph neural networks (GNNs) and their variants have gained significant attention thanks to their natural advantages in capturing interaction patterns between ROIs and their adaptability to different data types [26].

Besides imaging, genetics data have also been incorporated to enhance prediction performance [17, 13, 10]; however, genetic variants do not directly contribute to the pathogenesis of AD but instead influence the disease through downstream functional elements (e.g., gene expression). Although high-throughput sequencing techniques have greatly accelerated the acquisition of omics data, their application in brain disorders is challenging due to the invasive nature of sample collection. The scarcity of sample-level brain transcriptomics data inspires us to explore using valuable brain-wide transcriptomics data acquired from healthy individuals as prior knowledge for multimodal AD prediction.

Confidence learning. Although various integration strategies have been proposed (e.g., early [27], intermediate [28], late [29], and hybrid [30] fusion), adapting to modal discrepancies remains challenging, particularly for heterogeneous and imbalanced multimodal data. By assessing and incorporating the predictive uncertainty of each modality into the fusion process, confidence learning has shown strong performance in multimodal prediction. Bayesian approaches introduce probability distributions over the model parameters to estimate uncertainty [31]. Ensemble methods train multiple models independently and use prediction variance as uncertainty [32]. Dropout-based methods, such as MC Dropout [33], leverage dropout regularization to obtain multiple predictions. Confidence calibration methods [34] aim to align the predicted probabilities with the true probabilities of correctness, thereby improving the reliability of uncertainty estimates. Han et al. [20] introduced the true class probability criterion to assess the uncertainty of each modality. Luo et al. [35] estimated the parameters of the Dirichlet distribution based on the prediction of each modality. Zheng et al. [36] assessed the decision confidence of each modality by constraining the orthogonality between each pair of predicted classes. In this paper, we design a novel confidence calibration criterion to help estimate the informativeness of each modality for trustworthy integration.

Figure 1: Framework overview. (a) A transcriptomics-specific RRI network (T-RRI) is constructed from the brain-wide gene expression data, and a radiomics-specific RRI network (R-RRI) is derived from each imaging modality. (b) Three imaging modalities, including VBM, FDG, and AV45, are employed in our analysis. (c) The TMM architecture: (i) For each sample, modality-specific T-RRI and R-RRI networks are constructed by mapping ROI measurements to each node. Multi-level GAT is applied to capture interactions within ROIs for each view, generating view-specific representations, followed by cross-view attention to create multi-view embeddings. (ii) TFCP-based confidence networks are designed to estimate the prediction confidence of each modality. (iii) Cross-modal attention mechanisms fuse multiple modalities to deliver the final predictive output. (d-f) illustrate the details of cross-view graph fusion, the TFCP mechanism, and the cross-attention mechanisms, respectively. (g) The identified ROI biomarkers are annotated for their functional connection and co-activation.

III Methods

In the following sections, we denote matrices as boldface uppercase letters and vectors as boldface lowercase ones. Let $\mathbf{T}=\{\mathbf{t}_{1},\dots,\mathbf{t}_{d}\}\in\mathbb{R}^{n_{g}\times d}$ represent the brain-wide transcriptomics data, with $n_{g}$ denoting the number of genes and $d$ denoting the number of ROIs. Let $m\in[1,\dots,M]$ index the imaging modalities, let $\mathbf{X}^{m}=\{\mathbf{x}^{m}_{1},\dots,\mathbf{x}^{m}_{d}\}\in\mathbb{R}^{n\times d}$ represent the $m$-th imaging feature matrix, and let $\mathbf{y}=[y_{1},\dots,y_{n}]\in\mathbb{R}^{n}$ denote the label vector, where $n$ is the number of samples.

We now detail the proposed TMM framework. Initially, one transcriptomics-specific network (T-RRI) and $M$ radiomics-specific RRI networks (R-RRIs) are constructed from brain-wide gene expression data and each imaging modality dataset, respectively. For each imaging modality, ROI-level imaging measurements are assigned as node features to both the T-RRI and R-RRI networks, generating graph representations from transcriptomic and imaging views. GATs are then applied to each RRI network to generate view-specific embeddings. These embeddings are integrated using cross-modal attention mechanisms to form multi-view representations. Subsequently, a confidence learning network is specifically designed to enhance the reliability of the embeddings, which are further integrated using cross-modal attention for final prediction. The following sections provide further details on the construction of RRI networks, multi-view representation and fusion, trustworthy evaluation, and multi-modal integration. The overall architecture and important modules are illustrated in Figure 1.

III-A Sample-wise Co-functional RRI Network Construction

As shown in Figure 1(a), two types of RRI networks (T-RRI and R-RRI) are constructed from transcriptomics and imaging data, respectively, to model the co-functionality of ROIs and facilitate graph-based network implementation. Given the brain-wide gene expression matrix $\mathbf{T}\in\mathbb{R}^{n_{g}\times d}$, we derive the edge matrix $\mathbf{E}^{t}=\{e_{ij}^{t}\}\in\mathbb{R}^{d\times d}$ by thresholding the adjacency matrix, which measures the correlation coefficient for each pair of ROIs across all expressed genes:

$$e_{ij}^{t}=\begin{cases}1 & \text{if } r(\mathbf{t}_{i},\mathbf{t}_{j})\geq\lambda_{t}\\ 0 & \text{otherwise}\end{cases}, \tag{1}$$

where $r(\mathbf{t}_{i},\mathbf{t}_{j})=\frac{\sum_{k=1}^{n_{g}}(\mathbf{t}_{ik}-\overline{\mathbf{t}_{i}})\cdot(\mathbf{t}_{jk}-\overline{\mathbf{t}_{j}})}{\sqrt{\sum_{k=1}^{n_{g}}(\mathbf{t}_{ik}-\overline{\mathbf{t}_{i}})^{2}}\cdot\sqrt{\sum_{k=1}^{n_{g}}(\mathbf{t}_{jk}-\overline{\mathbf{t}_{j}})^{2}}}$ computes the Pearson correlation coefficient (PCC) of the expressions of ROIs $\mathbf{t}_{i}$ and $\mathbf{t}_{j}$, and $\lambda_{t}$ is a hyperparameter for thresholding.

We apply the same strategy to each imaging modality, obtaining $M$ imaging-specific edge matrices $\mathbf{E}^{m}=\{e_{ij}^{m}\}\in\mathbb{R}^{d\times d}$ ($m\in[1,\dots,M]$), with each element $e^{m}_{ij}$ representing the thresholded PCC (by $\lambda_{r}$) between imaging measures of each ROI pair across all $n$ samples. Across experiments, we set the thresholds to $\lambda_{t}=0.2$ for transcriptomics-based networks and $\lambda_{r}=0.1$ for radiomics-based networks.

Afterward, for each imaging modality, we devise one T-RRI network and one R-RRI network for each subject by assigning ROI-level imaging measurements to their corresponding nodes as features. Consequently, each subject is associated with $M$ T-RRI graphs and $M$ R-RRI graphs. Notably, the $M$ T-RRI graphs share the same edge matrix but differ in node features, while the $M$ R-RRI graphs each have unique edge matrices and distinct node features.
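The construction in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function name `rri_edges`, the toy sizes, and the random stand-in data are assumptions, and only the thresholds (0.2 and 0.1) come from the text. Because the PCC of an ROI with itself is 1, Eq. (1) as written also sets the diagonal to 1, so every node carries a self-loop.

```python
import numpy as np

def rri_edges(features: np.ndarray, threshold: float) -> np.ndarray:
    """Binary ROI-ROI edge matrix from an (observations x ROIs) matrix.

    Columns are ROIs; rows are genes (T-RRI) or subjects (R-RRI).
    """
    # np.corrcoef treats rows as variables, so ROIs must be passed as rows.
    pcc = np.corrcoef(features.T)               # shape (d, d)
    return (pcc >= threshold).astype(np.int8)   # thresholding as in Eq. (1)

rng = np.random.default_rng(0)
n_g, n, d = 500, 40, 6                  # toy sizes, not the paper's
T = rng.normal(size=(n_g, d))           # stand-in for brain-wide gene expression
X_m = rng.normal(size=(n, d))           # stand-in for one imaging modality

E_t = rri_edges(T, threshold=0.2)       # lambda_t: shared by all T-RRI graphs
E_m = rri_edges(X_m, threshold=0.1)     # lambda_r: specific to modality m

# Per-subject graphs for modality m: same node features, different edges.
subject = 0
node_features = X_m[subject]            # one imaging measurement per ROI
G_t = (node_features, E_t)              # T-RRI graph of this subject
G_r = (node_features, E_m)              # R-RRI graph of this subject
```

Repeating the last step for each of the $M$ modalities gives the $M$ T-RRI graphs (shared `E_t`) and $M$ R-RRI graphs (modality-specific `E_m`) per subject.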

III-B Multi-view Fusion for Modal-Specific Representation

View-specific representation. Given their ability to effectively aggregate neighboring relationships within graph data, we utilize GATs for multi-view sample representation learning. Specifically, GATs are applied to each T-RRI and R-RRI graph, yielding sample representations from both the transcriptomic and the imaging views. Taking the $m$-th modality as an illustration, we use $\mathcal{G}^{m}_{t,0}=(\mathbf{X}^{m},\mathbf{E}^{t})$ and $\mathcal{G}^{m}_{r,0}=(\mathbf{X}^{m},\mathbf{E}^{m})$ as the input T-RRI and R-RRI graphs and stack graph attention layers with multi-head attention to build the GAT for each graph (as shown in Figure 1(d)). Each layer is defined as:

$$\boldsymbol{h}_{u}^{\prime}=\Big\Vert_{k=1}^{K}\sigma\left(\sum_{v\in\mathcal{N}_{u}}\alpha_{uv}^{k}\mathbf{W}^{k}\boldsymbol{h}_{v}\right), \tag{2}$$

where $\Vert$ denotes concatenation of the $K$ heads, $\boldsymbol{h}_{v}$ is the input feature vector of node $v$, $\alpha_{uv}^{k}$ is the normalized attention coefficient of the $k$-th head, $\mathbf{W}^{k}$ is the weight matrix of head $k$, $\mathcal{N}_{u}$ is the set of first-order neighbors of node $u$, and $\sigma(\cdot)$ is a nonlinear activation function. The coefficient $\alpha_{uv}$ is calculated from the attention mechanism $\alpha_{uv}=\frac{\exp(a(\boldsymbol{h}_{u},\boldsymbol{h}_{v}))}{\sum_{v^{\prime}\in\mathcal{N}_{u}}\exp(a(\boldsymbol{h}_{u},\boldsymbol{h}_{v^{\prime}}))}$, with $a(\boldsymbol{h}_{u},\boldsymbol{h}_{v})$ representing the importance of $v$ to $u$.
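To make Eq. (2) concrete, the following NumPy sketch implements one multi-head attention layer. The text does not specify the scoring function $a(\cdot,\cdot)$, so the sketch assumes the common GAT choice $a(\boldsymbol{h}_{u},\boldsymbol{h}_{v})=\text{LeakyReLU}(\mathbf{a}^{\top}[\mathbf{W}\boldsymbol{h}_{u}\,\Vert\,\mathbf{W}\boldsymbol{h}_{v}])$ and uses `tanh` as a placeholder for $\sigma$. The function names, shapes, and toy inputs are illustrative, and the edge matrix is assumed to contain self-loops so that every node has at least one neighbor.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, E, W, a, activation=np.tanh):
    """One multi-head graph attention layer in the form of Eq. (2).

    H: (d, f_in) node features; E: (d, d) binary edges with self-loops;
    W: (K, f_in, f_out) per-head weights; a: (K, 2 * f_out) attention vectors.
    Returns (d, K * f_out): the K head outputs concatenated.
    """
    heads = []
    for W_k, a_k in zip(W, a):
        Wh = H @ W_k                                  # (d, f_out)
        f_out = Wh.shape[1]
        # Assumed scoring: score[u, v] = LeakyReLU(a^T [Wh_u || Wh_v]).
        src = Wh @ a_k[:f_out]                        # contribution of node u
        dst = Wh @ a_k[f_out:]                        # contribution of node v
        scores = leaky_relu(src[:, None] + dst[None, :])
        scores = np.where(E > 0, scores, -np.inf)     # keep first-order neighbors only
        scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
        alpha = np.exp(scores)
        alpha /= alpha.sum(axis=1, keepdims=True)     # each row sums to 1 over N_u
        heads.append(activation(alpha @ Wh))
    return np.concatenate(heads, axis=1)

rng = np.random.default_rng(0)
d, f_in, f_out, K = 5, 1, 4, 2          # one imaging measurement per ROI node
H = rng.normal(size=(d, f_in))
E = np.eye(d, dtype=int)                # self-loops only ...
E[0, 1] = E[1, 0] = 1                   # ... plus a single edge between ROI 0 and ROI 1
W = rng.normal(size=(K, f_in, f_out))
a = rng.normal(size=(K, 2 * f_out))
H_next = gat_layer(H, E, W, a)          # shape (5, 8)
```

A node whose only neighbor is itself receives attention weight 1 on its own features, so its output reduces to $\sigma(\mathbf{W}^{k}\boldsymbol{h}_{u})$ per head.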

We further adopted a multi-level strategy to enhance information aggregation between brain ROIs. Specifically, higher-level graphs $\mathcal{G}^{m}_{t,1}$ and $\mathcal{G}^{m}_{r,1}$ are generated by applying GAT to the initial graphs $\mathcal{G}^{m}_{t,0}$ and $\mathcal{G}^{m}_{r,0}$. Then, we derived the next-level graphs, $\mathcal{G}^{m}_{t,2}$ and $\mathcal{G}^{m}_{r,2}$, from $\mathcal{G}^{m}_{t,1}$ and $\mathcal{G}^{m}_{r,1}$, respectively. For each modality, graph embeddings from the three levels are concatenated to produce the view-specific representations $\mathbf{F}^{m}_{t}$ and $\mathbf{F}^{m}_{r}$.

Multi-view fusion. Within each modality, we introduce cross-view attention mechanisms to integrate representations from the T-RRI and R-RRI networks (as shown in Figure 1(d)), enabling them to complement and enhance each other:

$$\begin{aligned}
\mathbf{Z}^{m}_{t} &= \text{Att}\Big(\mathbf{F}^{m}_{t}\mathbf{W}^{Q}_{t},\ \mathbf{F}^{m}_{r}\mathbf{W}^{K}_{r},\ \mathbf{F}^{m}_{r}\mathbf{W}^{V}_{r}\Big),\\
\mathbf{Z}^{m}_{r} &= \text{Att}\Big(\mathbf{F}^{m}_{r}\mathbf{W}^{Q}_{r},\ \mathbf{F}^{m}_{t}\mathbf{W}^{K}_{t},\ \mathbf{F}^{m}_{t}\mathbf{W}^{V}_{t}\Big),\\
\mathbf{Z}^{m} &= \mathbf{Z}^{m}_{t}\,\Big\Vert\,\mathbf{Z}^{m}_{r},
\end{aligned} \tag{3}$$

where $\mathbf{W}^{Q}$, $\mathbf{W}^{K}$, and $\mathbf{W}^{V}$ are parameters to learn, and $\cdot\,\Vert\,\cdot$ denotes concatenation.
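Eq. (3) can be illustrated with a short NumPy sketch. The exact form of $\text{Att}$ is not given here, so the sketch assumes standard scaled dot-product attention, $\text{Att}(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{softmax}(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d_{k}})\mathbf{V}$, and treats each row of $\mathbf{F}^{m}_{t}$ and $\mathbf{F}^{m}_{r}$ as one ROI embedding. The parameter names, shapes, and random inputs are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def att(Q, K, V):
    """Scaled dot-product attention (assumed form of Att in Eq. (3))."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def cross_view_fusion(F_t, F_r, params):
    # Transcriptomic view queries the radiomic view ...
    Z_t = att(F_t @ params["WQ_t"], F_r @ params["WK_r"], F_r @ params["WV_r"])
    # ... and the radiomic view queries the transcriptomic view.
    Z_r = att(F_r @ params["WQ_r"], F_t @ params["WK_t"], F_t @ params["WV_t"])
    return np.concatenate([Z_t, Z_r], axis=-1)      # Z^m = Z_t^m || Z_r^m

rng = np.random.default_rng(0)
d, f, h = 6, 12, 8      # ROIs, embedding width, attention width (toy values)
F_t, F_r = rng.normal(size=(d, f)), rng.normal(size=(d, f))
names = ["WQ_t", "WK_t", "WV_t", "WQ_r", "WK_r", "WV_r"]
params = {name: rng.normal(size=(f, h)) * 0.1 for name in names}
Z = cross_view_fusion(F_t, F_r, params)             # shape (6, 16)
```

Each view supplies the queries for one attention call and the keys and values for the other, which is what lets the two representations complement each other before concatenation.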

In addition to fusing transcriptomics and radiomics information, we also train a modal-specific classifier to incorporate within-modal information into prediction. In particular, for the $m$-th modality, a GAT classifier is trained as:

$$\mathcal{L}_{\text{GAT}}^{m}=\sum_{i=1}^{n}\mathcal{L}_{\text{CE}}(\hat{y}_{i}^{m},y_{i}), \tag{4}$$

where $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss, $n$ is the number of training samples, $y_{i}$ is the true label, and $\hat{y}_{i}^{m}$ is the prediction from $\mathbf{Z}^{m}$.
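As a small worked example of Eq. (4), the sketch below sums the cross-entropy over samples. It assumes the classifier head emits unnormalized logits per class that are turned into probabilities by a softmax; the function name `gat_loss` and the toy logits are illustrative only.

```python
import numpy as np

def gat_loss(logits, labels):
    """Summed cross-entropy over n samples, as in Eq. (4).

    logits: (n, C) unnormalized class scores; labels: (n,) integer classes.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)           # stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].sum()

logits = np.array([[2.0, 0.0],      # favors class 0 (correct)
                   [0.0, 0.0]])     # undecided between the two classes
labels = np.array([0, 1])
loss = gat_loss(logits, labels)     # log(1 + e^-2) + log(2)
```

The confident, correct first sample contributes $\log(1+e^{-2})$, while the undecided second sample contributes the larger $\log 2$.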

III-C Modality-level Confidence: True-False-Harmonized Class Probability

In multimodal learning, data heterogeneity is a common challenge, with the discriminative capability of each modality varying across different samples. Therefore, it is crucial to develop algorithms that can adaptively perceive and respond to these variations in informativeness. By transforming the assessment of modal informativeness into an evaluation of the confidence levels associated with modal classification performance, we can effectively estimate the modality informativeness through its prediction confidence. Here, we propose a novel TFCP criterion to approximate the prediction confidence of each modality.

The confidence criterion measures how confident the model is in its predictions by correlating high prediction certainty with greater values and vice versa. For a classifier $f:\mathbf{x}_{i}\rightarrow y_{i}$, the traditional confidence criterion utilizes the maximum class probability (MCP): $\text{MCP}(\mathbf{x}_{i})=P(\widehat{y_{i}}|\mathbf{x}_{i})$, where $\widehat{y_{i}}$ is the class with the largest softmax probability. However, this criterion assigns high confidence (i.e., high softmax probability) to both correct and incorrect predictions, resulting in overconfidence for failure predictions. The true class probability (TCP) criterion [37] has been proposed to calibrate incorrect predictions. Instead of assigning the maximum class probability, TCP uses the softmax probability of the true class $y_{i}^{*}$ as prediction confidence: $\text{TCP}(\mathbf{x}_{i})=P(y_{i}^{*}|\mathbf{x}_{i})$, where $y_{i}^{*}$ is the true class (i.e., the actual label). The TCP strategy exclusively models the certainty of the true class while neglecting the uncertainty of the untrue classes, which could result in a biased and unstable confidence approximation.
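The difference between MCP and TCP is easiest to see on a misclassified sample. The sketch below uses made-up softmax outputs for a binary task; the helper names are illustrative only.

```python
def mcp(probs):
    """Maximum class probability: the softmax mass of the predicted class."""
    return max(probs)

def tcp(probs, true_label):
    """True class probability: the softmax mass of the ground-truth class."""
    return probs[true_label]

# Hypothetical softmax outputs; the true label is class 1 in both cases.
correct = [0.10, 0.90]  # predicted class 1, which is correct
wrong = [0.85, 0.15]    # predicted class 0, which is incorrect

mcp_wrong = mcp(wrong)     # stays high although the prediction fails
tcp_wrong = tcp(wrong, 1)  # drops, flagging the failure
```

For the correct prediction the two criteria coincide, while for the failed prediction MCP remains high and TCP is low, which is the overconfidence issue described above.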

To this end, we propose considering both true and false class probabilities to aggregate evidence from these two perspectives. Specifically, we design a harmonized criterion for approximating the prediction confidence:

$$\begin{aligned}\text{TFCP}(\mathbf{x}_{i}) &= \frac{2}{1/\text{TCP}(\mathbf{x}_{i})+1/(1-\text{FCP}(\mathbf{x}_{i}))} \\ &= \frac{2}{1/P(y_{i}^{*}|\mathbf{x}_{i})+1/(1-P(\overline{y_{i}^{*}}|\mathbf{x}_{i}))},\end{aligned} \tag{5}$$

where $\text{FCP}(\mathbf{x}_{i})=P(\overline{y_{i}^{*}}|\mathbf{x}_{i})$ is the softmax probability of the false (or untrue) class.
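As a numerical illustration of Eq. (5), the following sketch computes the harmonic mean of a certainty score and one minus an uncertainty score. The inputs are treated as two separately estimated scores, so they are not forced to sum to one; this treatment, the `eps` clamp, and the function name are assumptions made for the example.

```python
def tfcp(tcp_score, fcp_score, eps=1e-8):
    """Harmonic mean of TCP and (1 - FCP), following the form of Eq. (5)."""
    certainty = max(tcp_score, eps)          # clamp to avoid division by zero (assumption)
    non_failure = max(1.0 - fcp_score, eps)  # evidence against the false class
    return 2.0 / (1.0 / certainty + 1.0 / non_failure)

consistent = tfcp(0.9, 0.1)   # both views agree, so the score stays at 0.9
conflicting = tfcp(0.9, 0.4)  # high false-class score pulls the confidence down
```

The harmonic mean is dominated by the smaller of its two arguments, so a sample is rated as confident only when the true-class evidence is high and the false-class evidence is low.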

It should be noted that neither the TCP nor the FCP can be directly applied to test samples because both require label information. To address this issue, we introduce two confidence networks to approximate the TCP and FCP, respectively. As shown in Figure 1(e), a TCP confidence network with parameters $\theta_{T}$ and an FCP confidence network with parameters $\theta_{F}$ are constructed based on the sample representations to generate certainty and uncertainty scores. These networks are trained to minimize the discrepancy between predicted and actual scores:

$$\mathcal{L}_{\mathrm{Conf}}^{m}=\left\|\text{TFCP}(\mathbf{Z}^{m})-\widehat{\text{TFCP}}(\mathbf{Z}^{m},\theta^{m}_{T},\theta^{m}_{F})\right\|^{2}+\mathcal{L}_{\text{Cls}}^{m}. \tag{6}$$

Here, both the TCP and FCP networks are built upon a classification network trained using the cross-entropy loss $\mathcal{L}_{\text{Cls}}^{m}$.
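A minimal sketch of the confidence objective in Eq. (6) is given below, assuming that the squared norm is the sum of squared per-sample differences and that the estimated TFCP is obtained by plugging the two network outputs into the harmonic form of Eq. (5). The function names and the scalar `cls_loss` argument are hypothetical.

```python
def harmonic_confidence(tcp_score, fcp_score, eps=1e-8):
    """Harmonic mean of TCP and (1 - FCP); eps clamp is an illustrative assumption."""
    certainty = max(tcp_score, eps)
    non_failure = max(1.0 - fcp_score, eps)
    return 2.0 / (1.0 / certainty + 1.0 / non_failure)

def confidence_loss(true_tcp, true_fcp, pred_tcp, pred_fcp, cls_loss):
    """Squared error between target and estimated TFCP plus the classification loss."""
    squared_error = 0.0
    for t, f, t_hat, f_hat in zip(true_tcp, true_fcp, pred_tcp, pred_fcp):
        target = harmonic_confidence(t, f)            # computed from labels during training
        estimate = harmonic_confidence(t_hat, f_hat)  # computed from the two confidence networks
        squared_error += (target - estimate) ** 2
    return squared_error + cls_loss

# One sample whose estimated false-class score is too high (made-up values).
loss_value = confidence_loss([0.9], [0.1], [0.9], [0.4], cls_loss=0.5)
```

At test time only the estimated scores are available, which is why the networks are fitted to the label-dependent targets during training.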

III-D Multi-modal Fusion: Cross-modal Attention

The previous sections yield a multi-view sample representation $\mathbf{Z}^{m}$ and an estimated modality confidence $\widehat{\text{TFCP}}^{m}$ for each modality. By incorporating the modality confidence into the corresponding representation, we obtain a trustworthy representation for each modality:

$$\mathbf{H}^{m}=\widehat{\text{TFCP}}^{m}\cdot\mathbf{Z}^{m}. \tag{7}$$

Now, we employ cross-modal attention mechanisms to enhance each modality by leveraging insights from others, subsequently concatenating the enriched modal representations for final prediction:

$$\begin{aligned}\mathbf{U}^{m} &= \big\|_{j=1,j\neq m}^{M}\text{Att}\Big(\mathbf{H}^{m}\mathbf{W}^{Q}_{m},\mathbf{H}^{j}\mathbf{W}^{K}_{j},\mathbf{H}^{j}\mathbf{W}^{V}_{j}\Big), \\ \mathbf{U} &= \big\|_{m=1}^{M}\mathbf{U}^{m}.\end{aligned} \tag{8}$$
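The confidence weighting and cross-modal fusion steps can be sketched in plain Python as follows. Several points are assumptions rather than details given in the text: Att is taken to be standard scaled dot-product attention, each row of a representation matrix is one sample, the confidence is applied per sample, and all helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(matrix):
    """Row-wise, numerically stable softmax."""
    out = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Scaled dot-product attention (assumed form of Att)."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def trust_weight(z, conf):
    """Scale each sample's representation by its estimated confidence, as in Eq. (7)."""
    return [[c * x for x in row] for row, c in zip(z, conf)]

def cross_modal_fuse(h_list, w_q, w_k, w_v):
    """Each modality queries every other modality; outputs are concatenated, as in Eq. (8)."""
    fused = None
    for m, h_m in enumerate(h_list):
        for j, h_j in enumerate(h_list):
            if j == m:
                continue  # skip self-attention, matching the j != m index
            out = attention(matmul(h_m, w_q[m]), matmul(h_j, w_k[j]), matmul(h_j, w_v[j]))
            fused = out if fused is None else [a + b for a, b in zip(fused, out)]
    return fused
```

With $M$ modalities and value dimension $d$, the concatenated output has $M(M-1)d$ columns per sample under these assumptions.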

III-E Objective optimization

The overall loss is composed of the modality-specific classification loss (Eq. 4), the TFCP loss (Eq. 6), and the cross-entropy loss of the final classification:

$$\mathcal{L}=\eta_{1}\sum_{m=1}^{M}\mathcal{L}_{\text{GAT}}^{m}+\eta_{2}\sum_{m=1}^{M}\mathcal{L}_{\text{Conf}}^{m}+\mathcal{L}_{\text{Final}}, \tag{9}$$

where $\eta_{1}$ and $\eta_{2}$ denote hyperparameters for balancing the different losses. We set $\eta_{1}=1$ and $\eta_{2}=1$ across our experiments.
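For concreteness, the combination in Eq. (9) amounts to the small helper below. The function name and the toy loss values are hypothetical; the defaults follow the stated setting of both weights equal to 1.

```python
def total_loss(gat_losses, conf_losses, final_loss, eta1=1.0, eta2=1.0):
    """Weighted sum of per-modality losses plus the final classification loss."""
    return eta1 * sum(gat_losses) + eta2 * sum(conf_losses) + final_loss

# Made-up per-modality loss values for three imaging modalities.
example_total = total_loss([0.5, 0.25, 0.25], [0.1, 0.1, 0.3], final_loss=0.5)
```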

IV Experiments and Results

IV-A Dataset

AHBA dataset. Brain-wide transcriptomics data are sourced from the Allen Human Brain Atlas (AHBA, human.brain-map.org), including over 58k probes sampled across 3,702 brain locations from six donors. We use the abagen toolbox [39] for data preprocessing and employ the AAL atlas to map brain locations to ROIs, resulting in 15,633 genes expressed across 116 ROIs. To ensure that the prior knowledge introduced closely aligns with the underlying mechanisms of AD, we further filter genes based on a large-scale AD meta-GWAS from the IGAP [38]. Specifically, we use MAGMA [40] to derive gene-level $p$-values from SNP-level GWAS results and keep nominally significant genes (i.e., $p<0.05$). This process yields a total of 1,216 genes across 116 ROIs for constructing the T-RRI edge matrix.
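The nominal-significance filter described above can be sketched as follows. The gene identifiers and p-values are placeholders, and reading the actual MAGMA output is outside the scope of this example.

```python
def filter_genes(gene_pvalues, alpha=0.05):
    """Keep genes whose gene-level p-value is strictly below alpha."""
    return sorted(gene for gene, p in gene_pvalues.items() if p < alpha)

# Placeholder gene-level p-values; real values would come from the gene-based analysis.
toy_pvalues = {"GENE_A": 0.001, "GENE_B": 0.20, "GENE_C": 0.049, "GENE_D": 0.05}
kept_genes = filter_genes(toy_pvalues)
```

The retained gene list would then be used to subset the columns of the ROI-by-gene expression matrix before computing regional co-expression.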

TABLE I: Demographic characteristics of the ADNI subjects.
Subjects NC EMCI LMCI AD
Number 221 275 190 165
Sex(M/F) 114/107 155/120 111/79 99/66
Age 76.34±6.58 71.50±7.12 74.08±8.55 75.35±7.88
Education 16.40±2.66 16.09±2.61 16.36±2.79 15.87±2.72
NC: normal control; EMCI: early mild cognitive impairment; LMCI: late mild cognitive impairment; AD: Alzheimer’s disease.

ADNI dataset. We conduct experimental validation using subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI, https://adni.loni.usc.edu/). Multimodal imaging data, including AV45-PET, FDG-PET, and VBM-MRI, along with corresponding diagnosis labels (NC, EMCI, LMCI, and AD) for each visit, are collected from 851 ADNI participants. Each scan is processed according to established protocols [4], and measurements for 116 ROIs are derived from each modality using the AAL atlas. Detailed demographic characteristics are listed in Table I.

Benchmark methods. We compare TMM with nine multimodal competitors, including four single-modal classifiers with early fusion (SVM, Lasso, XGBoost, and fully connected NN), three models with intermediate concatenation (GRridge [41], BSPLSDA [42], and GMU [43]), and two methods with advanced representation and fusion designs (Mogonet [44] and Dynamics [20]).

TABLE II: The comparison results on four ADNI classification tasks.
Method NC vs. AD NC vs. LMCI NC vs. EMCI EMCI vs. LMCI
ACC F1 AUC ACC F1 AUC ACC F1 AUC ACC F1 AUC
SVM 90.2±2.3 89.8±1.2 94.1±4.2 65.4±2.3 63.5±2.5 72.1±3.3 67.7±1.2 78.5±1.4 64.9±1.7 62.4±3.1 69.3±3.0 65.5±3.0
Lasso 91.2±3.0 89.8±3.6 96.6±1.4 67.7±4.6 65.6±5.5 76.7±3.1 66.5±3.4 76.2±2.6 66.7±3.2 65.4±3.2 70.7±3.8 68.0±2.0
XGBoost 91.2±1.4 89.7±1.6 96.8±0.8 70.1±1.2 67.3±1.9 76.0±1.5 69.9±3.5 80.3±2.3 61.7±2.7 64.5±1.9 71.6±1.9 67.0±3.0
NN 88.9±2.2 87.5±2.5 51.9±7.0 69.1±2.6 67.6±3.3 52.1±6.3 69.7±2.1 79.4±1.9 60.9±2.8 64.8±4.8 71.0±3.7 47.7±3.7
GRridge 80.6±2.7 78.2±3.3 88.3±2.8 62.7±2.8 61.9±4.0 68.6±3.4 58.1±2.0 67.7±2.7 57.7±2.7 60.4±3.4 65.7±2.9 62.7±2.9
BSPLSDA 83.0±1.6 66.4±2.2 87.8±2.7 63.5±1.6 63.7±1.9 71.8±2.5 58.3±1.7 67.7±2.3 60.2±2.8 61.4±2.7 69.3±3.0 62.5±2.6
GMU 91.3±2.6 90.2±3.1 95.6±2.8 70.9±2.4 68.4±1.5 77.2±1.7 68.4±2.7 78.3±2.8 58.6±2.7 68.5±2.3 64.8±3.3 74.0±2.2
Mogonet 95.3±1.3 95.9±1.7 91.4±1.5 78.6±2.1 77.0±1.7 82.7±2.2 72.5±1.9 69.6±2.7 80.1±2.2 74.8±3.0 78.4±3.4 78.1±3.5
Dynamics 96.5±0.6 95.7±1.1 96.4±0.8 80.4±1.6 78.9±1.6 84.1±1.8 75.8±1.6 70.5±2.3 78.3±1.7 76.9±2.6 79.7±2.7 77.6±2.2
Ours 98.0±1.1* 97.7±1.3* 98.3±0.5* 83.7±1.2* 81.3±1.3* 85.9±1.6 78.6±1.5* 84.4±2.9* 77.3±1.2 80.0±2.0* 82.7±3.3 78.7±2.6
Best results are in bold. Suboptimal (second-best) results are underlined. ‘*’ indicates that our TMM is significantly better ($p<0.05$) than the suboptimal method when using the independent t-test.
TABLE III: Ablation study of the key components in TMM.
T-RRI R-RRI NC vs. AD NC vs. LMCI NC vs. EMCI EMCI vs. LMCI
ACC F1 AUC ACC F1 AUC ACC F1 AUC ACC F1 AUC
98.0±1.1 97.7±1.3 98.3±0.5 83.7±1.2 81.3±1.3 85.9±1.6 78.6±1.5 84.4±2.9 77.3±1.2 80.1±1.7 82.4±2.6 78.5±2.3
97.2±1.2 96.3±1.8 98.0±1.2 82.0±1.3 79.6±2.1 83.4±2.7 76.4±2.2 82.1±3.2 75.6±2.5 77.8±2.5 80.6±2.9 76.3±2.8
96.7±1.4 95.9±2.1 97.2±1.4 81.4±1.6 78.2±1.4 83.6±1.8 77.3±1.4 82.8±2.7 76.8±1.6 78.3±2.3 81.8±2.4 77.5±2.0
95.6±2.6 94.8±2.2 96.3±2.7 79.8±2.4 76.3±2.9 81.7±2.6 75.4±2.3 80.6±2.5 75.4±1.7 76.3±2.7 79.3±3.2 76.7±2.3
Confidence NC vs. AD NC vs. LMCI NC vs. EMCI EMCI vs. LMCI
ACC F1 AUC ACC F1 AUC ACC F1 AUC ACC F1 AUC
TFCP 98.0±1.1 97.7±1.3 98.3±0.5 83.7±1.2 81.3±1.3 85.9±1.6 78.6±1.5 84.4±2.9 77.3±1.2 80.1±1.7 82.4±2.6 78.5±2.3
TCP 97.2±1.1 97.3±1.1 98.1±0.6 81.4±1.6 80.2±1.4 83.2±2.0 78.3±1.2 83.6±2.6 77.0±0.8 79.5±1.4 82.1±1.9 78.2±2.2
NN 96.6±1.6 96.0±2.0 97.3±1.6 79.5±2.3 78.5±2.6 81.5±2.7 77.4±1.5 82.2±2.7 76.3±1.4 77.4±2.4 80.1±2.8 75.2±2.6

Evaluation metrics. We employ accuracy (ACC), F1 score (F1), and the area under the receiver operating characteristic curve (AUC) to evaluate the performance of the methods. Five-fold cross-validation is performed, and the mean and standard deviation over folds are reported. An independent t-test is used to evaluate the significance of the improvements achieved by our method over state-of-the-art methods.
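The reporting protocol can be illustrated with the standard library alone. The fold scores below are made up for the example and are not results from this study; the use of the sample standard deviation and of a pooled-variance, two-sided test are assumptions, since the exact variant is not specified.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation over cross-validation folds."""
    return statistics.mean(scores), statistics.stdev(scores)

def pooled_t_statistic(a, b):
    """Independent two-sample t statistic with pooled variance."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / n_a + 1 / n_b))

# Made-up accuracies (%) over five folds for two hypothetical methods.
method_a = [82.0, 81.0, 83.0, 82.5, 81.5]
method_b = [80.0, 79.5, 81.0, 80.5, 79.0]

t_stat = pooled_t_statistic(method_a, method_b)
# 2.306 is the two-sided 5% critical value of Student's t with 8 degrees of freedom.
significant = abs(t_stat) > 2.306
```

With five folds per method the test has $5+5-2=8$ degrees of freedom, which is where the critical value in the last line comes from.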

Implementation details. We develop the model using PyTorch 2.1.0 and Adam as the optimizer. We train the model for 2,000 epochs with a learning rate of 1e-3 and weight decay of 1e-4. All experiments are implemented on an RTX 4090 GPU with 24GB of memory.

IV-B Comparison with the state-of-the-art

Four tasks, including NC vs. AD, NC vs. LMCI, NC vs. EMCI, and EMCI vs. LMCI, are designed to validate the performance of our proposed method. Since the transcriptomics knowledge is not specific to ADNI subjects and cannot be directly incorporated, we are limited to using only multimodal imaging data for the comparative methods. Table II shows the comparison results, where our model achieves the best result on every metric except AUC on NC vs. EMCI. In particular, TMM achieves significant improvements (t-test $p<0.05$) over the suboptimal results in most metrics. This highlights the advantages of integrating molecular knowledge through the multi-view graphs and the proposed trustworthy strategy.

IV-C Ablation studies

We conduct extensive ablation studies to evaluate the effectiveness of including the transcriptomics-derived RRI knowledge, the disease-specific R-RRI networks and the TFCP module. We accordingly remove the T-RRI network and three R-RRI networks, and replace the TFCP with TCP and a fully connected NN. The ablation experiments are performed on four tasks. Table III shows the results, from which we can observe that: 1) Both T-RRI and R-RRI individually improve prediction performance, demonstrating that transcriptomics data and imaging data each offer unique insights into disease pathology; 2) Integrating transcriptomics with imaging data provides the best prediction outcomes, showcasing that biomedical data from varied perspectives can deliver effective complementary insights for AD; 3) Compared to TCP, the proposed TFCP shows enhanced performance, and similarly, TCP outperforms NN, suggesting that combining certainty and uncertainty yields more robust confidence estimates.

TABLE IV: Top 15 imaging biomarkers identified for each task.
Task Modality Top 15 ROIs^a
NC vs. AD VBM Amygdala_L, Hippocampus_R
FDG Cingulum_Post_L, ParaHippocampal_L, ParaHippocampal_R
AV45 Hippocampus_L, Hippocampus_R, ParaHippocampal_L, Cerebelum_3_L, Cerebelum_3_R, ParaHippocampal_R, Cerebelum_7b_R, Cingulum_Ant_R, Temporal_Pole_Sup_L, Cerebelum_10_L
NC vs. LMCI VBM Hippocampus_L, Amygdala_L, Hippocampus_R, Temporal_Mid_R
FDG Cingulum_Post_L, Cingulum_Post_R, ParaHippocampal_L, Angular_L, Angular_R, Thalamus_L, Occipital_Inf_L, Frontal_Med_Orb_L
AV45 ParaHippocampal_L, Hippocampus_R, Postcentral_L
NC vs. EMCI VBM Cerebelum_10_L, Hippocampus_L, Hippocampus_R, Amygdala_L, Cerebelum_9_L
FDG Cingulum_Post_L, ParaHippocampal_L, ParaHippocampal_R, Hippocampus_L, Angular_L, Angular_R, Cingulum_Post_R, Amygdala_L, Occipital_Mid_L
AV45 Cerebelum_3_L
EMCI vs. LMCI VBM Thalamus_R, Supp_Motor_Area_L, Hippocampus_L, Hippocampus_R
FDG Cingulum_Post_L, Putamen_R, Putamen_L, Cingulum_Post_R, Pallidum_L, Vermis_7, Pallidum_R, Temporal_Pole_Mid_R, Cerebelum_8_L, Vermis_8, Vermis_9
a: The biomarkers are ranked by combining all three modalities, and the top 15 features for each classification task are listed.

IV-D Evaluations of modality contributions

To illustrate how multiple imaging modalities can offer complementary insights, we compared the classification performance using various combinations of imaging data sources on the NC vs. AD task. The results, displayed in Figure 2(a), reveal that: 1) different imaging modalities possess varying levels of discriminative capabilities, with VBM outperforming both FDG and AV45; 2) the integration of additional imaging modalities consistently yields better performance than using any subset alone, highlighting the unique contribution of each imaging modality and affirming the ability of our model in effectively modeling cross-modal information.
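The set of modality combinations evaluated in such a comparison can be enumerated as below. This is only an illustrative sketch; the helper name is hypothetical, and each subset would still need to be passed to a training and evaluation routine that is not shown.

```python
from itertools import combinations

MODALITIES = ("VBM", "FDG", "AV45")

def modality_subsets(modalities):
    """All non-empty subsets, ordered from single modalities to the full set."""
    subsets = []
    for size in range(1, len(modalities) + 1):
        subsets.extend(combinations(modalities, size))
    return subsets

all_subsets = modality_subsets(MODALITIES)
```

Three modalities give seven non-empty subsets: three single-modality settings, three pairs, and the full combination.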

Figure 2: Performance comparison of (a) different modality combinations and (b) different hyperparameters for RRI thresholding on NC vs. AD task.
Figure 3: Coactivation maps of top ROIs obtained from NC vs. AD task.
Figure 4: Transcriptomics-based (top panel, grey edges) and radiomics-based (bottom panel, green edges) connectivity of top ROIs derived from each modality.

IV-E Hyperparameter analysis

We evaluate the hyperparameters for thresholding the T-RRI and R-RRI networks. The parameters $\lambda_{t}$ and $\lambda_{r}$ are tuned by grid search on the NC vs. AD classification task over the values $[0.1,0.2,\cdots,0.6]$. Figure 2(b) shows the grid search results, from which the optimal combination of $\lambda_{t}=0.2$ and $\lambda_{r}=0.1$ is selected for our experiments. Variations in the thresholding parameters have minimal impact on performance, demonstrating the robustness of the RRI networks.
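The thresholding and grid search can be sketched as follows. The rule of keeping edges whose absolute weight is at least the threshold is an assumption about how the RRI networks are sparsified, and `toy_score` is a stand-in for the cross-validated performance that would be measured in practice.

```python
from itertools import product

def threshold_adjacency(weights, lam):
    """Binarize a weight matrix: keep off-diagonal edges with |weight| >= lam."""
    n = len(weights)
    return [[1 if i != j and abs(weights[i][j]) >= lam else 0 for j in range(n)]
            for i in range(n)]

def grid_search(score_fn, grid):
    """Return (best_score, lam_t, lam_r) over all threshold pairs in the grid."""
    best = None
    for lam_t, lam_r in product(grid, repeat=2):
        score = score_fn(lam_t, lam_r)
        if best is None or score > best[0]:
            best = (score, lam_t, lam_r)
    return best

GRID = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

def toy_score(lam_t, lam_r):
    """Stand-in for validation accuracy; peaks at (0.2, 0.1) by construction."""
    return 1.0 - abs(lam_t - 0.2) - abs(lam_r - 0.1)

best_score, best_lam_t, best_lam_r = grid_search(toy_score, GRID)
```

The grid contains 36 threshold pairs, so each pair requires one full training and validation run when a real scoring function is used.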

IV-F Identified biomarkers and functional analysis

Feature ablation experiments [45] are conducted to assess the importance of each imaging feature. Table IV lists the top 15 biomarkers identified for each task. The hippocampus ranks highly in both early and late prediction tasks, aligning well with its recognition as one of the first regions to be affected by AD [46, 47]. Among the top findings, several ROIs (e.g., left and right hippocampus, left posterior cingulum) are consistently present across various tasks, indicating their broad relevance. Some ROIs appear in most tasks (e.g., left amygdala, left parahippocampal gyrus), while some are specific to certain tasks (e.g., left thalamus in NC vs. LMCI, right thalamus in EMCI vs. LMCI). This variability suggests that different regions may be involved at different AD stages.
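A generic form of feature ablation can be sketched as follows: each feature is replaced by a baseline value in turn, and the resulting drop in accuracy is taken as its importance. The zero baseline, the accuracy-drop criterion, and the toy classifier are assumptions for illustration and may differ from the procedure actually used.

```python
def accuracy(predict, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(1 for x, y in zip(samples, labels) if predict(x) == y)
    return hits / len(labels)

def ablation_importance(predict, samples, labels, baseline_value=0.0):
    """Importance of feature k = accuracy drop when feature k is set to baseline_value."""
    full = accuracy(predict, samples, labels)
    drops = []
    for k in range(len(samples[0])):
        ablated = [x[:k] + [baseline_value] + x[k + 1:] for x in samples]
        drops.append(full - accuracy(predict, ablated, labels))
    return drops

def toy_predict(x):
    """Hypothetical classifier that only looks at feature 0."""
    return 1 if x[0] > 0.5 else 0

toy_x = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.9], [0.2, 0.1]]
toy_y = [1, 1, 0, 0]
importance = ablation_importance(toy_predict, toy_x, toy_y)
```

Ranking features by the resulting drops, pooled over the three imaging modalities, would give a top-15 list of the kind reported in Table IV.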

We employ Neurosynth (www.neurosynth.org) to perform meta-analytic coactivation analysis, associating the identified ROIs with cognitive functions derived from 3,489 published neuroimaging studies. The coactivation map of top ROIs, as reported from NC vs. AD (refer to Table IV), is illustrated in Figure 3. The co-activated regions are primarily located in the subcortex (e.g., cingulate gyrus) and prefrontal lobe, which are known to be implicated in AD [48, 49].

We further visualize the connectivity of top ROIs underlying transcriptomics and each imaging modality. Specifically, we select the top 15 ROIs from each modality, extract their pairwise PCCs, and use the BrainNet Viewer (https://www.nitrc.org/projects/bnv/) for visualization. For each modality, distinct interaction patterns are observed between the transcriptomics-derived and radiomics-derived networks, illustrating how different biological levels offer unique insights from varying perspectives. This is especially marked in AV45, corresponding with the significant role of AV45 top ROIs (in NC vs. AD, 10 out of 15 ROIs are from the AV45) in classification. Further investigation is warranted into the identified ROIs and their role in AD progression.
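The pairwise PCC extraction for a set of selected ROIs can be sketched as below. The ROI names and profiles are placeholders; in practice each profile would be an ROI's gene expression vector or its imaging measurements across subjects.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def pairwise_pcc(profiles, roi_names):
    """PCC for every unordered pair among the selected ROIs."""
    edges = {}
    for i, roi_a in enumerate(roi_names):
        for roi_b in roi_names[i + 1:]:
            edges[(roi_a, roi_b)] = pearson(profiles[roi_a], profiles[roi_b])
    return edges

# Placeholder profiles for three hypothetical ROIs.
toy_profiles = {
    "ROI_1": [1.0, 2.0, 3.0, 4.0],
    "ROI_2": [2.0, 4.0, 6.0, 8.0],
    "ROI_3": [4.0, 3.0, 2.0, 1.0],
}
toy_edges = pairwise_pcc(toy_profiles, ["ROI_1", "ROI_2", "ROI_3"])
```

The resulting edge dictionary can be converted to a symmetric matrix and exported as an edge file for a brain network viewer.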

V Conclusion

In this paper, we propose TMM, a trustworthy enhanced multi-view multi-modal framework for AD prediction. We integrate upstream molecular knowledge with downstream radiomics information to effectively model relationships between brain ROIs and enhance sample representations. The proposed TFCP criterion successfully perceives sample-wise modal informativeness, facilitating dynamic fusion across multiple modalities. Our experimental results across various AD classification tasks validate the efficacy of the proposed TMM. The identified biomarkers align with existing research and demonstrate relevance to AD progression.

References

  • [1] M. Jucker and L. C. Walker, “Alzheimer’s disease: from immunotherapy to immunoprevention,” Cell, vol. 186, no. 20, pp. 4260–4270, 2023.
  • [2] F. Gaubert and H. Chainay, “Decision-making competence in patients with alzheimer’s disease: A review of the literature,” Neuropsychology Review, vol. 31, no. 2, pp. 267–287, 2021.
  • [3] R. J. Bateman, T. M. Xiong et al., “Clinical and biomarker changes in dominantly inherited alzheimer’s disease,” New England Journal of Medicine, vol. 367, no. 9, pp. 795–804, 2012.
  • [4] X. Yao, S. Cong, J. Yan et al., “Regional imaging genetic enrichment analysis,” Bioinformatics, vol. 36, no. 8, pp. 2554–2560, 2020.
  • [5] S. Qiu, M. I. Miller, P. S. Joshi et al., “Multimodal deep learning for alzheimer’s disease dementia assessment,” Nature communications, vol. 13, no. 1, p. 3404, 2022.
  • [6] Z. Chen, Y. Liu, Y. Zhang, and Q. Li, “Orthogonal latent space learning with feature weighting and graph learning for multimodal alzheimer’s disease diagnosis,” Medical Image Analysis, vol. 84, p. 102698, 2023.
  • [7] Z. Qiu, P. Yang, C. Xiao, et al., “3d multimodal fusion network with disease-induced joint learning for early alzheimer’s disease diagnosis,” IEEE Transactions on Medical Imaging, pp. 1–1, 2024.
  • [8] T. Zhou, K.-H. Thung, M. Liu et al., “Multi-modal latent space inducing ensemble svm classifier for early dementia diagnosis with neuroimaging data,” Medical image analysis, vol. 60, p. 101630, 2020.
  • [9] T. Zhang and M. Shi, “Multi-modal neuroimaging feature fusion for diagnosis of alzheimer’s disease,” J Neurosci Methods, vol. 341, p. 108795, 2020.
  • [10] R. Zhou, H. Zhou, B. Y. Chen, L. Shen, Y. Zhang, and L. He, “Attentive deep canonical correlation analysis for diagnosing alzheimer’s disease using multimodal imaging genetics,” in Medical Image Computing and Computer Assisted Intervention – MICCAI, 2023, pp. 681–691.
  • [11] M. Wang, W. Shao et al., “Hypergraph-regularized multimodal learning by graph diffusion for imaging genetics based alzheimer’s disease diagnosis,” Medical Image Analysis, vol. 89, p. 102883, 2023.
  • [12] W. Ko, W. Jung, E. Jeon et al., “A deep generative–discriminative learning for multimodal representation in imaging genetics,” IEEE Transactions on Medical Imaging, vol. 41, no. 9, pp. 2348–2359, 2022.
  • [13] X. Bi, Y. Mao, S. Luo et al., “A novel generation adversarial network framework with characteristics aggregation and diffusion for brain disease classification and feature selection,” Briefings in Bioinformatics, vol. 23, no. 6, p. bbac454, 2022.
  • [14] Y. Liu, L. Fan, C. Zhang, T. Zhou, Z. Xiao, L. Geng, and D. Shen, “Incomplete multi-modal representation learning for alzheimer’s disease diagnosis,” Medical Image Analysis, vol. 69, p. 101953, 2021.
  • [15] Q. Zhu, B. Xu, J. Huang et al., “Deep multi-modal discriminative and interpretability network for alzheimer’s disease diagnosis,” IEEE Transactions on Medical Imaging, 2022.
  • [16] X. Song, F. Zhou et al., “Graph convolution network with similarity awareness and adaptive calibration for disease-induced deterioration prediction,” Medical Image Analysis, vol. 69, p. 101947, 2021.
  • [17] X. Bi, K. Chen, S. Jiang, S. Luo, W. Zhou, Z. Xing, L. Xu, Z. Liu, and T. Liu, “Community graph convolution neural network for alzheimer’s disease classification and pathogenetic factors identification,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [18] H. Hou, Q. Zheng, Y. Zhao, A. Pouget, and Y. Gu, “Neural correlates of optimal multisensory decision making under time-varying reliabilities with an invariant linear probabilistic population code,” Neuron, vol. 104, no. 5, pp. 1010–1021, 2019.
  • [19] R. Rideaux, K. R. Storrs, G. Maiello, and A. E. Welchman, “How multisensory neurons solve causal inference,” Proceedings of the National Academy of Sciences, vol. 118, no. 32, p. e2106235118, 2021.
  • [20] Z. Han, F. Yang et al., “Multimodal dynamics: Dynamical fusion for trustworthy multimodal classification,” CVPR, pp. 20 675–0685, 2022.
  • [21] T. Kattenborn, J. Leitloff, F. Schiefer et al., “Review on convolutional neural networks (cnn) in vegetation remote sensing,” ISPRS journal of photogrammetry and remote sensing, vol. 173, pp. 24–49, 2021.
  • [22] I. Banerjee, Y. Ling, M. C. Chen et al., “Comparative effectiveness of convolutional neural network (cnn) and recurrent neural network (rnn) architectures for radiology text report classification,” Artificial intelligence in medicine, vol. 97, pp. 79–88, 2019.
  • [23] C.-N. Jiao, Y.-L. Gao, D.-H. Ge et al., “Multi-modal imaging genetics data fusion by deep auto-encoder and self-representation network for alzheimer’s disease diagnosis and biomarkers extraction,” Engineering Applications of Artificial Intelligence, vol. 130, p. 107782, 2024.
  • [24] M. Shi, X. Li et al., “Attention-based generative adversarial networks improve prognostic outcome prediction of cancer from multimodal data,” Briefings in Bioinformatics, vol. 24, no. 6, p. bbad329, 2023.
  • [25] K. Zhang and L. Li, “Explainable multimodal trajectory prediction using attention models,” Transportation Research Part C: Emerging Technologies, vol. 143, p. 103829, 2022.
  • [26] N. Chaari, M. A. Gharsallaoui, H. C. Akdağ, and I. Rekik, “Multigraph classification using learnable integration network with application to gender fingerprinting,” Neural Networks, vol. 151, pp. 250–263, 2022.
  • [27] M. Picard, M.-P. Scott-Boyer, A. Bodein et al., “Integration strategies of multi-omics data for machine learning analysis,” Computational and Structural Biotechnology Journal, vol. 19, pp. 3735–3746, 2021.
  • [28] M. Zitnik, F. Nguyen, B. Wang et al., “Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities,” Information Fusion, vol. 50, pp. 71–91, 2019.
  • [29] M. Alfaro-Contreras, J. J. Valero-Mas, J. M. Iñesta, and J. Calvo-Zaragoza, “Late multimodal fusion for image and audio music transcription,” Expert Systems with Applications, vol. 216, p. 119491, 2023.
  • [30] S. R. Stahlschmidt, B. Ulfenborg, and J. Synnergren, “Multimodal deep learning for biomedical data fusion: a review,” Briefings in Bioinformatics, vol. 23, no. 2, p. bbab569, 2022.
  • [31] M. Abdar, F. Pourpanah, S. Hussain et al., “A review of uncertainty quantification in deep learning: Techniques, applications and challenges,” Information fusion, vol. 76, pp. 243–297, 2021.
  • [32] G. Scalia, C. A. Grambow et al., “Evaluating scalable uncertainty estimation methods for deep learning-based molecular property prediction,” J Chem Inf Model, vol. 60, no. 6, pp. 2697–2717, 2020.
  • [33] Y. Mae, W. Kumagai, and T. Kanamori, “Uncertainty propagation for dropout-based bayesian neural networks,” Neural Networks, vol. 144, pp. 394–406, 2021.
  • [34] R. Moradi, R. Berangi, and B. Minaei, “A survey of regularization strategies for deep models,” Artificial Intelligence Review, vol. 53, no. 6, pp. 3947–3986, 2020.
  • [35] H. Luo, H. Liang et al., “Teminet: A co-informative and trustworthy multi-omics integration network for diagnostic prediction,” International Journal of Molecular Sciences, vol. 25, no. 3, p. 1655, 2024.
  • [36] X. Zheng, C. Tang et al., “Multi-level confidence learning for trustworthy multimodal classification,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, pp. 11 381–11 389, Jun. 2023.
  • [37] C. Corbière, N. Thome, A. Bar-Hen, M. Cord, and P. Pérez, “Addressing failure prediction by learning model confidence,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [38] B. W. Kunkle et al., “Genetic meta-analysis of diagnosed alzheimer’s disease identifies new risk loci and implicates aβ, tau, immunity and lipid processing,” Nature genetics, vol. 51, no. 3, pp. 414–430, 2019.
  • [39] R. Markello et al., “Standardizing workflows in imaging transcriptomics with the abagen toolbox,” eLife, vol. 10, p. e72129, 2021.
  • [40] C. A. de Leeuw, J. M. Mooij, T. Heskes, and D. Posthuma, “Magma: generalized gene-set analysis of gwas data,” PLoS computational biology, vol. 11, no. 4, p. e1004219, 2015.
  • [41] V. D. Wiel et al., “Better prediction by use of co-data: adaptive group-regularized ridge regression,” Statistics in medicine, vol. 35, no. 3, pp. 368–381, 2016.
  • [42] A. Singh, C. P. Shannon et al., “Diablo: an integrative approach for identifying key molecular drivers from multi-omics assays,” Bioinformatics, vol. 35, no. 17, pp. 3055–3062, 2019.
  • [43] J. Arevalo, T. Solorio et al., “Gated multimodal units for information fusion,” arXiv preprint arXiv:1702.01992, 2017.
  • [44] T. Wang, W. Shao et al., “Mogonet integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification,” Nat Comm, vol. 12, no. 1, pp. 1–13, 2021.
  • [45] Y. Dai, F. Gieseke, S. Oehmcke, et al., “Attentional feature fusion,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), January 2021, pp. 3560–3569.
  • [46] M. A. DeTure and D. W. Dickson, “The neuropathological diagnosis of alzheimer’s disease,” Mol Neurodegener, vol. 14, no. 1, p. 32, 2019.
  • [47] S. G. Mueller et al., “Hippocampal atrophy patterns in mild cognitive impairment and alzheimer’s disease,” Human brain mapping, vol. 31, no. 9, pp. 1339–1347, 2010.
  • [48] S. W. Scheff, D. A. Price et al., “Synaptic change in the posterior cingulate gyrus in the progression of alzheimer’s disease,” J of Alzheimer’s Disease, vol. 43, no. 3, pp. 1073–1090, 2015.
  • [49] M.-C. Pai, “Time perception in people living with alzheimer’s disease,” Alzheimer’s & Dementia, vol. 17, p. e055996, 2021.