AXIAL: Attention-based eXplainability for Interpretable Alzheimer’s Localized Diagnosis using 2D CNNs on 3D MRI brain scans

Gabriele Lozupone, Alessandro Bria, Francesco Fontanella, Claudio De Stefano

Abstract

Accurate early diagnosis of Alzheimer’s disease (AD) is a critical challenge in neurodegenerative disease research. Emerging deep learning-aided diagnostic systems utilizing 3D MRI show promise but often fail to highlight meaningful and well-localized brain areas. This study presents an innovative method for 3D MRI classification via 2D CNNs, designed to enhance the explainability of model decisions. Our approach adopts a soft attention mechanism, enabling 2D CNNs to extract volumetric representations. At the same time, the importance of each slice in decision-making is learned, allowing the generation of a voxel-level attention map to produce an explainable MRI. To test our method and ensure the reproducibility of our results, we chose a standardized collection of MRI data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). On this dataset, our method significantly outperforms state-of-the-art methods in (i) distinguishing AD from cognitively normal (CN) subjects with an accuracy of 0.856 and a Matthews correlation coefficient (MCC) of 0.712, representing improvements of 2.4% and 5.3% respectively over the second-best, and (ii) the prognostic task of discerning stable from progressive mild cognitive impairment (MCI) with an accuracy of 0.725 and an MCC of 0.443, showing improvements of 10.2% and 20.5% respectively over the second-best. We achieved this prognostic result by adopting a double transfer learning strategy, which enhanced sensitivity to morphological changes and facilitated early-stage AD detection. With voxel-level precision, our method identified the specific areas the model attends to, predominantly the hippocampus, the amygdala, the parahippocampal, and the inferior lateral ventricles. All these areas are clinically associated with AD development. Our model highlighted 17 well-localized regions (considering the left and right sides separately), focusing strongly on the top 14 regions. In contrast, the compared method highlighted 88 regions, demonstrating our model’s more focused and precise identification. Furthermore, our approach consistently found the same AD-related areas across different cross-validation folds, demonstrating its robustness and precision in highlighting areas that align closely with known pathological markers of the disease.

Keywords: Alzheimer’s disease diagnosis, Explainable AI, 3D MRI, Attention mechanism

Journal: Medical Image Analysis

Affiliation: Department of Electrical and Information Engineering (DIEI), University of Cassino and Southern Lazio, Via G. Di Biasio 43, 03043 Cassino (FR), Italy

1 Introduction

Alzheimer’s disease (AD) is a chronic neurodegenerative disorder characterized by the irreversible progression of cognitive impairment and gradual death of nerve cells throughout the brain. AD diagnosis and progression prediction present a significant challenge to medical care. Initially characterized by symptoms such as Mild Cognitive Impairment (MCI), AD progressively leads to more severe cognitive decline, behavioral alterations, and loss of functional independence, eventually leading to death (Lam et al., 2013; Loewenstein et al., 2006). AD is a disorder with a rapidly increasing prevalence trend, predicted to affect 1 in 85 people globally by 2050 (Brookmeyer et al., 2007). An early diagnosis of AD is fundamental for both present and future patients once disease-modifying pharmacological treatments are available. In both cases, it will significantly improve the effectiveness of the available treatments, improving quality of life and reducing care costs (Nicoll et al., 2019; Winblad et al., 2016; Wong, 2020).

Magnetic resonance imaging (MRI) is a key diagnostic and prognostic tool for AD. It provides a non-invasive means to observe and analyze in vivo pathological changes in the brain related to AD, facilitating the study of disease evolution (Ewers et al., 2011). MRI analysis is significant as AD is identified by structural and functional changes that occur in dynamically changing morphological patterns, which are appropriately captured with high-resolution MRI (Duchesne et al., 2008; Jack et al., 2003; Klöppel et al., 2008; Vemuri et al., 2009). It is also noteworthy that brain atrophy, a distinctive AD symptom, can be identified through MRI. This form of atrophy serves as a reliable marker of the disease. It is indicative of its progression, as well as being associated with tau deposition and neuropsychological impairments, essential factors in the clinical manifestation of AD (Frisoni et al., 2010).

In recent years, deep learning (DL), particularly with convolutional neural networks (CNNs), has transformed neuroimaging data analysis for AD, moving beyond the traditional machine learning (ML) that focuses on approaches that rely on handcrafted features and classifiers (Falahati et al., 2014; Haller et al., 2011; Rathore et al., 2017). DL’s ability to autonomously extract features at different abstraction levels minimizes the need for extensive image pre-processing and feature selection, offering a more objective and less biased approach in medical imaging. Several researchers have recently proposed various DL-based approaches to diagnose and predict AD using MRI. These methods can be divided into three main categories: (i) analysis of the entire three-dimensional (3D) volume with 3D CNNs (Basaia et al., 2019; Feng et al., 2022; Wu et al., 2022; Venugopalan et al., 2021), (ii) methods based on extracting 3D patches from the volume (Goenka et al., 2022; Liu et al., 2023; Park et al., 2023; Qiu et al., 2020), and (iii) classification of two-dimensional (2D) slices selected from a specific plane (Ebrahimi et al., 2021; Hon and Khan, 2017; Kang et al., 2021; Pan et al., 2020; Tanveer et al., 2021; Zhang et al., 2022). Each of these methods has several advantages and limitations. 3D volume analysis with 3D CNN offers a comprehensive approach. However, it is computationally expensive and suffers from the scarcity of pre-trained models. This limits the opportunities for using transfer learning strategies, which is essential when dealing with small datasets (Yosinski et al., 2014). The 3D patch-based approach, while reducing computational complexity, shares similar limitations due to the lack of pre-trained models. Furthermore, this method does not directly allow the extraction of global 3D volume features crucial for accurate analysis. 
On the other hand, methods utilizing 2D slices can benefit from the abundance of 2D models pre-trained on extensive image datasets such as ImageNet (Russakovsky et al., 2015). This enables transfer learning, which can significantly increase performance in limited data regimes. Nevertheless, this method also faces a significant limitation: it fails to retain the comprehensive spatial information intrinsic to 3D volumes, which is crucial for discerning neurodegenerative patterns in AD.

The interpretability of DL models remains a significant challenge that hinders the deployment of deep learning-aided diagnostic systems (DLADS) in real-world scenarios (Singh et al., 2020; Van der Velden et al., 2022). Most DLADS base their explainability on post hoc methods, such as visual inspection of saliency maps. A popular approach in this category is Gradient-weighted Class Activation Mapping (GradCAM) (Selvaraju et al., 2017). This class of explainable artificial intelligence (XAI) methods generally struggles to offer the information needed to identify specific brain regions affected by AD, thus limiting their effectiveness in providing meaningful explanations for model decisions (Viswan et al., 2024). These challenges highlight the need for methodologies that not only achieve high diagnostic accuracy but also improve the explainability of the model by detailing the involvement of specific brain areas.

Figure 1: Schematic representation of the proposed diagnostic framework. The Diagnosis and XAI framework processes a 3D sMRI brain image to generate two key outputs: three diagnosis networks identifying the condition as either AD or CN from the three possible slicing axes and a corresponding 3D attention map that can be overlapped to the input image to visually highlight the brain regions the network focuses on to derive its diagnosis.

Our research introduces an explainable method for AD detection using 3D MRI. This advancement leverages an attention fusion mechanism of feature maps, enhancing interpretability and accuracy even with limited data. As illustrated in Fig. 1, the framework generates two key outputs: diagnosis and a 3D attention map using three networks based on different MRI slicing axes. The 3D attention map overlay on the original scan produces an explainable MRI highlighting brain regions for visual diagnosis. The significant contributions of this work are summarized as follows:

1. We introduce a novel classification and XAI approach capable of highlighting brain areas highly correlated with AD without sacrificing performance, even in the case of limited datasets.
2. We tested a double transfer learning strategy for distinguishing between stable Mild Cognitive Impairment (sMCI) and progressive Mild Cognitive Impairment (pMCI), enhancing model sensitivity to morphological changes indicative of early-stage disease progression.
3. We evaluated our XAI method using a standardized dataset expressly provided by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (Wyman et al., 2013) and compared its effectiveness to recent transformer-based approaches and related XAI techniques. Using this dataset in combination with standardized pipelines makes our results easily accessible and reproducible, aiding comparison with future studies. The framework implementation to reproduce all XAI, diagnostic and prognostic results can be found here: https://github.com/GabrieleLozupone/AXIAL.git.
4. We propose an approach to quantify how important specific brain regions are in a model’s decision-making process, facilitating the alignment of XAI with medical knowledge about AD, thus promoting usable insights in clinical settings.

The remainder of the paper is organized as follows: Section 2 discusses the related work that supports our study. In Section 3, we describe the materials, including the dataset and pipeline utilized. Section 4 outlines our methods, focusing on the diagnosis network and the proposed XAI approach. Section 5.1 presents the experimental settings, along with diagnostic and XAI results. Discussions of the findings are provided in Section 6, and Section 7 concludes the paper with a summary of our contributions.

2 Related work

As anticipated in Section 1, DL-based analysis of MRI images to diagnose and predict AD can be divided into three broad categories: (i) approaches that use 3D CNNs to analyze the entire 3D volume, (ii) approaches that analyze 3D patches from the whole 3D volume, and (iii) approaches that classify 2D slices. The following subsections report recent research activities belonging to these categories. Furthermore, we discuss the current challenges and issues in DL-based AD analysis, focusing on reliability and interpretability.

2.1 3D CNNs

In Basaia et al. (2019), the authors propose a fully convolutional 3D CNN model replacing typical max-pooling operations with standard convolution layers and test the model to distinguish AD, pMCI and sMCI from the entire subject’s brain MRI scan. The results presented in Feng et al. (2022) show that an MRI-based 3D CNN outperforms other neuroimaging biomarkers of neurodegeneration in prodromal AD and also outperforms amyloid or tau pathology biomarkers. More sophisticated approaches introduce attention mechanisms into 3D CNNs, e.g., the study presented in Jin et al. (2019) investigates a novel attention-based 3D ResNet architecture to diagnose AD and explore potential biological markers. Also, a novel Attention-based 3D Multi-scale CNN model was proposed in Wu et al. (2022) to better capture and integrate multiple spatial-scale features of AD. These approaches are very promising and comprehensive since analyzing the entire volume allows the extraction of global information on a spatial scale, unlike techniques based on conventional 2D CNNs. However, they are computationally expensive and require a large dataset during the training phase to generalize well. Furthermore, given the scarcity of pre-trained models, this problem cannot be easily addressed with a transfer learning strategy (Klaiber et al., 2021).

2.2 3D Patch-based

One way to reduce the computational cost and the overfitting risk of 3D CNN-based approaches consists of reducing the model inputs. For this reason, several approaches have been proposed to analyze 3D patches extracted from MRI scans. In Qiu et al. (2020), the authors present a novel computationally efficient patch-level training strategy to train a 3D CNN. They developed a 3D CNN model for the AD classification task, using random subvolumes of MRI scans as training data. This model generates patient-specific disease probability maps that are used to train a Multilayer Perceptron (MLP) to distinguish between AD and CN subjects. On the other hand, in Park et al. (2023), the authors introduce a framework that utilizes a 3D CNN to extract local features from 3D patches in brain MRI scans, forming the basis for patch-level responses. These responses are then processed through a dual-branch approach, combining patch classification and location identification. These methods enhance computational efficiency by focusing on patches of the 3D volume rather than processing the entire volume. However, the capacity for recognizing and learning global-level patterns within the data is limited. Moreover, the encoding process for 3D patches relies on 3D CNNs and restricts the application of transfer learning strategies.

2.3 2D Slice-level

An alternative to 3D CNNs is the more widespread family of 2D CNNs. Given the 3D nature of the input, using 2D CNNs requires extracting 2D images from the MRI volume. This process is commonly known as slicing. Slicing can be performed according to one of the three planes of the MRI brain scan: (i) axial, (ii) coronal, and (iii) sagittal (Zhou et al., 2023).

In Wang et al. (2018), the authors propose an eight-layer custom CNN with leaky rectified linear units to classify single-slice MRI images. Recently, the work presented in Carcagnì et al. (2023) investigates the ability of different CNN and Transformer-based models to classify single slices selected from MRI with a mechanism based on Shannon entropy. Although using 2D CNNs for straightforward classification of 2D images is a standard practice, this method is insufficient for making direct predictions about an individual patient. To achieve accurate subject-level predictions, it is necessary to integrate results from multiple-slice analyses.

The study presented in Kang et al. (2021) introduces an ensemble learning architecture based on 2D CNNs, using a multi-model and multi-slice ensemble, in which the majority voting scheme is used to merge the multi-slice decisions of each model. In Kushol et al. (2022), the authors use a fusion transformer block to merge the outputs of a ViT (Dosovitskiy et al., 2020) and a GFNet (Rao et al., 2021), both pre-trained on ImageNet. The proposed approach allows them to correlate features extracted from spatial and frequency domains, but predictions are made at the slice level. Thus, they are combined using a majority voting approach to make a decision at the subject level. Classifying each slice independently and then using majority voting eliminates the opportunity to learn patterns and features that span across slices.

To overcome this limitation, Altay et al. (2021) introduced the Attention Transformer model. This model comprises a VGG-16 as a Base Network and a Transformer Unit. The Base Network extracts feature representations from each brain MRI image slice. Unlike the traditional Transformer architecture that generates a new representation of the input sequence through self-attention, the Transformer Unit performs a cross-attention operation. This operation consists of considering as Query the feature map relating to the central slice of the 3D volume and as Key and Value the feature maps of the others. This method allows the fusion of information from the entire volume, capturing the interrelationships between the central and surrounding slices. Similarly, Hu et al. (2023a) proposed the Conv-Swinformer architecture consisting of a CNN and a Transformer encoder module. The CNN module summarizes the planar features of the MRI slices, and the Transformer module establishes semantic connections in 3D space for these planar features. In this case, the Transformer encoder comprises four Swin Transformer blocks that perform self-attention operations on multiple windows extracted from the feature maps sequence. However, the reliance on cross-attention or self-attention in Transformer-based models, as suggested in (Altay et al., 2021; Hu et al., 2023a), poses challenges in medical scenarios with limited datasets. The extensive parameterization inherent in these mechanisms increases the risk of overfitting when training on smaller and often imbalanced medical datasets. This undermines the model’s generalizability to new data, crucial in medical diagnostics.

The overparameterization problem can be addressed using less complex self-attention variants to build lightweight architectures. In the study Wang et al. (2024), the authors proposed a diagnosis network composed of two innovative modules in combination with basic ResNet blocks. The first is the slice-aware module designed to interpret significant slices and regions. This module performs self-attention between coded sections in an optimized way to be more parameter efficient. The second is the slice-shift module, which allows joint inter- and intra-slice modelling to exchange information with neighbouring slices.

2.4 Challenges in DL-Based AD Analysis

Despite the potential of DL in the diagnosis and prognosis of AD using MRI images, several critical issues undermine the reliability and reproducibility of the findings. Furthermore, the “black box” nature of DL impacts clinical confidence and transparency of decision-making.

This subsection provides an analysis of the challenges in the AD analysis in terms of reliability, reproducibility and XAI.

2.4.1 Reliability and Reproducibility

Recent reviews (Wen et al., 2020; Zhou et al., 2023) highlight several reliability and reproducibility issues:

1. Wrong Data Split: Data leakage due to incorrect data splitting is recurring. When datasets are not split at the subject level, data from the same subject may appear in both training and test sets, leading to overestimated model performance. This problem is particularly pronounced in patch or slice-based approaches (Wen et al., 2020).
2. Late Split in Data Processing: Data processing steps such as data augmentation and feature selection must be conducted post-data splitting to avoid contamination of the test set. If these steps involve test data, it biases the model, rendering the evaluation unreliable (Wen et al., 2020).
3. Biased Transfer Learning: Transfer learning, while beneficial, can introduce bias, particularly when the source and target domains overlap. For instance, using a network pre-trained on an AD vs CN task for an MCI vs CN task can cause bias if the CN subjects overlap in both tasks’ training or validation sets (Wen et al., 2020).
4. Absence of an Independent Test Set: The integrity of model evaluation is compromised if the test set is used for any purpose other than final performance evaluation. Hyperparameter optimization must rely on a separate validation set. The absence of an independent test set in some studies leads to inflated performance metrics (Wen et al., 2020).
5. Differences in Diagnostic Criteria: Variability in diagnostic criteria for AD across different studies creates inconsistencies in ground truth labeling. This variability complicates the comparison of results across studies and hampers the development of universally applicable models (Zhou et al., 2023).
6. Lack of Reproducibility: A significant challenge in the field is the non-availability of frameworks and models for public use. Without access to open-source code and detailed implementation procedures, reproducibility is hampered. This lack of transparency affects the ability to validate and build upon existing work (Zhou et al., 2023).
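The first issue in the list, splitting at the subject level, can be illustrated with a minimal sketch. The function name `subject_level_split`, the `(subject_id, scan_id)` tuple layout and the toy identifiers are assumptions made for illustration only; they are not taken from the released framework.

```python
import random
from collections import defaultdict

def subject_level_split(samples, test_fraction=0.2, seed=0):
    """Split (subject_id, scan_id) pairs so that no subject spans both sets."""
    by_subject = defaultdict(list)
    for subject_id, scan_id in samples:
        by_subject[subject_id].append(scan_id)

    subjects = sorted(by_subject)              # deterministic order before shuffling
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])

    # All scans of a subject follow that subject into a single partition
    train = [(s, scan) for s in subjects if s not in test_subjects
             for scan in by_subject[s]]
    test = [(s, scan) for s in subjects if s in test_subjects
            for scan in by_subject[s]]
    return train, test

# Hypothetical toy data: 10 subjects with baseline, 6- and 12-month scans
samples = [(f"sub-{i:02d}", f"ses-M{m:02d}") for i in range(10) for m in (0, 6, 12)]
train, test = subject_level_split(samples)
```

Because the shuffle operates on subjects rather than on individual scans, slices or patches later extracted from a scan inherit the partition of their subject, which is the property the reviews cited above ask for.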

Furthermore, Wen et al. (2020) point out that the prevalent use of the 2D slice approach in AD analysis is often associated with a series of methodological problems. In particular, the authors note that among the numerous studies using this approach, only a few have successfully addressed the abovementioned challenges. This fact underlines the need for rigorous methodological standards in the field.

In response to these challenges, our work was carefully developed to avoid common weaknesses identified in previous studies. Section 5.1 details the data splitting for model selection and data augmentation used for this study. Section 4.2.2 outlines the double transfer learning strategy investigated. To ensure reproducibility, we provide an open-source repository with all necessary code and implementation procedures at https://github.com/GabrieleLozupone/AXIAL.git.

2.4.2 Challenges in current XAI approaches

Several studies have highlighted the difficulties that prevent using DLADS on medical images due to the current challenges of XAI methods (Van der Velden et al., 2022; Singh et al., 2020; Viswan et al., 2024). XAI researchers often rely on their own intuition to judge what makes a good explanation, without validation by a medical professional. This missing validation is partially due to the fact that current methods provide only qualitative, not quantitative, explanations. Qualitative explanations provided for small and specific subsets of subjects make a global evaluation and comparison of XAI techniques difficult. Furthermore, there is no specific correlation between the prediction and the associated brain region. Model-agnostic XAI techniques, such as GradCAM, often produce saliency maps that highlight extensive regions of the brain without sufficient specificity or consistency. Another critical limitation of GradCAM-like methods, particularly when applied to the analysis of 2D slices from MRI scans, is their focus on intra-slice features while neglecting the inter-slice relationships crucial for identifying the 3D brain regions related to AD. Aggregating these saliency maps to form a 3D visualization produces an overly generalized activation map that often highlights different and sparse areas.

As an alternative to traditional saliency-based methods, recent advancements have explored the use of attention-based models to enhance the interpretability of DL models. In this context, Jetley et al. (2018) proposed a trainable attention mechanism to highlight where and in what proportion the network paid attention to input images for classification. Attention amplifies relevant areas and suppresses irrelevant ones, improving interpretability. In Schlemper et al. (2019), grid attention captures anatomical information in medical images, and attention coefficients were used to explain which areas of the image the network focused on. Recently Wang et al. (2024), described in Section 2.3, introduced a slice-aware module to advance XAI in their methodology. Their variant of the attention mechanism enables the extraction of attention weights to determine the importance of each slice in the decision-making process, offering a more precise understanding of model focus.

3 Materials

Table 1: Summary of participant demographics, mini-mental state examination (MMSE) and global clinical dementia rating (CDR) scores

| Class | Subjects | Samples | Age | MMSE | CDR |
|---|---|---|---|---|---|
| CN | 204 | 586 | $76.31\pm 5.22$ | $29.11\pm 1.05$ | $0.01\pm 0.16$ |
| AD | 191 | 474 | $75.23\pm 7.34$ | $22.46\pm 3.38$ | $0.86\pm 0.43$ |
| pMCI | 121 | 162 | $75.01\pm 6.68$ | $26.47\pm 1.83$ | $0.481\pm 0.09$ |
| sMCI | 110 | 154 | $75.09\pm 7.02$ | $27.80\pm 1.74$ | $0.45\pm 0.17$ |

The ADNI MRI Core (Wyman et al., 2013) has created standardized dataset collections comprising scans that met minimum quality control requirements to promote greater rigour in analysis and meaningful comparison of different algorithms. These standardized datasets within the ADNI archive allow researchers to download a complete and consistent set of images efficiently, thus facilitating comparative research into neurodegenerative diseases. We chose the “ADNI1: Complete 1Yr 1.5T” collection because it directly addresses the lack of reproducibility and facilitates model performance comparisons. This collection of 1.5 Tesla MRI scans includes 639 subjects who all have screening, 6-month and 12-month follow-up scans. Furthermore, this dataset represents a restricted subset of ADNI. Thus, it is an excellent candidate for testing our method under the data scarcity conditions that recur in medical imaging.

We converted the dataset into the Brain Imaging Data Structure (BIDS) format (Gorgolewski et al., 2016). A significant advantage of converting the ADNI dataset to a BIDS structure is the availability of specialized Python libraries for handling BIDS-structured data. These libraries, such as PyBIDS (Yarkoni et al., 2023, 2019), provide powerful tools for querying, organizing, and processing neuroimaging data. To perform this conversion, we used Clinica (Routier et al., 2021; Samper-González et al., 2018), an open-source software platform explicitly designed to make clinical neuroscience studies more accessible and reproducible.

During the conversion phase, the ADNI-to-BIDS converter provided in Clinica checks whether the data meets specific criteria, e.g. images that fail quality control are filtered out or, if there are multiple scans for a visit, the ‘preferred scan’ is selected according to specific decision rules (Routier et al., 2021). Consequently, some of the MRIs in the original dataset may be discarded at this stage. The “ADNI1: Complete 1Yr 1.5T” dataset contains 2,042 acquisitions from 639 subjects. During the conversion, 151 scans did not meet the criteria, leaving 1,891 samples; the number of subjects was unchanged. Each subject had a maximum of 3 scans (baseline, 6-month and 12-month follow-up), and consequently, samples from the same patient with a different diagnosis are unlikely. Therefore, to define sMCI patients and pMCI patients, we used the complete clinical data from ADNI, which contains the diagnoses after several years of follow-up.

From the resulting dataset, we have extracted four classes:

1. CN: 3D brain images of patients diagnosed as CN at the time of image acquisition.
2. AD: 3D brain images of patients diagnosed with AD at the time of image acquisition.
3. sMCI: 3D brain images of patients diagnosed with MCI at the time of image acquisition and who remain MCI over time.
4. pMCI: 3D brain images of patients initially diagnosed with MCI at the time of scan but later diagnosed with AD 36 months post-acquisition.

The demographic information of the subjects in the “ADNI1: Complete 1Yr 1.5T” dataset is presented in Table 1, which provides essential context for our analysis and findings.

4 Methods

This section presents the explainable diagnostic pipeline developed in this study. The process starts with a pre-processing phase described in Fig. 2. The XAI diagnosis network illustrated in Fig. 3 processes the images as a series of 2D slices, producing diagnosis and attention weights to synthesize the slice’s importance in decision-making. Finally, the attention weights of the slices across the three distinct views are combined to create a 3D attention map, as shown in Fig. 4.

In the following sections, we provide full details of each step.

4.1 MRI images pre-processing

As mentioned in Section 2.4, classification results are difficult to compare between studies due to their limited reproducibility. Wen et al. (2020) highlight that variations in components such as participant selection and image pre-processing are critical aspects that directly influence this limitation. Removing ambiguity in both is therefore sufficient to facilitate the comparison between models. In Section 3, we described the procedures required to avoid ambiguity in participant selection. The pre-processing pipeline used comprises three steps:

1. Bias field correction using the N4ITK method (Tustison et al., 2010);
2. Affine registration using the SyN algorithm (Avants et al., 2008) from ANTs (Avants et al., 2014) to align each image to the Montreal Neurological Institute (MNI) 152 space with the ICBM 2009c nonlinear symmetric template (Fonov et al., 2009, 2011);
3. Skull stripping to remove non-brain tissue from the 3D image using the Brain Extraction Tool (Smith, 2002) from FSL (Jenkinson et al., 2012).

We used the t1-linear pipeline of Clinica (Routier et al., 2021; Wen et al., 2020) to perform the first two steps.

4.2 Diagnosis network

Radiologists typically review a series of 2D slice images when analyzing medical imaging data, even when examining 3D images. With this in mind, we have developed an efficient and explainable network for diagnosing AD that learns inter-slice relationships. This network, as illustrated in Fig. 3, comprises three key modules described in detail in this section.

4.2.1 Feature Extraction

This module employs a pre-trained 2D convolutional backbone to process a series of $N$ 2D slices. The convolutional backbone, which can be any architecture such as VGG or ResNet, is pre-trained on the ImageNet dataset to handle cases of small dataset sizes in AD diagnosis tasks.

The original data are 3D images with a value for each voxel, so the resulting sliced 2D images are 1-channel images. Since pre-trained weights are for 3-channel images, we summed the pre-trained convolutional filters of the backbone’s first layer. Since these filters operate linearly and sum their results across channels, we can sum all the filters over the input channel dimension in the first layer. This operation is equivalent to replicating the single-channel image three times, but it is more computationally efficient. The extracted slices were resized to $224\times 224$ and normalized.
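The equivalence claimed above follows from the linearity of convolution and can be checked numerically. The sketch below uses a naive NumPy cross-correlation with random stand-in filters; `conv2d_valid`, the tensor shapes and the random values are illustrative assumptions, not the pre-trained backbone or the authors' code.

```python
import numpy as np

def conv2d_valid(image, kernels):
    """Naive 'valid' cross-correlation.

    image:   (C_in, H, W)
    kernels: (C_out, C_in, kH, kW)
    returns: (C_out, H - kH + 1, W - kW + 1)
    """
    c_out, _, kh, kw = kernels.shape
    _, h, w = image.shape
    out = np.zeros((c_out, h - kh + 1, w - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = image[:, i:i + kh, j:j + kw]              # (C_in, kH, kW)
            # Contract over input channels and both kernel axes
            out[:, i, j] = np.tensordot(kernels, patch, axes=3)
    return out

rng = np.random.default_rng(0)
rgb_kernels = rng.normal(size=(4, 3, 3, 3))   # stand-in for 3-channel first-layer filters
gray_slice = rng.normal(size=(1, 8, 8))       # stand-in for one single-channel MRI slice

# Option A: replicate the slice three times and keep the original filters
out_replicated = conv2d_valid(np.repeat(gray_slice, 3, axis=0), rgb_kernels)

# Option B: sum the filters over the input-channel axis and feed one channel
gray_kernels = rgb_kernels.sum(axis=1, keepdims=True)        # (4, 1, 3, 3)
out_summed = conv2d_valid(gray_slice, gray_kernels)
```

Both options give the same feature maps up to floating-point error, while Option B convolves one input channel instead of three in the first layer.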

Each convolutional backbone ends with a max-pooling operation that converts the feature map of each slice into a feature vector of dimension $f_{\text{dim}}$ . This sequential input processing is achieved through sharing convolutional weights across the sequence, similar to a Recurrent Neural Network approach:

$$f_{i}=\text{MaxPool}(\text{Conv}(\dots(x_{i})))$$

where $x_{i}$ represents the $i$ -th slice in the sequence, $f_{i}$ is the resulting feature vector for that slice, and Conv denotes the convolutional operations performed by a convolutional architecture.

4.2.2 Double Transfer Learning

Models pre-trained on large datasets can help overcome challenges posed by limited data, but they may not be enough for complex tasks. As shown in Table 1, the data available for training a network to differentiate between AD and CN individuals is usually more than for distinguishing between sMCI and pMCI. Although both tasks are complex, the latter is considered more difficult. Therefore, it is more likely that transfer learning can distinguish AD from CN more effectively than recognizing the difference between sMCI and pMCI. Since doctors rely on their expertise to assess a patient’s cognitive decline, leading to an AD diagnosis, we suggest fine-tuning a model pre-trained on the AD vs. CN task to the sMCI vs. pMCI task. Following our definition of classes in Section 3, the $AD\cup CN$ set is disjoint from $(sMCI\cup pMCI)\subset MCI$. As a result, adopting this strategy under these assumptions does not imply data leakage issues.
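The disjointness assumption can be verified programmatically before fine-tuning. The following is a minimal sketch under the assumption that each subject carries a single class label; the helper names `subjects_by_task` and `check_disjoint` and the toy subject identifiers are hypothetical.

```python
def subjects_by_task(subject_labels, classes):
    """Return the set of subject IDs whose label belongs to the given task."""
    return {s for s, label in subject_labels.items() if label in classes}

def check_disjoint(source_subjects, target_subjects):
    """Raise if any subject is shared between the source and target tasks."""
    overlap = source_subjects & target_subjects
    if overlap:
        raise ValueError(f"subjects shared between tasks: {sorted(overlap)}")
    return True

# Hypothetical toy labels, one class per subject
subject_labels = {
    "sub-01": "CN", "sub-02": "AD", "sub-03": "sMCI",
    "sub-04": "pMCI", "sub-05": "CN",
}
pretrain_subjects = subjects_by_task(subject_labels, {"AD", "CN"})
finetune_subjects = subjects_by_task(subject_labels, {"sMCI", "pMCI"})
is_disjoint = check_disjoint(pretrain_subjects, finetune_subjects)
```

A check of this kind guards against the biased transfer learning scenario, in which subjects seen during pre-training reappear in the evaluation data of the target task.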

4.2.3 Attention XAI Fusion Module

This module enables the network to learn inter-slice dependencies and global 3D patterns through the Feature Extraction module. During backpropagation, the shared weights of the convolutional module are updated based on the entire image, so 2D patterns are learned in relation to the importance of each slice at a global level. The Attention XAI Fusion module leverages a fully connected (FC) layer to assign an importance score to each slice based on its feature vector. This is a parameter-efficient approach, with only $f_{\text{dim}}+1$ learnable parameters:

w_{i}=\text{FC}_{attention}(f_{i})

where $w_{i}$ is the computed weight for slice $i$ , and FC denotes the fully connected layer operation. Normalization of these weights is performed using a softmax function, ensuring they sum to one and represent the slice-importance distribution:

\alpha_{i}=\textit{softmax}(w_{i})=\frac{e^{w_{i}}}{\sum_{j=1}^{N}e^{w_{j}}}

where $\alpha_{i}$ is the normalized importance weight for the $i$ -th slice. The module performs fusion by calculating a weighted sum of the feature vectors using their corresponding weights, generating a composite feature vector representing the entire brain image:

F=\sum_{i=1}^{N}\alpha_{i}f_{i}

This module yields a feature vector $F$ that synthesizes the brain image and a set of attention weights $\{\alpha_{i}\}$ highlighting the significance of each slice.
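The three steps of the fusion (scoring, softmax normalization, weighted sum) can be sketched in a few lines. This is a minimal, framework-free illustration rather than the network code: the function name `attention_fusion`, the toy features, and the FC parameters are assumptions chosen for readability.

```python
import math


def attention_fusion(features, fc_weight, fc_bias):
    """Fuse N slice feature vectors into one vector with soft attention."""
    # w_i = FC_attention(f_i): one scalar score per slice, using
    # f_dim weights plus one bias (f_dim + 1 parameters in total).
    scores = [sum(w * x for w, x in zip(fc_weight, f)) + fc_bias for f in features]
    # Softmax over the slices (shifted by the max score for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # F = sum_i alpha_i * f_i
    dim = len(features[0])
    fused = [sum(a * f[d] for a, f in zip(alphas, features)) for d in range(dim)]
    return fused, alphas


# Toy input: N = 3 slices, f_dim = 2 (illustrative values only).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, alphas = attention_fusion(features, fc_weight=[2.0, -1.0], fc_bias=0.1)
print(alphas, fused)
```

With all-zero FC weights every slice receives the same weight $1/N$, so the fusion reduces to a plain average of the slice features.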

4.2.4 Diagnosis

In the final Diagnosis module, the synthesized feature vector $F$ is fed into a head network comprising a fully connected layer with two output neurons for the binary classification task (e.g., AD vs. CN):

D=\textit{softmax}(\text{FC}_{\text{head}}(F))

$D$ represents the softmax output, providing a probability distribution over the diagnosis classes, and $\text{FC}_{\text{head}}$ denotes the fully connected layer specific to the diagnosis task.

4.3 XAI Approach

Within our diagnostic framework for AD, we introduce an advanced XAI strategy to generate a 3D attention map, facilitating a deeper understanding of the neural network’s diagnostic focus. As shown in Fig. 4, our approach harnesses attention weights across sagittal, coronal, and axial slicing planes, merging these into a comprehensive representation that highlights key brain regions indicative of AD.

4.3.1 Attention Weight Generation

The implemented XAI method yields attention weights for each of the three principal slicing planes of the brain: sagittal ( $s$ ), coronal ( $c$ ), and axial ( $a$ ). We trained a dedicated Diagnosis and XAI Network for each plane to assess slice importance distribution from different views. The MRI brain image is sliced in the three planes, resulting in three sequences of 2D slices: $N_{s}$ , $N_{c}$ , and $N_{a}$ slices for the sagittal, coronal, and axial planes.

Each network’s Attention XAI module computes a set of attention weights $\mathbf{\alpha}_{s}$ , $\mathbf{\alpha}_{c}$ , and $\mathbf{\alpha}_{a}$ for its corresponding plane. The computation is formalized as follows:

\mathbf{\alpha}_{p}=\textit{softmax}\left(\text{FC}_{attention}(f_{p})\right)

where $p\in\{s,c,a\}$ represents each plane, $f_{p}$ denotes the feature vectors derived from slices within that plane.

4.3.2 3D Attention Map Synthesis

We integrate the attention weights $\mathbf{\alpha}_{s}$ , $\mathbf{\alpha}_{c}$ , and $\mathbf{\alpha}_{a}$ into a unified 3D attention map. The 3D attention map $A$ is constructed by applying the following operation to each voxel within the brain’s imaging volume:

A[i,j,k]=\mathbf{\alpha}_{s}[i]\cdot\mathbf{\alpha}_{c}[j]\cdot\mathbf{\alpha}_{a}[k]

For every voxel at coordinates $[i,j,k]$ , its attention value is derived from the product of the corresponding sagittal, coronal, and axial attention weights. This process retains and amplifies the diagnostic importance perceived across all three anatomical planes, yielding a comprehensive 3D depiction of attention distribution.

To facilitate interpretation and comparison, we normalize the entire 3D attention map in the range [0, 1], ensuring that the highest values correspond to areas of high diagnostic relevance. We used min-max normalization:

A=\frac{A-\min(A)}{\max(A)-\min(A)}

The resulting map highlights the brain regions most significantly associated with the diagnostic output of the network, offering insights into the pathological hallmarks of AD as learned by the model through its training process.
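The map synthesis amounts to an outer product of the three weight vectors followed by min-max scaling. The sketch below illustrates this on nested lists; the function name `attention_map_3d`, the toy weights, and the handling of a constant map are assumptions made for the example.

```python
def attention_map_3d(alpha_s, alpha_c, alpha_a):
    """Outer product of the per-plane weights, then min-max normalization."""
    # A[i][j][k] = alpha_s[i] * alpha_c[j] * alpha_a[k]
    raw = [[[s * c * a for a in alpha_a] for c in alpha_c] for s in alpha_s]
    flat = [v for plane in raw for row in plane for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # Assumption: a constant map carries no ranking, so return zeros.
        return [[[0.0 for _ in alpha_a] for _ in alpha_c] for _ in alpha_s]
    return [[[(v - lo) / span for v in row] for row in plane] for plane in raw]


# Toy weights for 2 sagittal, 2 coronal and 2 axial slices.
A = attention_map_3d([0.1, 0.9], [0.5, 0.5], [0.2, 0.8])
print(A[1][0][1], A[0][0][0])
```

Because the voxel value is a product, a voxel scores high only when its sagittal, coronal, and axial slices are all considered important.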

4.3.3 Brain regions importance quantification

To achieve comparability across subjects, each MRI image, denoted by $I$ , was normalized to the MNI 152 standard space using the transformation function $f_{MNI152}$ , thus $I_{norm}=f_{MNI152}(I)$ . This step is crucial for aligning brain structures across different individuals to facilitate the identification of AD-specific biomarkers.

Generation of Binary Heatmap A binary heatmap, $H_{binary}$ , was generated to isolate regions of significant structural patterns associated with AD, utilizing a threshold $\theta$ set at the 99.9th percentile. The binary heatmap is defined as:

H_{\text{binary}}[i,j,k]=\begin{cases}1,&\text{if }A[i,j,k]>\theta\\ 0,&\text{otherwise}\end{cases}

Overlay of Heatmap on MRI Data For visualization purposes, the MRI data $I_{norm}$ was augmented by overlaying $H_{binary}$ to enhance the saliency of the regions implicated in AD:

I_{XAI}=I_{norm}+H_{\text{binary}}\times\delta

where $\delta$ is an amplification factor set to 10 to increase the prominence of the heatmap overlay.
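The thresholding and overlay steps can be sketched as follows. The percentile here uses the nearest-rank definition, which is an assumption (numerical libraries often interpolate instead), and the toy example uses the 75th percentile because a 2x2x2 volume is too small for the 99.9th; the function names are hypothetical.

```python
import math


def percentile_nearest_rank(values, q):
    """q-th percentile (0 < q <= 100) with the nearest-rank definition."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def binary_heatmap(attention, q=99.9):
    """H[i][j][k] = 1 where the attention value exceeds the q-th percentile."""
    flat = [v for plane in attention for row in plane for v in row]
    theta = percentile_nearest_rank(flat, q)
    heatmap = [[[1 if v > theta else 0 for v in row] for row in plane]
               for plane in attention]
    return heatmap, theta


def overlay(image, heatmap, delta=10):
    """I_XAI = I_norm + H_binary * delta, computed voxel by voxel."""
    return [
        [[v + h * delta for v, h in zip(i_row, h_row)]
         for i_row, h_row in zip(i_plane, h_plane)]
        for i_plane, h_plane in zip(image, heatmap)
    ]


# Toy 2x2x2 attention map and a constant toy image.
attention = [[[0.0, 0.1], [0.2, 0.3]], [[0.4, 0.5], [0.9, 1.0]]]
image = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
heatmap, theta = binary_heatmap(attention, q=75)
highlighted = overlay(image, heatmap)
print(theta, heatmap, highlighted[1][1])
```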

Identification and Analysis of Regions of Interest Employing a template atlas, we determined regions of interest (ROI) within the brain. For each ROI identified with a unique label $r$ , we computed the overlap $O_{r}$ as follows:

O_{r}=\{(i,j,k)\;:\;H_{\text{binary}}[i,j,k]>0\;\land\;T_{atlas}[i,j,k]=r\}

where $T_{atlas}$ represents the template atlas data. The quantitative evaluation of each region’s overlap involved several statistical measures calculated from the MRI data. The overlap volume of the heatmap in a given region is:

V_{r}=|O_{r}|

which can be useful to indicate the predominant regions in the decision process. The mean ( $\mu_{r}$ ) intensity of the heatmap within each identified region is given by

\mu_{r}=\frac{1}{V_{r}}\sum_{(i,j,k)\in O_{r}}A[i,j,k],

which represents the average activity level across the voxels of interest, allowing us to pinpoint regions with consistently high AD-related changes. The standard deviation ( $\sigma_{r}$ ) of the heatmap intensities,

\sigma_{r}=\sqrt{\frac{1}{V_{r}-1}\sum_{(i,j,k)\in O_{r}}(A[i,j,k]-\mu_{r})^{2}},

provides insight into the variability of attention within the region, indicating the heterogeneity of AD impact across different brain areas. The maximum ( $A_{max,r}$ ) and minimum ( $A_{min,r}$ ) values of the heatmap,

A_{\text{max},r}=\max_{(i,j,k)\in O_{r}}A[i,j,k]

and

A_{\text{min},r}=\min_{(i,j,k)\in O_{r}}A[i,j,k],

respectively, highlight the extremes in activation within the region, shedding light on the most and least affected areas within the brain’s AD-implicated regions. Lastly, the percentage of overlap ( $P_{r}$ ) is calculated as

P_{r}=\frac{V_{r}}{\left|\{(i,j,k)\;:\;T_{atlas}[i,j,k]=r\}\right|}

to provide a single value that quantifies the portion of a specific brain region involved in the AD-related patterns identified by the network. This measure is a critical indicator of the importance of each region in distinguishing AD patients from CN individuals.
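The region-level measures can be gathered in a single pass over the volume. The sketch below is an illustration under assumptions: the inputs are nested `[i][j][k]` lists already aligned to the atlas, the function name `roi_statistics` is hypothetical, and $P_{r}$ is taken as the fraction of the region's atlas voxels that fall inside the binary heatmap, which is our reading of "percentage of overlap".

```python
import math


def roi_statistics(attention, heatmap, atlas, label):
    """Overlap statistics of the atlas region with the given label."""
    overlap = []      # attention values of the voxels in O_r
    region_size = 0   # number of atlas voxels carrying this label
    for a_plane, h_plane, t_plane in zip(attention, heatmap, atlas):
        for a_row, h_row, t_row in zip(a_plane, h_plane, t_plane):
            for a, h, t in zip(a_row, h_row, t_row):
                if t == label:
                    region_size += 1
                    if h > 0:
                        overlap.append(a)
    volume = len(overlap)  # V_r
    if volume == 0:
        return None  # the heatmap does not touch this region
    mean = sum(overlap) / volume
    # Sample standard deviation (V_r - 1 in the denominator);
    # set to 0.0 when only one voxel overlaps.
    if volume > 1:
        std = math.sqrt(sum((a - mean) ** 2 for a in overlap) / (volume - 1))
    else:
        std = 0.0
    return {
        "V": volume,
        "mean": mean,
        "std": std,
        "max": max(overlap),
        "min": min(overlap),
        "P": volume / region_size,
    }


# Toy 1x2x2 volume: three voxels belong to region 7, two of them are highlighted.
attention = [[[0.2, 0.8], [0.6, 0.1]]]
heatmap = [[[0, 1], [1, 0]]]
atlas = [[[7, 7], [7, 3]]]
stats = roi_statistics(attention, heatmap, atlas, label=7)
print(stats)
```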

5 Experiment results

In this section, we first present the experimental setup developed to validate the method. Then, we present the results of the model in comparison to recent state-of-the-art attentional models on 2D slices. Finally, we present qualitative and quantitative results of our XAI approach compared to other existing XAI methods.

5.1 Experiment Setup

We implemented all considered methods in PyTorch (Paszke et al., 2019) and trained them on an NVIDIA A100 80GB GPU. The pre-processed input 3D images (see Section 4.1) were converted to a sequence of $N$ slices at runtime. The $N$ slices were selected from the center by a slicing operation from the chosen plane. Specifically, given $D$ , the dimension corresponding to the number of slices along the slicing direction, the selected slices belonged to the interval $[D/2-N/2,D/2+N/2]$ . This slicing operation resulted in $N<D$ slices, thus reducing computational complexity, with $N$ considered as an optimization hyperparameter. All tested networks were trained with the AdamW optimizer with base learning rate $1\times 10^{-4}$ and weight decay $1\times 10^{-2}$ . An early stopping strategy with patience 15 selected the best model on the validation set during the training phase. The training data augmentation strategy consisted of randomly flipping a 2D slice in the series with 0.3 probability. We implemented a 5-fold cross-validation technique to ensure robustness and generalizability of the results. We divided the dataset into five segments, ensuring each segment acted as a test set in one iteration and as part of the training/validation sets in the others. To prevent data leakage and preserve the integrity of our evaluation, we carefully assigned all 3D images from the same subject exclusively to one of the training, validation, or test sets. In each fold, the 80% portion allocated for training/validation was further divided using an 80-20 ratio.
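Two parts of this setup lend themselves to a short sketch: the central slice selection and the subject-level fold assignment. Both functions below are illustrative assumptions rather than the actual pipeline: the window is taken as half-open so that exactly $N$ indices are returned, and subjects are assigned to folds round-robin without shuffling or class stratification.

```python
def central_slice_indices(depth, num_slices):
    """Indices of the N central slices along an axis with D = depth slices."""
    if num_slices > depth:
        raise ValueError("cannot select more slices than the volume contains")
    start = depth // 2 - num_slices // 2
    return list(range(start, start + num_slices))


def subject_folds(scan_subjects, k=5):
    """Assign each scan to a fold so that a subject never spans two folds."""
    subjects = sorted(set(scan_subjects))
    fold_of_subject = {subject: idx % k for idx, subject in enumerate(subjects)}
    return [fold_of_subject[subject] for subject in scan_subjects]


# Toy volume with D = 10 slices, keeping the N = 4 central ones.
indices = central_slice_indices(depth=10, num_slices=4)

# Toy cohort: hypothetical subject IDs, some with several scans.
scans = ["s1", "s1", "s2", "s3", "s3", "s4", "s5", "s6"]
folds = subject_folds(scans, k=5)
print(indices, folds)
```

Grouping by subject before splitting is what keeps repeated scans of the same person from appearing on both sides of a train/test boundary.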

5.1.1 Performance Evaluation

We used Accuracy (ACC), Sensitivity (SEN), Specificity (SPE), and Matthews Correlation Coefficient (MCC) as performance metrics. ACC is the most common performance measure for classification, and it consists of the percentage of correctly classified samples over the total. MCC (Chicco and Jurman, 2020) is a correlation coefficient between predictions and true labels; it provides a more informative and truthful score than accuracy when evaluating binary classifications, allowing a more realistic interpretation of classifier performance, especially in the case of unbalanced datasets. SEN and SPE serve as critical metrics for evaluating the diagnostic accuracy of our model, highlighting its ability to correctly identify positive and negative cases, respectively. In the AD vs. CN task, the AD samples are in the positive class, and the CN samples are in the negative one. In the sMCI vs. pMCI task, the sMCI samples are in the positive class, and the pMCI samples are in the negative one.

All the results obtained are averaged on the test sets from the 5-fold cross-validation.
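For reference, the four metrics follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function name `binary_metrics`, the convention of returning 0.0 for an undefined ratio, and the example counts are assumptions for illustration and do not correspond to any result in this paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """ACC, SEN, SPE and MCC from the four confusion-matrix counts."""
    def ratio(num, den):
        # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
        return num / den if den else 0.0

    acc = ratio(tp + tn, tp + fp + tn + fn)
    sen = ratio(tp, tp + fn)   # true positive rate
    spe = ratio(tn, tn + fp)   # true negative rate
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"ACC": acc, "SEN": sen, "SPE": spe, "MCC": mcc}


# Made-up counts for a test fold with 50 positive and 50 negative samples.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
print(metrics)
```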

5.2 Diagnostic results

This subsection focuses on our diagnosis network’s results, summarizing the evaluations of various 2D convolutional backbones, the impact of different slicing planes, optimization of network parameters, the effectiveness of double transfer learning, and a comparison with state-of-the-art methods. Almost all experiments use the axial slicing plane due to its common use in clinical applications and because it generally yields better results.

5.2.1 2D backbone

As the first step, we analyzed the effectiveness of different convolutional backbones in the AD vs. CN task. To this aim, two sets of experiments were performed. First, we evaluated the performance of different backbones by considering them in the Feature Extraction Module of our network. Second, to allow an evaluation independent of our method, the best networks for each family were evaluated with a 2D Majority Voting approach, which is also considered one of the baseline methods. In the Majority Voting approach, we train 2D convolutional networks by individually labelling the slices. Consequently, the weights are updated from the slice-level error during training. Then, the predictions of the slices of a single 3D image are aggregated by choosing the most recurring one in the sequence as the final label.
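The aggregation step of the Majority Voting baseline can be sketched as below. The function name and the string labels are hypothetical, and the tie-breaking rule (the label seen first wins) is simply the behavior of `Counter.most_common`, not something specified for the baseline.

```python
from collections import Counter


def majority_vote(slice_predictions):
    """Subject-level label: the most frequent slice-level prediction."""
    counts = Counter(slice_predictions)
    # most_common keeps first-seen order among labels with equal counts.
    return counts.most_common(1)[0][0]


# Toy sequence of slice-level predictions for one 3D image.
subject_label = majority_vote(["AD", "CN", "AD", "AD", "CN"])
print(subject_label)
```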

Table 2: Performances varying backbone in Feature Extraction module in AD vs. CN task averaged over 5-fold cross-validation

Backbones	ACC	SPE	SEN	MCC
VGG16	0.829	0.839	0.816	0.655
VGG19	0.819	0.875	0.749	0.633
ResNet34	0.770	0.843	0.681	0.534
ResNet50	0.804	0.860	0.734	0.602
ResNet101	0.785	0.822	0.739	0.563
EfficientNetV2S	0.764	0.805	0.713	0.521
EfficientNetV2M	0.764	0.833	0.679	0.520
DenseNet121	0.798	0.841	0.745	0.590

The first pool of experiments was conducted using a mini-batch of 16 and freezing the first 50% of the backbone’s layers. The convolutional architectures chosen are: (i) VGG16 and VGG19 (Simonyan and Zisserman, 2014), (ii) ResNet34, ResNet50 and ResNet101 (He et al., 2016), (iii) EfficientNetV2 Small and EfficientNetV2 Medium (Tan and Le, 2021) and (iv) DenseNet121 (Huang et al., 2017). The results of this pool are shown in Table 2. In this first analysis, VGG16 had the highest accuracy (ACC: 0.829) and sensitivity (SEN: 0.816), suggesting it is the better-suited backbone for AD patient identification. VGG19, while slightly less accurate, showed superior specificity (SPE: 0.875), indicating it is particularly adept at identifying CN cases. The ResNet series demonstrated that additional depth (as seen in ResNet101) does not equate to improved performance, with ResNet50 showing better accuracy and sensitivity. This could suggest diminishing returns with increased complexity for this task. EfficientNet models trailed behind their counterparts with identical accuracies (ACC: 0.764). DenseNet121, while providing good accuracy (ACC: 0.798), also did not meet the higher results obtained by other architectures. The MCC aligns with these findings, confirming VGG16 as the top-performing model in this case.

Based on the results obtained in Table 2, we selected the best backbone from each architecture family as candidates for the second pool of experiments. Since this pool comprises 2D rather than 3D images, we chose 32 as the mini-batch size. To achieve comparable results, the same percentage of backbone layers was frozen. As shown in Table 3, the performance trends of the architectures remain consistent with previous observations. VGG16 continued to lead in accuracy (ACC: 0.804) and MCC (MCC: 0.605). This result aligns with other works that design their approaches relying on a VGG as base network (Altay et al., 2021; Hu et al., 2023a). Remarkably, our method outperforms the Majority Voting one across several metrics. The superiority can be seen from the average increase across all tested backbones of ACC by 1.6% and MCC by 3.08%. Notably, a good increase was observed with VGG16, with an ACC improvement of 2.5% and an MCC improvement of 5.0%.

Table 3: Performances varying backbone in Majority Voting approach in AD vs. CN task over 5-fold cross-validation.

Backbones	ACC	SPE	SEN	MCC
VGG16	0.804	0.897	0.688	0.605
ResNet50	0.798	0.873	0.707	0.592
EfficientNetV2S	0.759	0.839	0.664	0.516
DenseNet121	0.770	0.848	0.673	0.532
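The average gains quoted for our method over Majority Voting can be reproduced from the ACC and MCC columns of Tables 2 and 3, which also shows that they are absolute differences expressed in percentage points. The snippet below is only a worked arithmetic check; the variable names are arbitrary.

```python
# (ACC, MCC) per backbone, copied from Table 2 (our method) and Table 3.
proposed = {
    "VGG16": (0.829, 0.655),
    "ResNet50": (0.804, 0.602),
    "EfficientNetV2S": (0.764, 0.521),
    "DenseNet121": (0.798, 0.590),
}
majority_voting = {
    "VGG16": (0.804, 0.605),
    "ResNet50": (0.798, 0.592),
    "EfficientNetV2S": (0.759, 0.516),
    "DenseNet121": (0.770, 0.532),
}


def mean_gain(metric_index):
    """Average absolute gain over the four backbones, in percentage points."""
    gains = [
        proposed[name][metric_index] - majority_voting[name][metric_index]
        for name in proposed
    ]
    return 100 * sum(gains) / len(gains)


acc_gain = mean_gain(0)   # about 1.6
mcc_gain = mean_gain(1)   # about 3.075, reported as 3.08
vgg16_gain = tuple(
    100 * (p - m) for p, m in zip(proposed["VGG16"], majority_voting["VGG16"])
)
print(acc_gain, mcc_gain, vgg16_gain)
```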

5.2.2 Slicing plane

Although the axial plane is the most commonly used in clinical applications, it is not necessarily the best choice in this classification scenario. We therefore performed a slicing plane analysis after finding that VGG16 is the best backbone in the diagnostic task. For this set of experiments, we used 100 slices for each view, a batch size of 8, a learning rate of $1\times 10^{-4}$ , and froze the first half of the backbone in training. Table 4 shows the performances averaged over the five cross-validation test sets varying the slicing plane. The results show that the axial plane achieved the best performance. This finding aligns with the clinical choice.

Table 4: Performances varying slicing plane with our approach in AD vs. CN task over 5-fold cross-validation.

Slicing plane	ACC	SPE	SEN	MCC
Axial	0.839	0.872	0.799	0.674
Coronal	0.816	0.897	0.715	0.628
Sagittal	0.820	0.860	0.772	0.636

5.2.3 Parameters Optimization

Since VGG16 was confirmed as the best backbone among those under investigation and the axial plane provided the best results, we selected this configuration for our Feature Extraction module. The second step consisted of optimizing other parameters: the number of slices extracted $N$ , the batch size, and the percentage of backbone to freeze.

Table 5: Network Results using VGG16 as the backbone for Feature Extraction module in AD vs. CN task

Num Slices	Batch Size	Freezing	ACC	SPE	SEN	MCC
70	8	50%	0.821	0.851	0.782	0.636
80	4	50%	0.825	0.867	0.774	0.646
80	8	0%	0.818	0.861	0.764	0.630
80	8	25%	0.840	0.892	0.776	0.677
80	8	50%	0.856	0.910	0.792	0.712
80	8	75%	0.799	0.877	0.707	0.603
80	16	50%	0.829	0.839	0.816	0.655
90	8	50%	0.818	0.867	0.757	0.630
100	8	50%	0.839	0.872	0.799	0.674
120	8	50%	0.815	0.858	0.761	0.624

The results in Table 5 indicate that varying the number of slices and the percentage of initial backbone layers frozen in the network impacted performance significantly. The configuration with 70 slices, batch size of 8, and freezing 50% of the layers resulted in an ACC of 0.821 and MCC of 0.636. When the batch size was reduced to 4 under the same conditions, a slight improvement was observed with an ACC of 0.825 and MCC of 0.646. After increasing the number of slices to 80 and employing a batch size of 8, experiments with 0%, 25%, 50%, and 75% freezing were conducted. The highest accuracy was achieved at 50% freezing, yielding an ACC of 0.856 and MCC of 0.712, the best overall performance across all configurations. With a further increase to 90 slices, the performance did not improve, with an ACC of 0.818 at 50% freezing and batch size of 8. Finally, the performance fluctuated slightly with 100 and 120 slices but did not surpass the optimal results obtained with 80 slices and 50% freezing.

Table 6: Double transfer learning for sMCI vs. pMCI task at 36 months

Transfer Learning	Freezing	ACC	SPE	SEN	MCC
ImageNet	50%	0.4822	0.427	0.636	0.065
ImageNet + AD vs. CN	50%	0.703	0.809	0.4873	0.396
ImageNet + AD vs. CN	75%	0.725	0.763	0.678	0.443

5.2.4 Double Transfer Learning

The third step involved investigating the utility of transfer learning for predicting AD progression with the double transfer learning strategy proposed in Section 4.2.2. This approach leverages a backbone pre-trained on the ImageNet dataset, further fine-tuned through a subsequent task distinguishing AD from CN individuals. As mentioned in Section 4.2.2, it is crucial to note that the patient cohorts classified as AD or CN differ from those labeled as sMCI or pMCI, given that the latter are diagnosed as MCI. Consequently, this strategy does not cause data leakage problems.

First, we fine-tuned our model with VGG16 pre-trained on ImageNet in the Feature Extraction module on the entire AD vs. CN dataset. We froze the first 50% of the layers as this configuration provided the best results in Table 5. The final model pre-trained on the AD vs. CN task was selected using 10% of the dataset as a validation set. This step provides a model in which the first convolutional part is highly specialized in extracting fine-grained low-level feature representations and the rest in analyzing AD-related high-level patterns. Second, we further fine-tuned the AD pre-trained model on the sMCI vs. pMCI classification task at a 36-month forecast interval. The learning rate for this pool of experiments was reduced to $1\times 10^{-5}$ to fine-tune the pre-trained model weights, ensuring that the disease knowledge is preserved during the training process. Table 6 demonstrates the efficacy of this approach in this task. With baseline ImageNet transfer learning and 50% of the network layers frozen, the model achieved an ACC of 0.4822 and an MCC of 0.065. The MCC value of 0.065 indicates that the model could not find useful correlations to distinguish the two cases. Incorporating AD vs. CN knowledge with 50% freezing resulted in a marked enhancement across metrics: an ACC of 0.703 and an MCC of 0.396. Advancing to 75% freezing of the network layers under the double transfer learning paradigm, we observed a peak performance with an ACC of 0.725, SPE of 0.763, SEN of 0.678, and an MCC of 0.443.

5.2.5 Comparison to state-of-the-art methods

Table 7: Comparison of performance against other approaches in AD vs. CN task and sMCI vs pMCI

Networks	AD vs. CN				sMCI vs. pMCI
Networks	ACC	SPE	SEN	MCC	ACC	SPE	SEN	MCC
Attention Transformer (Altay et al., 2021)	0.826	0.914	0.717	0.651	0.623	0.665	0.4873	0.238
AwareNet Diagnosis (Wang et al., 2024)	0.832	0.875	0.778	0.659	0.4841	0.774	0.258	0.039
Ours	0.856	0.910	0.792	0.712	0.725	0.763	0.678	0.443
Majority Voting	0.804	0.897	0.688	0.605	0.614	0.601	0.629	0.229
Attention-Guided Majority Voting	0.843	0.894	0.780	0.683	0.633	0.624	0.643	0.266
Majority Voting 3D	0.836	0.867	0.797	0.667	0.629	0.653	0.601	0.254

We performed a comparison with current 2D slice attention-based state-of-the-art models. These methods include Attention Transformer (Altay et al., 2021) and AwareNet (Wang et al., 2024). Attention Transformer uses multi-head self-attention to perform a cross-attention operation that fuses the slice feature maps. AwareNet represents the diagnosis network of a joint learning framework that uses a slice-aware module and a slice-shift module. The slice-aware module leverages the attention mechanism to interpret significant slices and regions. We used the official implementation of AwareNet provided by its authors, whereas we implemented the Attention Transformer architecture ourselves based on the paper details, as the authors did not provide an official implementation. For both methods, the hyperparameters were optimized to maximize performance. As in our case, Attention Transformer obtained the best results by freezing 50% of the base network pre-trained on ImageNet and using a batch size of 8. The AwareNet network was trained from scratch with a batch size of 2. The network is initialized by the Kaiming method (He et al., 2015) and trained using the Adam optimization algorithm with $\beta_{1}=0.48$ and $\beta_{2}=0.999$ . In both cases, the learning rate was set to $1\times 10^{-5}$ to prevent overfitting.

In the sMCI vs. pMCI task, we adopted the double transfer learning strategy proposed in this work for the Attention Transformer method. The first 75% of the Attention Transformer base network layers were frozen. Since the AwareNet is trained from scratch on the AD vs. CN task, we froze the first 50% of the backbone.

The results of our comparative analysis are summarized in Table 7. To assess the effectiveness of our attentional module in enhancing diagnostic performance, we introduced three additional approaches: Majority Voting, Attention-Guided Majority Voting, and Majority Voting 3D. We detailed the Majority Voting approach in Section 5.2.1. The Attention-Guided Majority Voting approach utilizes the network trained in the Majority Voting case. However, predictions are made during testing on a subset of slices identified as most informative by our attentional fusion module. Specifically, we identified a contiguous range where the attention values exceeded the 75th percentile. This range was calculated for the axial plane by averaging the attention weight distribution across the five validation sets of the 5-fold cross-validation, resulting in the range [0, 25]. The results of this method validate the module’s ability to select feature maps with high information content. Finally, the Majority Voting 3D approach replaces the attentional fusion module with an averaging module, where feature maps from individual slices are averaged to form a single feature map for making subject-level decisions. This approach further helps to evaluate the impact of slices’ importance weighting compared to considering all slices as equally important.
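The slice-range selection used by Attention-Guided Majority Voting can be sketched as follows. Several details are assumptions made for the illustration: the percentile uses the nearest-rank definition, the longest contiguous run above the threshold is returned when more than one exists, and the toy distributions (two folds, eight slices) are invented.

```python
import math


def informative_range(fold_distributions, q=75):
    """Inclusive (first, last) slice range whose mean attention exceeds the q-th percentile."""
    n_slices = len(fold_distributions[0])
    n_folds = len(fold_distributions)
    # Average the attention distribution over the folds, slice by slice.
    mean_alpha = [
        sum(dist[i] for dist in fold_distributions) / n_folds
        for i in range(n_slices)
    ]
    # Nearest-rank percentile of the averaged weights.
    ordered = sorted(mean_alpha)
    theta = ordered[max(1, math.ceil(q / 100 * n_slices)) - 1]
    best, start = None, None
    # The trailing -inf sentinel closes a run that reaches the last slice.
    for i, value in enumerate(mean_alpha + [float("-inf")]):
        if value > theta:
            if start is None:
                start = i
        elif start is not None:
            if best is None or (i - 1 - start) > (best[1] - best[0]):
                best = (start, i - 1)
            start = None
    return best


fold_a = [0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02]
fold_b = [0.28, 0.27, 0.13, 0.12, 0.08, 0.06, 0.04, 0.02]
selected = informative_range([fold_a, fold_b], q=75)
print(selected)
```

At test time, only the slice-level predictions inside the returned range would then be passed to the majority vote.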

Our model demonstrates effectiveness in disease diagnosis and prediction tasks, achieving (i) an accuracy (ACC) of 0.856 and a Matthews correlation coefficient (MCC) of 0.712 in the AD vs. CN comparison, and (ii) an ACC of 0.725 and MCC of 0.443 in the sMCI vs. pMCI task. While AwareNet shows promise in the diagnosis task with an ACC of 0.832 and an MCC of 0.659, it struggles in the disease prediction task, yielding an MCC of 0.039, likely due to insufficient data for training. On the other hand, Attention Transformer, leveraging double transfer learning, provides a more balanced alternative, with an MCC of 0.651 in the diagnosis task and an MCC of 0.238 in disease prediction. However, its increased parameterization from self-attention prevents it from bridging the gap with lightweight models like ours and AwareNet in the diagnosis task. Majority Voting, lacking a mechanism to learn correlations between slices, underperforms compared to other methods. When subject-level error evaluation is incorporated through the Majority Voting 3D approach, the MCC improves from 0.605 to 0.667, even surpassing overfitting-prone approaches such as AwareNet and Attention Transformer. Furthermore, leveraging our attentional mechanism to identify informative slices while using networks pre-trained without considering slice relationships yields an MCC of 0.683. This result underscores our model’s ability to select crucial slices for decision-making.

5.3 XAI results

This section analyzes the interpretability of our approach and those proposed in (Wang et al., 2024; Altay et al., 2021). Our XAI method described in Section 4.3 allowed us to produce a 3D attentional map starting from the attentional weight distributions of the axial, coronal, and sagittal planes. The authors of AwareNet (Wang et al., 2024) designed the slice-aware module of this network to extract, as in our case, a distribution of attentional weights capable of summarizing the importance of each slice in the decision-making process. As a result, our approach can produce a 3D map also using the model proposed in (Wang et al., 2024).

A recurring approach to interpreting decisions made by DL models is GradCAM (Selvaraju et al., 2017). GradCAM is a technique for visualizing regions within an image that influence the classification decision of a convolutional neural network. It works by selecting a target convolutional layer, calculating the class score gradients relative to that layer’s feature maps, averaging these gradients, and using them as weights to create a weighted sum of the feature maps. This sum is then passed through a ReLU function to produce a saliency map highlighting the areas that positively impact the class decision. However, in this comparison scenario GradCAM produces 2D saliency maps, one for each slice, which deliver a qualitative but not quantitative result. To allow a fair comparison with other models, we have developed a variant of our XAI approach that allows us to generate 3D maps with GradCAM. The idea is to stack the produced 2D CAMs on top of each other to generate a 3D saliency map for each plane. Given $A$ , $S$ , and $C$ , the 3D saliency maps generated for the axial, sagittal, and coronal planes, respectively, the final 3D saliency map is obtained pointwise as:

M[i,j,k]={S}[i,j,k]\cdot{C}[i,j,k]\cdot{A}[i,j,k]

This 3D map can be used to quantitatively evaluate the impact of each brain area in the decision-making process with the metrics proposed in Section 4.3.3. This method also allows us to compare with the Attention Transformer model proposed in Altay et al. (2021), which is not designed to yield attentional weights for each slice directly. In our implementation, we applied GradCAM to the last layer of the convolutional backbone. This choice is motivated by the fact that the last convolutional layer retains high-level spatial information that is crucial for localizing salient regions in the input image.
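The stacking and pointwise product can be sketched as below. The axis convention is an assumption: each stack is indexed by its own slice position first, so the sagittal stack is read as `[i][j][k]`, the coronal stack as `[j][i][k]`, and the axial stack as `[k][i][j]`. The sketch also assumes the 2D CAMs have already been resized to the slice resolution; the function name and toy values are hypothetical.

```python
def fuse_gradcam(sagittal_cams, coronal_cams, axial_cams):
    """M[i][j][k] = S[i][j][k] * C[i][j][k] * A[i][j][k] from per-plane CAM stacks."""
    ni, nj, nk = len(sagittal_cams), len(coronal_cams), len(axial_cams)
    return [
        [
            [
                # Each stack is indexed by its own slice position first.
                sagittal_cams[i][j][k] * coronal_cams[j][i][k] * axial_cams[k][i][j]
                for k in range(nk)
            ]
            for j in range(nj)
        ]
        for i in range(ni)
    ]


# Toy 2x2x2 case: constant sagittal CAMs, coronal CAMs equal to (j + 1),
# axial CAMs equal to 10 * (k + 1).
sagittal = [[[1, 1], [1, 1]] for _ in range(2)]
coronal = [[[j + 1, j + 1], [j + 1, j + 1]] for j in range(2)]
axial = [[[10 * (k + 1)] * 2, [10 * (k + 1)] * 2] for k in range(2)]
M = fuse_gradcam(sagittal, coronal, axial)
print(M)
```

The fused map can then be min-max normalized and thresholded in the same way as the attention-based map.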

In the following, we show the consistency of the attention weight distributions obtained from our model and AwareNet. Then, we present the results in qualitative and quantitative terms for each model using our XAI method both in the GradCAM case and with the attentional 3D maps.

5.3.1 Attention consistency analysis

Many studies focused on model interpretability by generating saliency maps for randomly chosen test subjects, enabling local qualitative analysis. However, this approach does not support the evaluation of interpretability consistency when the training and testing data vary. To examine this consistency in our approach, we considered the variability of attentional weight distributions across the five different folds used for cross-validation. Our evaluation leverages a clinical hypothesis specific to AD, which is believed to affect similar brain areas across different patients. For each fold, we calculated the weight distributions for each sample in the test set and derived an average distribution specific to that test set. By analyzing these average distributions across the five folds, we could assess whether our interpretability remained consistent, thereby supporting the generalizability of our findings across different subsets of data.
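The per-fold averaging described above can be sketched as follows. The Pearson correlation between fold averages is an assumption added here as one possible numerical summary of consistency; the analysis in this section relies on visual comparison of the distributions, and the toy weights are invented.

```python
import math


def mean_distribution(sample_distributions):
    """Average attention distribution over the test samples of one fold."""
    n = len(sample_distributions)
    return [sum(column) / n for column in zip(*sample_distributions)]


def pearson(x, y):
    """Pearson correlation between two equally long weight vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy attention weights: two test samples per fold, three slices each.
fold_a = mean_distribution([[0.5, 0.3, 0.2], [0.7, 0.2, 0.1]])
fold_b = mean_distribution([[0.6, 0.3, 0.1], [0.4, 0.4, 0.2]])
similarity = pearson(fold_a, fold_b)
print(fold_a, fold_b, similarity)
```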

Fig. 5 presents our diagnosis network’s average attentional weight distributions for each fold in the axial, coronal, and sagittal views. Upon examination of the histograms for each view, we observe a remarkable consistency in the distribution shapes across all five folds, indicating that our interpretability approach is stable despite the variation in the train/test set data. Specifically, the axial distributions reveal a consistent concentration of attentional weights around the initial slices. This trend suggests the model’s recurrent focus on the brain’s inferior regions, notably the areas where degenerative changes first manifest in AD, such as the hippocampus. In the coronal view, attentional weights are notably centered, indicating that the model consistently identifies the central part of the brain as significant. This central focus might correspond to the medial temporal lobe, including the hippocampus and the surrounding regions, further substantiating the axial findings. The sagittal view is the only bimodal distribution, suggesting that the model pinpointed symmetrical areas along this plane. We hypothesize that the network was focusing on the hippocampus since it adheres to all the constraints: situated in the inferior part of the brain, centrally located, and symmetrical. The consistency and specificity of these findings across multiple data folds strengthen the argument that our network could reliably identify specific brain regions as a critical biomarker for distinguishing between AD and CN subjects.

We also analyzed the consistencies of AwareNet distributions (Wang et al., 2024) to compare the robustness and interpretability of different attention mechanisms. However, the average distributions produced by this model, as seen in Fig. 6, present sparsely distributed peaks and do not allow the identification of predominant slice ranges. Furthermore, the distributions between the folds are inconsistent. These results indicate that in this case study, AwareNet could not produce consistent attentional weights useful for contributing to the interpretability of its decisions.

5.4 Qualitative and quantitative results

Table 8: Importance measures of 20 largest brain regions computed with 3D Attention maps. Our model on the left (a); AwareNet on the right (b).

Brain Region	$V_{r}$	$\mu_{r}$	$\sigma_{r}$	$A_{max,r}$	$A_{min,r}$	$P_{r}$
Hippocampus left	1562	0.136	0.139	0.762	0.028	0.333
Hippocampus right	1426	0.126	0.133	0.783	0.028	0.304
Parahippocampal left	688	0.129	0.137	0.884	0.028	0.254
Parahippocampal right	534	0.129	0.148	1.000	0.028	0.197
Amygdala left	480	0.097	0.092	0.620	0.028	0.291
Amygdala right	427	0.095	0.087	0.569	0.028	0.259
Inferior Lateral Ventricle right	232	0.113	0.129	0.677	0.028	0.219
Inferior Lateral Ventricle left	212	0.106	0.105	0.589	0.028	0.200
Cerebellum Gray Matter left	208	0.035	0.005	0.052	0.028	0.003
Lateral Orbitofrontal left	194	0.033	0.004	0.045	0.028	0.013
Fusiform right	184	0.045	0.015	0.107	0.028	0.014
Lateral Orbitofrontal right	140	0.034	0.004	0.046	0.028	0.009
Cerebellum Gray Matter right	119	0.034	0.005	0.054	0.028	0.002
Fusiform left	88	0.040	0.010	0.070	0.028	0.007
Entorhinal left	16	0.034	0.003	0.041	0.030	0.005
Ventral Diencephalon left	6	0.029	0.001	0.031	0.028	0.001
Entorhinal right	2	0.033	0.003	0.036	0.031	0.001
-	-	-	-	-	-	-
-	-	-	-	-	-	-
-	-	-	-	-	-	-

(a)

Brain Region	$V_{r}$	$\mu_{r}$	$\sigma_{r}$	$A_{max,r}$	$P_{r}$
Cerebellum Gray Matter - right	192	0.001	0.003	0.022	0.003
Lateral Occipital - right	173	0.003	0.008	0.067	0.008
Cerebellum Gray Matter - left	132	0.001	0.004	0.035	0.002
Lateral Occipital - left	116	0.002	0.011	0.106	0.005
Fusiform - left	93	0.001	0.002	0.009	0.007
Fusiform - right	74	0.001	0.001	0.007	0.005
Hippocampus - right	74	0.004	0.013	0.076	0.016
Lateral Orbitofrontal - right	70	0.000	0.001	0.003	0.005
Entorhinal - right	62	0.009	0.025	0.117	0.019
Ventral Diencephalon - right	62	0.001	0.003	0.014	0.010
Hippocampus - left	55	0.003	0.012	0.088	0.012
Brainstem - right	53	0.002	0.010	0.076	0.003
Lingual - right	48	0.000	0.001	0.003	0.004
Lateral Orbitofrontal - left	47	0.000	0.001	0.004	0.003
Parahippocampal - right	47	0.000	0.000	0.002	0.017
Cerebellum White Matter - right	36	0.000	0.000	0.002	0.003
Ventral Diencephalon - left	35	0.002	0.005	0.022	0.006
Entorhinal - left	35	0.011	0.042	0.186	0.011
Amygdala - right	35	0.000	0.000	0.002	0.021
Superior Frontal - right	27	0.000	0.000	0.000	0.001

(b)

Table 9: Importance measures of 20 largest brain regions computed with 3D GradCAM maps. Our model on the left (a); Attention Transformer on the right (b).

Brain Region	$V_{r}$	$\mu_{r}$	$\sigma_{r}$	$M_{max,r}$	$M_{min,r}$	$P_{r}$
Hippocampus left	1722	0.874	0.057	0.992	0.773	0.367
Hippocampus right	965	0.835	0.041	0.963	0.773	0.206
Inferior Lateral Ventricle left	648	0.861	0.054	1.000	0.773	0.612
Inferior Lateral Ventricle right	435	0.830	0.038	0.944	0.773	0.411
Superior Temporal left	164	0.810	0.022	0.860	0.774	0.006
Amygdala right	146	0.795	0.015	0.844	0.774	0.088
Superior Temporal right	143	0.816	0.029	0.888	0.774	0.006
Amygdala left	118	0.795	0.021	0.867	0.773	0.072
Middle Temporal left	96	0.825	0.028	0.873	0.774	0.003
Middle Temporal right	95	0.823	0.028	0.887	0.774	0.003
Parahippocampal left	91	0.829	0.038	0.928	0.774	0.034
Entorhinal left	77	0.825	0.032	0.891	0.774	0.024
Fusiform left	65	0.814	0.033	0.903	0.774	0.005
Insula left	8	0.787	0.009	0.806	0.775	0.001
Insula right	7	0.787	0.014	0.809	0.776	0.001
Parahippocampal right	2	0.773	0.000	0.774	0.773	0.001
Ventral Diencephalon left	2	0.779	0.005	0.784	0.774	0.000
Inferior Temporal left	1	0.775	0.000	0.775	0.775	0.000
Putamen left	1	0.776	0.000	0.776	0.776	0.000
-	-	-	-	-	-	-

(a)

Brain Region	$V_{r}$	$\mu_{r}$	$\sigma_{r}$	$M_{max,r}$	$M_{min,r}$	$P_{r}$
Hippocampus - left	1921	0.614	0.089	0.921	0.482	0.409
Hippocampus - right	1728	0.628	0.103	1.000	0.482	0.368
Amygdala - right	409	0.576	0.068	0.779	0.482	0.248
Inferior Lateral Ventricle - right	322	0.625	0.090	0.847	0.482	0.304
Inferior Lateral Ventricle - left	318	0.649	0.088	0.899	0.483	0.300
Amygdala - left	312	0.577	0.065	0.731	0.482	0.189
Parahippocampal - right	176	0.524	0.029	0.599	0.484	0.065
Parahippocampal - left	156	0.527	0.032	0.598	0.482	0.058
Entorhinal - left	132	0.545	0.041	0.651	0.482	0.041
Ventral Diencephalon - left	124	0.574	0.066	0.783	0.483	0.020
Entorhinal - right	106	0.541	0.045	0.663	0.482	0.033
Putamen - left	46	0.549	0.040	0.628	0.484	0.007
Ventral Diencephalon - right	38	0.537	0.050	0.687	0.483	0.006
Pallidum - left	23	0.519	0.025	0.583	0.485	0.014
Fusiform - left	5	0.502	0.013	0.522	0.487	0.000
Fusiform - right	2	0.484	0.000	0.484	0.483	0.000
Putamen - right	2	0.495	0.001	0.496	0.495	0.000
-	-	-	-	-	-	-
-	-	-	-	-	-	-
-	-	-	-	-	-	-

(b)

This subsection examines the visual results and quantitative analysis concerning the brain areas emphasized by each model. Fig. 7, on the left, displays the attentional weight distributions across the three planes, averaged over all five folds of the cross-validation. This averaging provides a comprehensive view of the data distribution across all images in the dataset. Starting from the entire dataset distributions, the 3D attentional map was created as detailed in Section 4.3. The averaged 3D map was enhanced by a factor of 10 and overlaid on the MNI152 template, which is representative of a typical patient’s brain. Combining this template with its corresponding atlas facilitates the identification of regions that, on average, received attention from the models. The right side of Fig. 7 shows the resulting explainable MRI. The visual representation also indicates that the network targets the medial temporal lobe region, as suggested by the distributions. This result is confirmed by the quantitative analysis shown in Table 8, which reports the metrics for the 20 largest regions selected by our model and by AwareNet. As shown in Table 8(a), the three largest regions on which our diagnosis model focuses are the hippocampus, the parahippocampus, and the amygdala. In contrast, the 3D attentional map generated with AwareNet appears to focus on different regions. The first three regions highlighted are the Cerebellum Gray Matter, Lateral Occipital, and Fusiform. The right hippocampus appears only after them. This result also shows that, with the same 99.9th-percentile threshold for binarization, our model highlights a much more localized region. Specifically, our model selects 17 regions with a strong concentration in the top 14. In contrast, AwareNet highlights 88 regions, 68 of which have been omitted from Table 8.
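
The thresholding and per-region tabulation described above can be sketched as follows. The column meanings are inferred from the table headers and are assumptions, not the definitions of Section 4.3: $V_{r}$ is taken as the number of voxels of region $r$ that survive the percentile threshold, $\mu_{r}$, $\sigma_{r}$, $A_{max,r}$ and $A_{min,r}$ as statistics over those voxels, and $P_{r}$ as the fraction of the region's voxels that survive. The function `region_importance` and the toy atlas are hypothetical.

```python
import numpy as np

def region_importance(attention_map, atlas, percentile=99.9):
    """Per-region statistics of the voxels that survive percentile thresholding.

    attention_map: 3D float array (e.g. an averaged 3D attention map).
    atlas: 3D integer array of the same shape; label 0 is treated as background.
    Returns {label: dict(V, mu, sigma, a_max, a_min, P)} for regions with V > 0.
    """
    threshold = np.percentile(attention_map, percentile)
    selected = attention_map >= threshold  # binarised map
    results = {}
    for label in np.unique(atlas):
        if label == 0:
            continue
        region = atlas == label
        values = attention_map[region & selected]
        if values.size == 0:
            continue
        results[int(label)] = {
            "V": int(values.size),
            "mu": float(values.mean()),
            "sigma": float(values.std()),
            "a_max": float(values.max()),
            "a_min": float(values.min()),
            "P": values.size / int(region.sum()),
        }
    return results

# Toy example with a hypothetical two-region atlas.
rng = np.random.default_rng(0)
attention = rng.random((40, 40, 40)) * 0.01
atlas = np.zeros((40, 40, 40), dtype=int)
atlas[5:15, 5:15, 5:15] = 1      # region 1
atlas[20:30, 20:30, 20:30] = 2   # region 2
attention[8:12, 8:12, 8:12] += 1.0  # strong attention inside region 1 only

stats = region_importance(attention, atlas)
for label, s in sorted(stats.items(), key=lambda kv: -kv[1]["V"]):
    print(label, s)
```

Sorting by `V` reproduces the ordering used in the tables, and counting the keys of the returned dictionary gives the number of regions a model highlights at a given threshold.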

We also examined the interpretability of the Attention Transformer model proposed in (Altay et al., 2021). For comparison, we created 2D saliency maps using the GradCAM++ algorithm (Chattopadhay et al., 2018) across all three views and all five test sets from the different folds. These maps were combined to create a unified average 3D saliency map as outlined in Section 7. The same procedure was applied to the saliency maps produced with our diagnostic model. As shown in Fig. 8, the 2D maps produced by our method are generally sparser than those from the Attention Transformer. The method introduced in (Altay et al., 2021) employs a cross-attention mechanism via a Multi-Head. It is therefore plausible that the Multi-Head allows the generation of 2D maps that align more meaningfully within the 3D context. This finding suggests that our approach may consider less contextual information from adjacent slices unless it is particularly relevant. In contrast, the cross-attention in the Attention Transformer might enable a more cohesive representation of the entire 3D space by considering both the local features within slices and their contextual interactions. This behavior is further clarified by creating 3D maps and overlaying them on the MNI152 template, as was done for the attentional maps. As illustrated in Fig. 9 on the left, the 3D maps created using our model cover a broader and less concentrated area than those produced by the Attention Transformer, which are shown on the right. However, as with the 3D attentional maps, both models predominantly focus on an area surrounding the hippocampus. As detailed in Table 9, both models identify key areas, such as the hippocampus and the amygdala, as significant. However, the emphasis on other regions varies markedly between the two.
In the Attention Transformer model, there is a noticeable focus on the inferior lateral ventricles and the parahippocampal region, areas less emphasized by our model in this case. This result indicates that the Attention Transformer, using cross-attention in combination with GradCAM, can produce results similar to those obtained by our method with a 3D attentional map. As seen in Tables 8(a) and 9(b), the first four areas on which our model focused with our approach are the same as those on which the Attention Transformer focused with GradCAM. In contrast, our model with GradCAM shows broader involvement of regions such as the superior and middle temporal areas, which are not as prominent in the other cases.
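
Combining per-slice 2D saliency maps from the three views into one 3D map can be sketched as below. This is a minimal illustration under stated assumptions, not the exact procedure used in this study: each view's 2D maps are stacked along its slicing axis, the three resulting volumes are averaged, and the result is divided by its maximum. In practice the same step would be repeated per subject and fold and the volumes averaged again. The function names and the toy data are hypothetical.

```python
import numpy as np

def volume_from_slices(slice_maps, axis):
    """Stack per-slice 2D saliency maps back into a 3D volume along `axis`."""
    return np.stack(slice_maps, axis=axis)

def average_3d_saliency(per_view_slice_maps):
    """Average the volumes rebuilt from each view, then scale so the peak is 1.

    per_view_slice_maps: dict mapping a slicing axis (0, 1, or 2) to the list
    of 2D maps produced for the slices taken along that axis.
    """
    volumes = [volume_from_slices(maps, axis)
               for axis, maps in per_view_slice_maps.items()]
    mean_volume = np.mean(volumes, axis=0)
    peak = mean_volume.max()
    return mean_volume / peak if peak > 0 else mean_volume

# Toy volume standing in for one subject's saliency; slicing it along each
# axis mimics the 2D maps that a per-view model would produce.
rng = np.random.default_rng(0)
toy = rng.random((6, 7, 8))
views = {
    0: [toy[i, :, :] for i in range(toy.shape[0])],
    1: [toy[:, j, :] for j in range(toy.shape[1])],
    2: [toy[:, :, k] for k in range(toy.shape[2])],
}
saliency_3d = average_3d_saliency(views)
print(saliency_3d.shape)  # (6, 7, 8)
```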

6 Discussion

This section discusses the main findings of our study, including the prevalence of VGG architectures, the efficacy of double transfer learning, the effectiveness of our method against state-of-the-art ones, and the implications of our XAI in the context of medical imaging and diagnostics. Additionally, we thoroughly discuss the limitations and potential areas for further investigation in our study.

6.1 VGG for AD diagnosis

Several studies on AD have suggested using models from the VGG architecture family (Hu et al., 2023a, b; Mehmood et al., 2021). The selection of this family is supported by its effective performance in medical imaging, particularly when using transfer learning techniques (Mehmood et al., 2021). While it is difficult to justify the clear superiority of this relatively simple architecture over more sophisticated ones such as EfficientNet and ResNet, its structure and pre-training stage may be the reason. VGG models such as VGG16 include a large number of parameters, approximately 138 million, with about 124 million dedicated to the fully connected classification layers and the remaining 15 million or so to the convolutional layers. This structure suggests that most of the specific pattern recognition capabilities for class distinction learned from ImageNet likely reside in the classifier module. Therefore, the convolutional part may have acquired more general and transferable features than those in ResNet and EfficientNet, which use only a single fully connected layer for classification.
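
The split between classifier and convolutional parameters can be checked with a short tally. The sketch below assumes the standard VGG16 configuration pre-trained on ImageNet: thirteen 3x3 convolutional layers with biases, a 224x224 input that yields a 7x7x512 feature map, and three fully connected layers ending in 1000 classes. It does not describe the modified network used in this study.

```python
# Standard VGG16: output channels of the 13 convolutional layers, all 3x3.
CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
# Three fully connected layers, assuming a 7x7x512 feature map and 1000 classes.
FC_SIZES = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

def conv_parameters(channels, in_channels=3, kernel=3):
    """Weights plus biases of a chain of square-kernel convolutions."""
    total = 0
    for out_channels in channels:
        total += kernel * kernel * in_channels * out_channels + out_channels
        in_channels = out_channels
    return total

def fc_parameters(sizes):
    """Weights plus biases of a chain of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in sizes)

conv_total = conv_parameters(CONV_CHANNELS)
fc_total = fc_parameters(FC_SIZES)
print(f"convolutional: {conv_total:,}")
print(f"classifier:    {fc_total:,}")
print(f"total:         {conv_total + fc_total:,}")
```

Under these assumptions the fully connected layers hold the large majority of the parameters, which is the structural point the paragraph above relies on.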

6.2 Transfer domain knowledge for AD prediction

This study’s double transfer learning strategy enhanced model performance on the AD prediction task. The findings indicate that incorporating this additional domain-specific knowledge allows the model to more effectively differentiate subtle variations in brain morphology associated with early MCI progression. Such a strategy addresses the challenges posed by limited training data, which is a common issue in medical imaging tasks due to privacy concerns and the difficulty of obtaining large annotated datasets. However, it is crucial to note that the model’s performance, while promising, needs further improvement. The modest MCC values indicate that the model, although capable of identifying relevant patterns, may still struggle with generalization across diverse patient profiles or imaging conditions. This limitation underscores the need to further refine the model’s architecture and training regimen, possibly by integrating richer datasets or applying more sophisticated image augmentation techniques to enhance its robustness and clinical applicability. We did not perform an XAI evaluation on this task because of the model’s evident explanatory limitations, underscored by its modest correlation metric. As the model’s predictive accuracy and reliability improve, integrating advanced XAI techniques to provide deeper insights into the decision-making process will become essential. This advancement will not only increase transparency but could also contribute to the discovery of new structural brain markers that will help predict the disease well in advance.
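
To make concrete why an accuracy above 0.7 can coexist with a modest MCC, the sketch below evaluates both metrics on a confusion matrix. The counts are hypothetical, chosen only for illustration on a balanced set of 200 subjects, and are not taken from the experiments of this study.

```python
import math

def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts (not results from this study).
tp, fn = 70, 30  # progressive MCI classified correctly / incorrectly
tn, fp = 75, 25  # stable MCI classified correctly / incorrectly

accuracy = (tp + tn) / (tp + fp + fn + tn)
mcc = matthews_corrcoef(tp, fp, fn, tn)
print(f"accuracy = {accuracy:.3f}, MCC = {mcc:.3f}")
```

Because MCC accounts for all four cells of the confusion matrix, it stays well below the accuracy whenever a substantial share of either class is misclassified.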

6.3 Comparison with other attention-based methods

The evaluation of our proposed model against state-of-the-art attention-based techniques reveals several noteworthy insights, particularly regarding the use of a more straightforward, lightweight architecture in medical imaging tasks constrained by limited data availability. Our model’s performance suggests that a meticulously designed, less complex network can rival or surpass more intricate systems in diagnosis and explainability. The successful application of 2D CNNs with attention mechanisms in tasks traditionally reserved for 3D CNNs is particularly intriguing. It suggests that, with the proper architectural considerations and training strategies, simpler 2D networks can effectively extract and utilize the spatial information necessary for accurate diagnosis. This finding could significantly impact the computational efficiency and accessibility of deep learning models for medical imaging, enabling their deployment in more diverse clinical environments with varying resource availability.

6.4 XAI Evaluation and Interpretability

Our analysis of XAI methods has showcased both the potential and the challenges of interpreting DL models. Our findings show that the Attention Transformer model with GradCAM and our model, both with and without GradCAM, consistently highlighted regions such as the hippocampus, parahippocampal gyrus, amygdala, and ventricles, which are well documented in the literature as being affected by AD (Rao et al., 2022; Van Hoesen et al., 2000; Poulin et al., 2011; Ferrarini et al., 2006). This encouraging consistency could reassure clinicians about the neural network’s focus and diagnostic relevance. Of the two best results obtained, our model, employing a 3D attention map, demonstrated a more localized focus than the Attention Transformer model using a 3D GradCAM saliency map. While a more localized interpretation might intuitively seem to offer a clearer insight into the neural network’s diagnostic process, it does not necessarily equate to a more accurate or useful diagnostic tool. The degree of localization is just one of many factors to consider, and more localization does not automatically imply superior interpretability.

A comprehensive comparison of XAI methods and models must incorporate clinical expertise. Neurologists and radiologists play a pivotal role in interpreting these XAI outputs, as their expertise in recognizing AD-specific biomarkers is crucial for validating the clinical relevance of the areas highlighted by the AI models. However, involving clinical experts in the evaluation process presents its own set of challenges. It requires access to a panel of specialists willing to participate in such studies and a methodological framework that allows their insights to be aggregated into statistically significant conclusions. This process can be resource-intensive and difficult to implement across different studies, making it a less feasible option for consistent use. Ideally, metrics should be developed to simplify comparability across models and XAI methods in medical imaging. Such metrics could evaluate the relevance of highlighted regions within the context of a pathology. They should assess the localization and focus of maps and correlate these aspects with the known pathological features of the disease. Establishing such metrics would make it easier to compare and validate XAI approaches.
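
As a purely hypothetical illustration of the kind of metric envisaged here, and not a validated or proposed measure, one could compute the share of a map's total mass that falls inside atlas regions with documented involvement in the pathology. The function name, the choice of reference labels, and the toy data are all assumptions.

```python
import numpy as np

def attention_mass_in_regions(saliency_map, atlas, reference_labels):
    """Share of the total non-negative map mass that falls inside reference regions."""
    weights = np.clip(saliency_map, 0, None)
    total = weights.sum()
    if total == 0:
        return 0.0
    inside = np.isin(atlas, list(reference_labels))
    return float(weights[inside].sum() / total)

# Toy example: label 1 plays the role of a region with documented involvement.
atlas = np.zeros((10, 10, 10), dtype=int)
atlas[2:5, 2:5, 2:5] = 1
focused = np.zeros(atlas.shape)
focused[2:5, 2:5, 2:5] = 1.0    # all mass inside the reference region
diffuse = np.ones(atlas.shape)  # mass spread uniformly over the volume

print(attention_mass_in_regions(focused, atlas, {1}))  # 1.0
print(attention_mass_in_regions(diffuse, atlas, {1}))  # 0.027
```

Such a score rewards focus on known regions only, so on its own it could not credit a model for highlighting a genuinely new marker; any practical metric would still need clinical validation.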

6.5 Limitations

In this section, we discuss several limitations identified in our study that warrant further exploration to enhance the robustness and applicability of our model. Firstly, the effectiveness of our method was assessed within a relatively limited dataset scenario, utilizing the ADNI1: Complete 1Yr 1.5T collection. While the model demonstrated promising results in this context, its performance against state-of-the-art models on larger datasets remains to be determined. Expanding the scope of testing to include a broader range of datasets with varying characteristics could help establish the scalability and general effectiveness of the proposed method.

Secondly, our XAI approach requires training three separate models, each tailored to one of the principal anatomical planes: axial, coronal, and sagittal. This requirement increases the computational demand and complexity of the training process since each model must be individually optimized and evaluated. Lastly, our approach lacks integration between the models corresponding to different anatomical planes. Currently, each model is trained independently, without considering the inherent correlations between these planes. This separation of the training process potentially overlooks critical spatial information that could be utilized to enhance diagnostic accuracy and explainability. Future improvements could involve the development of integrated multi-view learning strategies, which could simultaneously process and cross-validate information across different planes, offering a more comprehensive understanding of the imaging data.

To evaluate the consistency of interpretability in our approach, we examined the variability of attentional weight distributions across the five folds used for cross-validation. Our evaluation leverages a clinical hypothesis specific to AD, which is believed to affect similar brain areas across different patients (Rao et al., 2022). However, we acknowledge that this assumption of spatial consistency may not hold for other diseases, e.g., cancer, where affected areas can vary significantly among patients. Therefore, this evaluation cannot be generalized a priori but is applicable in specific cases like AD, where the disease is expected to impact similar regions in different patients.

7 Conclusions

In this study, we presented an innovative method for AD diagnosis using pre-trained 2D CNNs to classify 3D volumes. Our approach integrates an attention mechanism to enhance the interpretability and accuracy of diagnosing AD and differentiating stable from progressive mild cognitive impairment. Our method outperformed traditional baseline methods, achieving an MCC of 0.712 for distinguishing AD from CN subjects and 0.443 for distinguishing sMCI from pMCI subjects. These results demonstrate the capability of our approach to classify these conditions effectively. A novel aspect of our work is the enhancement of model explainability. We successfully implemented voxel-level attention activation maps highlighting specific brain areas implicated in AD, such as the hippocampus, amygdala, parahippocampus, and inferior lateral ventricles. These regions are known to be crucial in AD pathology, making our model’s outputs medically relevant. Our approach also includes a double transfer learning strategy that leverages pre-trained models to improve performance on limited datasets. This strategy utilizes knowledge transfer from the AD vs. CN task to enhance the model’s sensitivity to subtle morphological changes associated with the progression of MCI, which are often hard to detect. In conclusion, our method advances diagnostic capabilities using 3D MRI scans for early AD detection and seeks to address the critical need for explainability in medical imaging AI applications. By providing insights into the model’s decision-making process, our approach helps bridge the gap between AI tools and clinical usability, making it a valuable asset for neurodegenerative disease research and potentially aiding in the clinical diagnosis and monitoring of AD progression.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work the authors used ChatGPT in order to improve language and readability. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Acknowledgements

Project ECS 0000024 “Ecosistema dell’innovazione - Rome Technopole” financed by EU in NextGenerationEU plan through MUR Decree n. 1051 23.06.2022 PNRR Missione 4 Componente 2 Investimento 1.5 - CUP H33C22000420001

References

Altay et al. (2021) Altay, F., Sánchez, G.R., James, Y., Faraone, S.V., Velipasalar, S., Salekin, A., 2021. Preclinical stage alzheimer’s disease detection using magnetic resonance image scans, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 15088–15097.
Avants et al. (2008) Avants, B.B., Epstein, C.L., Grossman, M., Gee, J.C., 2008. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis 12, 26–41.
Avants et al. (2014) Avants, B.B., Tustison, N.J., Stauffer, M., Song, G., Wu, B., Gee, J.C., 2014. The insight toolkit image registration framework. Frontiers in neuroinformatics 8, 44.
Basaia et al. (2019) Basaia, S., Agosta, F., Wagner, L., Canu, E., Magnani, G., Santangelo, R., Filippi, M., Initiative, A.D.N., et al., 2019. Automated classification of alzheimer’s disease and mild cognitive impairment using a single mri and deep neural networks. NeuroImage: Clinical 21, 101645.
Brookmeyer et al. (2007) Brookmeyer, R., Johnson, E., Ziegler-Graham, K., Arrighi, H.M., 2007. Forecasting the global burden of alzheimer’s disease. Alzheimer’s & dementia 3, 186–191.
Carcagnì et al. (2023) Carcagnì, P., Leo, M., Del Coco, M., Distante, C., De Salve, A., 2023. Convolution neural networks and self-attention learners for alzheimer dementia diagnosis from brain mri. Sensors 23, 1694.
Chattopadhay et al. (2018) Chattopadhay, A., Sarkar, A., Howlader, P., Balasubramanian, V.N., 2018. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE. URL: http://dx.doi.org/10.1109/WACV.2018.00097, doi:10.1109/wacv.2018.00097.
Chicco and Jurman (2020) Chicco, D., Jurman, G., 2020. The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC genomics 21, 1–13.
Dosovitskiy et al. (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 .
Duchesne et al. (2008) Duchesne, S., Caroli, A., Geroldi, C., Barillot, C., Frisoni, G.B., Collins, D.L., 2008. Mri-based automated computer classification of probable ad versus normal controls. IEEE transactions on medical imaging 27, 509–520.
Ebrahimi et al. (2021) Ebrahimi, A., Luo, S., Chiong, R., Initiative, A.D.N., et al., 2021. Deep sequence modelling for alzheimer’s disease detection using mri. Computers in Biology and Medicine 134, 104537.
Ewers et al. (2011) Ewers, M., Sperling, R.A., Klunk, W.E., Weiner, M.W., Hampel, H., 2011. Neuroimaging markers for the prediction and early diagnosis of alzheimer’s disease dementia. Trends in neurosciences 34, 430–442.
Falahati et al. (2014) Falahati, F., Westman, E., Simmons, A., 2014. Multivariate data analysis and machine learning in alzheimer’s disease with a focus on structural magnetic resonance imaging. Journal of Alzheimer’s disease 41, 685–708.
Feng et al. (2022) Feng, X., Provenzano, F.A., Small, S.A., Initiative, A.D.N., 2022. A deep learning mri approach outperforms other biomarkers of prodromal alzheimer’s disease. Alzheimer’s Research & Therapy 14, 45.
Ferrarini et al. (2006) Ferrarini, L., Palm, W.M., Olofsen, H., van Buchem, M.A., Reiber, J.H., Admiraal-Behloul, F., 2006. Shape differences of the brain ventricles in alzheimer’s disease. Neuroimage 32, 1060–1069.
Fonov et al. (2011) Fonov, V., Evans, A.C., Botteron, K., Almli, C.R., McKinstry, R.C., Collins, D.L., Group, B.D.C., et al., 2011. Unbiased average age-appropriate atlases for pediatric studies. Neuroimage 54, 313–327.
Fonov et al. (2009) Fonov, V.S., Evans, A.C., McKinstry, R.C., Almli, C.R., Collins, D., 2009. Unbiased nonlinear average age-appropriate brain templates from birth to adulthood. NeuroImage 47, S102.
Frisoni et al. (2010) Frisoni, G.B., Fox, N.C., Jack Jr, C.R., Scheltens, P., Thompson, P.M., 2010. The clinical use of structural mri in alzheimer disease. Nature Reviews Neurology 6, 67–77.
Goenka et al. (2022) Goenka, N., Goenka, A., Tiwari, S., 2022. Patch-based classification for alzheimer disease using smri, in: 2022 International Conference on Emerging Smart Computing and Informatics (ESCI), IEEE. pp. 1–5.
Gorgolewski et al. (2016) Gorgolewski, K.J., Auer, T., Calhoun, V.D., Craddock, R.C., Das, S., Duff, E.P., Flandin, G., Ghosh, S.S., Glatard, T., Halchenko, Y.O., et al., 2016. The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Scientific data 3, 1–9.
Haller et al. (2011) Haller, S., Lovblad, K.O., Giannakopoulos, P., 2011. Principles of classification analyses in mild cognitive impairment (mci) and alzheimer disease. Journal of Alzheimer’s Disease 26, 389–394.
He et al. (2015) He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE international conference on computer vision, pp. 1026–1034.
He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society. pp. 770–778. doi:10.1109/CVPR.2016.90.
Hon and Khan (2017) Hon, M., Khan, N.M., 2017. Towards alzheimer’s disease classification through transfer learning, in: 2017 IEEE International conference on bioinformatics and biomedicine (BIBM), IEEE. pp. 1166–1169.
Hu et al. (2023a) Hu, Z., Li, Y., Wang, Z., Zhang, S., Hou, W., Initiative, A.D.N., et al., 2023a. Conv-swinformer: Integration of cnn and shift window attention for alzheimer’s disease classification. Computers in Biology and Medicine 164, 107304.
Hu et al. (2023b) Hu, Z., Wang, Z., Jin, Y., Hou, W., 2023b. Vgg-tswinformer: Transformer-based deep learning model for early alzheimer’s disease prediction. Computer Methods and Programs in Biomedicine 229, 107291.
Huang et al. (2017) Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q., 2017. Densely connected convolutional networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708.
Jack et al. (2003) Jack, C., Slomkowski, M., Gracon, S., Hoover, T., Felmlee, J., Stewart, K., Xu, Y., Shiung, M., O’brien, P., Cha, R., et al., 2003. Mri as a biomarker of disease progression in a therapeutic trial of milameline for ad. Neurology 60, 253–260.
Jenkinson et al. (2012) Jenkinson, M., Beckmann, C.F., Behrens, T.E., Woolrich, M.W., Smith, S.M., 2012. Fsl. Neuroimage 62, 782–790.
Jetley et al. (2018) Jetley, S., Lord, N.A., Lee, N., Torr, P.H., 2018. Learn to pay attention. arXiv preprint arXiv:1804.02391 .
Jin et al. (2019) Jin, D., Xu, J., Zhao, K., Hu, F., Yang, Z., Liu, B., Jiang, T., Liu, Y., 2019. Attention-based 3d convolutional network for alzheimer’s disease diagnosis and biomarkers exploration, in: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), IEEE. pp. 1047–1051.
Kang et al. (2021) Kang, W., Lin, L., Zhang, B., Shen, X., Wu, S., Initiative, A.D.N., et al., 2021. Multi-model and multi-slice ensemble learning architecture based on 2d convolutional neural networks for alzheimer’s disease diagnosis. Computers in Biology and Medicine 136, 104678.
Klaiber et al. (2021) Klaiber, M., Sauter, D., Baumgartl, H., Buettner, R., 2021. A systematic literature review on transfer learning for 3d-cnns, in: 2021 international joint conference on neural networks (IJCNN), IEEE. pp. 1–10.
Klöppel et al. (2008) Klöppel, S., Stonnington, C.M., Chu, C., Draganski, B., Scahill, R.I., Rohrer, J.D., Fox, N.C., Jack Jr, C.R., Ashburner, J., Frackowiak, R.S., 2008. Automatic classification of mr scans in alzheimer’s disease. Brain 131, 681–689.
Kushol et al. (2022) Kushol, R., Masoumzadeh, A., Huo, D., Kalra, S., Yang, Y.H., 2022. Addformer: Alzheimer’s disease detection from structural mri using fusion transformer, in: 2022 IEEE 19th International Symposium On Biomedical Imaging (ISBI), IEEE. pp. 1–5.
Lam et al. (2013) Lam, B., Masellis, M., Freedman, M., Stuss, D.T., Black, S.E., 2013. Clinical, imaging, and pathological heterogeneity of the alzheimer’s disease syndrome. Alzheimer’s research & therapy 5, 1–14.
Liu et al. (2023) Liu, F., Yuan, S., Li, W., Xu, Q., Sheng, B., 2023. Patch-based deep multi-modal learning framework for alzheimer’s disease diagnosis using multi-view neuroimaging. Biomedical Signal Processing and Control 80, 104400.
Loewenstein et al. (2006) Loewenstein, D.A., Acevedo, A., Agron, J., Issacson, R., Strauman, S., Crocco, E., Barker, W.W., Duara, R., 2006. Cognitive profiles in alzheimer’s disease and in mild cognitive impairment of different etiologies. Dementia and geriatric cognitive disorders 21, 309–315.
Mehmood et al. (2021) Mehmood, A., Yang, S., Feng, Z., Wang, M., Ahmad, A.S., Khan, R., Maqsood, M., Yaqub, M., 2021. A transfer learning approach for early diagnosis of alzheimer’s disease on mri images. Neuroscience 460, 43–52.
Nicoll et al. (2019) Nicoll, J.A., Buckland, G.R., Harrison, C.H., Page, A., Harris, S., Love, S., Neal, J.W., Holmes, C., Boche, D., 2019. Persistent neuropathological effects 14 years following amyloid- $\beta$ immunization in alzheimer’s disease. Brain 142, 2113–2126.
Pan et al. (2020) Pan, D., Zeng, A., Jia, L., Huang, Y., Frizzell, T., Song, X., 2020. Early detection of alzheimer’s disease using magnetic resonance imaging: a novel approach combining convolutional neural networks and ensemble learning. Frontiers in neuroscience 14, 259.
Park et al. (2023) Park, C., Jung, W., Suk, H.I., 2023. Deep joint learning of pathological region localization and alzheimer’s disease diagnosis. Scientific reports 13, 11664.
Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al., 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32.
Poulin et al. (2011) Poulin, S.P., Dautoff, R., Morris, J.C., Barrett, L.F., Dickerson, B.C., Initiative, A.D.N., et al., 2011. Amygdala atrophy is prominent in early alzheimer’s disease and relates to symptom severity. Psychiatry Research: Neuroimaging 194, 7–13.
Qiu et al. (2020) Qiu, S., Joshi, P.S., Miller, M.I., Xue, C., Zhou, X., Karjadi, C., Chang, G.H., Joshi, A.S., Dwyer, B., Zhu, S., et al., 2020. Development and validation of an interpretable deep learning framework for alzheimer’s disease classification. Brain 143, 1920–1933.
Rao et al. (2021) Rao, Y., Zhao, W., Zhu, Z., Lu, J., Zhou, J., 2021. Global filter networks for image classification. Advances in neural information processing systems 34, 980–993.
Rao et al. (2022) Rao, Y.L., Ganaraja, B., Murlimanju, B., Joy, T., Krishnamurthy, A., Agrawal, A., 2022. Hippocampus and its involvement in alzheimer’s disease: a review. 3 Biotech 12, 55.
Rathore et al. (2017) Rathore, S., Habes, M., Iftikhar, M.A., Shacklett, A., Davatzikos, C., 2017. A review on neuroimaging-based classification studies and associated feature extraction methods for alzheimer’s disease and its prodromal stages. NeuroImage 155, 530–548.
Routier et al. (2021) Routier, A., Burgos, N., Díaz, M., Bacci, M., Bottani, S., El-Rifai, O., Fontanella, S., Gori, P., Guillon, J., Guyot, A., et al., 2021. Clinica: An open-source software platform for reproducible clinical neuroscience studies. Frontiers in Neuroinformatics 15, 689675.
Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115, 211–252.
Samper-González et al. (2018) Samper-González, J., Burgos, N., Bottani, S., Fontanella, S., Lu, P., Marcoux, A., Routier, A., Guillon, J., Bacci, M., Wen, J., et al., 2018. Reproducible evaluation of classification methods in alzheimer’s disease: Framework and application to mri and pet data. NeuroImage 183, 504–521.
Schlemper et al. (2019) Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., Rueckert, D., 2019. Attention gated networks: Learning to leverage salient regions in medical images. Medical image analysis 53, 197–207.
Selvaraju et al. (2017) Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE international conference on computer vision, pp. 618–626.
Simonyan and Zisserman (2014) Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. URL: http://arxiv.org/abs/1409.1556, arXiv:1409.1556.
Singh et al. (2020) Singh, A., Sengupta, S., Lakshminarayanan, V., 2020. Explainable deep learning models in medical image analysis. Journal of imaging 6, 52.
Smith (2002) Smith, S.M., 2002. Fast robust automated brain extraction. Human brain mapping 17, 143–155.
Tan and Le (2021) Tan, M., Le, Q.V., 2021. Efficientnetv2: Smaller models and faster training. arXiv preprint arXiv:2104.00298.
Tanveer et al. (2021) Tanveer, M., Rashid, A.H., Ganaie, M., Reza, M., Razzak, I., Hua, K.L., 2021. Classification of alzheimer’s disease using ensemble of deep neural networks trained through transfer learning. IEEE Journal of Biomedical and Health Informatics 26, 1453–1463.
Tustison et al. (2010) Tustison, N.J., Avants, B.B., Cook, P.A., Zheng, Y., Egan, A., Yushkevich, P.A., Gee, J.C., 2010. N4itk: Improved n3 bias correction. IEEE Transactions on Medical Imaging 29, 1310–1320. doi:10.1109/TMI.2010.2046908.
Van Hoesen et al. (2000) Van Hoesen, G.W., Augustinack, J.C., Dierking, J., Redman, S.J., Thangavel, R., 2000. The parahippocampal gyrus in alzheimer’s disease: clinical and preclinical neuroanatomical correlates. Annals of the New York Academy of Sciences 911, 254–274.
Van der Velden et al. (2022) Van der Velden, B.H., Kuijf, H.J., Gilhuijs, K.G., Viergever, M.A., 2022. Explainable artificial intelligence (xai) in deep learning-based medical image analysis. Medical Image Analysis 79, 102470.
Vemuri et al. (2009) Vemuri, P., Wiste, H., Weigand, S., Shaw, L., Trojanowski, J., Weiner, M., Knopman, D., Petersen, R., Jack, C., et al., 2009. Mri and csf biomarkers in normal, mci, and ad subjects: diagnostic discrimination and cognitive correlations. Neurology 73, 287–293.
Venugopalan et al. (2021) Venugopalan, J., Tong, L., Hassanzadeh, H.R., Wang, M.D., 2021. Multimodal deep learning models for early detection of alzheimer’s disease stage. Scientific reports 11, 3254.
Viswan et al. (2024) Viswan, V., Shaffi, N., Mahmud, M., Subramanian, K., Hajamohideen, F., 2024. Explainable artificial intelligence in alzheimer’s disease classification: A systematic review. Cognitive Computation 16, 1–44.
Wang et al. (2024) Wang, C., Piao, S., Huang, Z., Gao, Q., Zhang, J., Li, Y., Shan, H., Alzheimer’s Disease Neuroimaging Initiative, 2024. Joint learning framework of cross-modal synthesis and diagnosis for alzheimer’s disease by mining underlying shared modality information. Medical Image Analysis 91, 103032.
Wang et al. (2018) Wang, S.H., Phillips, P., Sui, Y., Liu, B., Yang, M., Cheng, H., 2018. Classification of alzheimer’s disease based on eight-layer convolutional neural network with leaky rectified linear unit and max pooling. Journal of medical systems 42, 1–11.
Wen et al. (2020) Wen, J., Thibeau-Sutre, E., Diaz-Melo, M., Samper-González, J., Routier, A., Bottani, S., Dormont, D., Durrleman, S., Burgos, N., Colliot, O., et al., 2020. Convolutional neural networks for classification of alzheimer’s disease: Overview and reproducible evaluation. Medical image analysis 63, 101694.
Winblad et al. (2016) Winblad, B., Amouyel, P., Andrieu, S., Ballard, C., Brayne, C., Brodaty, H., Cedazo-Minguez, A., Dubois, B., Edvardsson, D., Feldman, H., et al., 2016. Defeating alzheimer’s disease and other dementias: a priority for european science and society. The Lancet Neurology 15, 455–532.
Wong (2020) Wong, W., 2020. Economic burden of alzheimer disease and managed care considerations. The American journal of managed care 26, S177–S183.
Wu et al. (2022) Wu, Y., Zhou, Y., Zeng, W., Qian, Q., Song, M., 2022. An attention-based 3d cnn with multi-scale integration block for alzheimer’s disease classification. IEEE Journal of Biomedical and Health Informatics 26, 5665–5673.
Wyman et al. (2013) Wyman, B.T., Harvey, D.J., Crawford, K., Bernstein, M.A., Carmichael, O., Cole, P.E., Crane, P.K., DeCarli, C., Fox, N.C., Gunter, J.L., et al., 2013. Standardization of analysis sets for reporting results from adni mri data. Alzheimer’s & Dementia 9, 332–337.
Yarkoni et al. (2019) Yarkoni, T., Markiewicz, C.J., de la Vega, A., Gorgolewski, K.J., Salo, T., Halchenko, Y.O., McNamara, Q., DeStasio, K., Poline, J.B., Petrov, D., Hayot-Sasson, V., Nielson, D.M., Carlin, J., Kiar, G., Whitaker, K., DuPre, E., Wagner, A., Tirrell, L.S., Jas, M., Hanke, M., Poldrack, R.A., Esteban, O., Appelhoff, S., Holdgraf, C., Staden, I., Thirion, B., Kleinschmidt, D.F., Lee, J.A., di Castello, M.V.O., Notter, M.P., Blair, R., 2019. Pybids: Python tools for bids datasets. Journal of Open Source Software 4, 1294. URL: https://doi.org/10.21105/joss.01294, doi:10.21105/joss.01294.
Yarkoni et al. (2023) Yarkoni, T., Markiewicz, C.J., de la Vega, A., Gorgolewski, K.J., Salo, T., Halchenko, Y.O., Papadopoulos Orfanos, D., Esteban, O., Gau, R., McNamara, Q., DeStasio, K., Poline, J.B., Johnson, H., Kalenkovich, E., Petrov, D., Nielson, D.M., Kent, J., Kent, J.D., Appelhoff, S., Van Dyken, P., Goncalves, M., Bansal, S., Hayot-Sasson, V., Carlin, J., Kiar, G., Whitaker, K., Ghosh, S., Wagner, A., DuPre, E., Janke, A., Ivanov, A., Gillman, A., Wennberg, J., Tirrell, L.S., Tilley II, S., Li, A., Legarreta, J.H., Waller, L., Jas, M., Hanke, M., Guenther, N., Poldrack, R., Rokem, A., Boulay, C., Mumford, J., Thual, A., Holdgraf, C., Staden, I., Staph, J.A., Drew, W., Sinha, A., Rovai, A., Adebimpe, A., Thirion, B., Kleinschmidt, D.F., Dickie, E.W., Ben-Zvi, G., Lee, J.A., Kruper, J., Visconti di Oleggio Castello, M., Notter, M.P., Roca, P., Blair, R., Pati, S., Sundaravadivelu, S., 2023. Pybids: Python tools for bids datasets. URL: https://doi.org/10.5281/zenodo.8253830, doi:10.5281/zenodo.8253830.
Yosinski et al. (2014) Yosinski, J., Clune, J., Bengio, Y., Lipson, H., 2014. How transferable are features in deep neural networks? Advances in neural information processing systems 27.
Zhang et al. (2022) Zhang, Y., Teng, Q., Liu, Y., Liu, Y., He, X., 2022. Diagnosis of alzheimer’s disease based on regional attention with smri gray matter slices. Journal of neuroscience methods 365, 109376.
Zhou et al. (2023) Zhou, Q., Wang, J., Yu, X., Wang, S., Zhang, Y., 2023. A survey of deep learning for alzheimer’s disease. Machine Learning and Knowledge Extraction 5, 611–668.