BrainMAE: A Region-aware Self-supervised Learning Framework for Brain Signals

Yifan Yang¹, Yutong Mao¹, Xufu Liu¹, Xiao Liu¹,²
¹Department of Biomedical Engineering, The Pennsylvania State University
²Institute for Computational and Data Sciences, The Pennsylvania State University
{yzy161, yzm5278, xbl5292, xxl213}@psu.edu

Abstract

The human brain is a complex, dynamic network that is commonly studied using functional magnetic resonance imaging (fMRI) and modeled as a network of regions of interest (ROIs) to understand various brain functions. Recent studies use deep learning to learn brain network representations from the functional connectivity (FC) profile, and they fall broadly into two categories. Fixed-FC approaches use a single FC profile, which represents the linear temporal relations within the brain network, and are therefore limited because they fail to capture informative temporal dynamics. Dynamic-FC approaches, on the other hand, model the FC profile as it evolves over time, but they often perform less well because of the difficulty of handling the inherently noisy nature of fMRI data.

To address these challenges, we propose Brain Masked Auto-Encoder (BrainMAE) for learning representations directly from fMRI time-series data. Our approach incorporates two essential components—a region-aware graph attention mechanism designed to capture the relationships between different brain ROIs, and a novel self-supervised masked autoencoding framework for effective model pre-training. These components enable the model to capture rich temporal dynamics of brain activity while maintaining resilience to inherent noise in fMRI data. Our experiments demonstrate that BrainMAE consistently outperforms established baseline methods by significant margins in four distinct downstream tasks. Finally, leveraging the model’s inherent interpretability, our analysis of model-generated representations reveals findings that resonate with ongoing research in the field of neuroscience. The code is available at https://anonymous.4open.science/r/fMRI-State-F014.

1 Introduction

Functional magnetic resonance imaging (fMRI) is a non-invasive neuroimaging technique used to measure brain activity. Due to its high spatial and temporal resolution, fMRI has become a cornerstone of neuroscience research, enabling the study of brain functions [8, 5]. A general practice is to extract a set of brain regions of interest (ROIs) and conceptualize the brain as a network composed of these ROIs. The connectivity between these ROIs is defined based on their linear temporal relationships, i.e., the correlation between the ROI signals. This profile of functional connectivity (FC) serves as a valuable biomarker, offering insights into the study of brain diseases [15, 52], aging [14, 12], and behaviors [44], and has emerged as a key tool for understanding brain function.

Recent advances have sought to leverage the rich information contained in fMRI data by applying deep learning techniques, capitalizing on their capacity for high-level representation learning. The prevalent approach in this domain involves employing Graph Neural Networks (GNNs) to extract intricate brain network representations, which can then be applied to tasks such as decoding human traits or diagnosing diseases [22, 23, 24]. These models can be broadly classified into two categories based on their treatment of temporal dynamics within the data. The first category, referred to as Fixed-FC models, relies on FC matrices computed from the entire time series of fMRI data. In contrast, the second category, known as Dynamic-FC models, takes into account the temporal evolution of brain networks. These models compute FC within temporal windows using a sliding-window approach or directly learn FC patterns from the time-series data [21]. However, both types of models exhibit certain limitations, primarily due to the unique characteristics of fMRI data.

For Fixed-FC models, depending solely on the FC profile can limit their representational capacity, as it overlooks the valuable temporal dynamics inherent in brain activities. These dynamics are often considered essential for capturing the brain’s evolving states, and failing to account for them results in suboptimal brain network representations [20, 36, 31]. However, current Dynamic-FC models, which do take these dynamics into account, often underperform Fixed-FC models [24]. This discrepancy can be attributed to the intrinsically noisy nature of fMRI signals: modeling temporal dynamics may amplify noise to some extent, whereas Fixed-FC approaches tend to mitigate noise by summarizing FC matrices over the entire time series [20, 51, 3]. Furthermore, FC has been shown to be sensitive to denoising preprocessing pipelines in neuroscience studies, potentially limiting the generalizability of model representations to differently preprocessed fMRI data [34, 26, 48].

In response to these challenges, we propose Brain Masked Auto-Encoder (BrainMAE), a novel approach for learning representations from fMRI data. Our approach captures the rich temporal dynamics present in fMRI data and mitigates the impact of inherent noise through two essential components. First, drawing inspiration from the practice of word embeddings in natural language processing (NLP) [13], we maintain a shared embedding vector for each brain ROI. These ROI embeddings are learned globally using fMRI data from all individuals in the dataset, enabling us to obtain rich and robust representations of each ROI. Based on these ROI embeddings, we introduce a region-aware attention mechanism, adhering to the constrained nature of brain network connectivity, thus providing a valuable constraint in feature learning. Second, we leverage the information contained within the fMRI data by introducing a novel pretraining framework inspired by the concept of masked autoencoding in NLP and computer vision research [7, 19]. This masked autoencoding approach empowers the model to acquire genuine and transferable representations of fMRI time-series data. By integrating these two components, BrainMAE consistently outperforms existing models by significant margins across several distinct downstream tasks. Furthermore, owing to its transformer-based design and inclusion of temporal components, BrainMAE provides interpretable results, shedding light on the insights it learns from the data. Lastly, we evaluate the model-generated representations, revealing intriguing findings that align with ongoing research in the field of neuroscience.

2 Approach

Our approach incorporates two essential components: a novel region-aware graph attention mechanism and a masked autoencoding pretraining framework tailored for fMRI representation learning.

2.1 Region-aware Graph Attention

We motivate our region-aware attention module based on the inherent characteristics of brain ROIs.

• Functional Specificity. The brain is organized as a distributed system, with each distinct brain region serving a specific and well-defined role in the overall functioning of the brain [35].
• Functional Connectivity. Different brain regions often are interconnected and collaborate to facilitate complex cognitive functions [5].
• Inter-Individual Consistency. Brain regions are known to exhibit a relatively consistent functional profile across different individuals. For instance, the primary visual cortex consistently processes visual information in nearly all individuals [25].

Figure 1: Overview of the Proposed BrainMAE Method. (A) Overall pre-training procedures for BrainMAE. (B) Region-aware Graph Attention. (C) Architecture of proposed TSE modules.

ROI Embeddings. There is a notable similarity in the representation properties between brain ROIs and words in language. Every ROI and word carries specific functional meanings, and when combined into brain networks or sentences, they form more complicated concepts. Furthermore, the functional meaning of ROIs and words typically exhibit stability among different human individuals or across sentences. Therefore, motivated by language modeling research, we assign a learnable $d$ -dimensional vector, referred to as ROI embedding, to each brain ROI. Then $N$ ROIs that cover the entire brain cortical regions form an embedding matrix denoted as ${\bm{E}}\in\mathbb{R}^{N\times d}$ .

Region-aware Graph Attention Module. The brain functions as a network, with brain ROIs interconnected to form a functional graph. Within this graph, each ROI is considered a node, with its node feature represented by ${\bm{x}}\in\mathbb{R}^{d}$ (note: the ROI time-series signal of length $\tau$ is first projected into $d$ -dimensional space). The set of nodes in the graph is denoted as $\mathcal{V}$ .

Brain ROI activities are intrinsically governed by both structural and functional networks. ROIs that are functionally similar tend to collaborate and exhibit synchronized activities [5]. Drawing from these biological insights, and considering that functional relevance between ROIs can be quantified by the similarity in embeddings, we define the embedding-based graph adjacency matrix ${\bm{A}}\in\mathbb{R}^{N\times N}$ . As illustrated in Figure 1B, each entry contains the edge weight between nodes $i,j\in\mathcal{V}$ :

$${\bm{A}}_{ij}=s({\bm{W}}_{q}{\bm{e}}_{i},{\bm{W}}_{k}{\bm{e}}_{j}). \tag{1}$$

In this equation, $s:\mathbb{R}^{d}\times\mathbb{R}^{d}\longrightarrow\mathbb{R}$ is a similarity measure between two vectors (e.g., the scaled dot product), and ${\bm{e}}_{i}$ , ${\bm{e}}_{j}$ are the embedding vectors for nodes $i$ and $j$ , respectively. The weight matrices ${\bm{W}}_{q},{\bm{W}}_{k}\in\mathbb{R}^{d\times d}$ are learnable and introduce asymmetry into ${\bm{A}}$ , representing the asymmetric information flow between two ROIs. Then, adopting the idea of the graph attention mechanism, we derive the attention weight between node $i$ and node $j\in\mathcal{V}\backslash\{i\}$ as:

$$\alpha_{ij}=\mathrm{softmax}_{j}({\bm{A}}_{i})=\frac{\exp({\bm{A}}_{ij})}{\sum_{k\in\mathcal{V}\backslash\{i\}}\exp({\bm{A}}_{ik})}. \tag{2}$$

Grounded in the synchronized nature among functionally related ROIs, self-loops are removed from the attention. This design prevents the attention from favoring its own node, enhancing reliability of feature learning by aggregating information from functionally connected nodes, thus reducing its sensitivity to input noise of each node. Hence the feature extracted by the module for node $i$ is

$${\bm{x}}^{\prime}_{i}=\sum\nolimits_{j\in\mathcal{V}\backslash\{i\}}\alpha_{ij}{\bm{x}}_{j}. \tag{3}$$

For implementation, we integrate the region-aware graph attention into the standard transformer block [50], employing ROI embeddings as both key and query, and using input node feature as value.
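Equations 1 to 3 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the released implementation: it takes $s$ to be the scaled dot product, uses a single attention head, and omits the surrounding transformer block (residual connections, normalization, feed-forward layers). The function name `region_aware_attention` and the random inputs are hypothetical.

```python
import numpy as np

def region_aware_attention(E, X, W_q, W_k):
    """Sketch of Eqs. 1-3 (assumes s = scaled dot product, single head).

    E:   (N, d) ROI embeddings, used as both query and key source
    X:   (N, d) node features, used as values
    W_q: (d, d) query weights; W_k: (d, d) key weights
    """
    d = E.shape[1]
    Q = E @ W_q.T                      # row i is W_q e_i
    K = E @ W_k.T                      # row j is W_k e_j
    A = (Q @ K.T) / np.sqrt(d)         # Eq. 1: embedding-based adjacency
    np.fill_diagonal(A, -np.inf)       # remove self-loops before the softmax
    A = A - A.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(A)                # diagonal becomes exactly 0
    alpha = weights / weights.sum(axis=1, keepdims=True)  # Eq. 2
    return alpha @ X, alpha            # Eq. 3: aggregate neighbour features

# Toy example with random (hypothetical) inputs
rng = np.random.default_rng(0)
N, d = 6, 4
E = rng.normal(size=(N, d))
X = rng.normal(size=(N, d))
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
X_new, alpha = region_aware_attention(E, X, W_q, W_k)
```

Because the attention weights depend only on the ROI embeddings and not on `X`, the same `alpha` is applied to every input segment, which is what makes the connectivity "static" with respect to the input signal.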

2.2 Brain Masked AutoEncoder

In order to effectively capture the temporal dynamics and extract the genuine representation from fMRI signals, we utilize a transformer-based encoder-decoder architecture and design a novel nontrivial self-supervised task for pretraining, as shown in Figure 1A.

Temporal Segmentation. Similar to previous studies on vision transformers, where 2D images are divided into non-overlapping patches [19], we temporally segment the fMRI signals. Each fMRI segment has the shape of $N\times\tau$ , where $\tau$ represents the length of the segment. Such segmentation allows transformer-like models to be seamlessly applied, as each fMRI segment can be viewed as a token, and thus the original fMRI data can be represented as a sequence of tokens. Throughout our study, $\tau$ is set to 15 seconds, aligning with the typical duration of transient events in fMRI data [43, 6].
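The segmentation step can be illustrated with a short NumPy sketch. It assumes the signal is stored as an $N\times T$ array and that the segment length is given in samples rather than seconds (the conversion depends on the repetition time of the scan, which is not fixed here); the helper name `segment_timeseries` is hypothetical, and dropping the trailing remainder is one possible choice.

```python
import numpy as np

def segment_timeseries(signal, seg_len):
    """Split an (N, T) ROI signal into non-overlapping segments.

    Returns an array of shape (n_segments, N, seg_len). Trailing samples
    that do not fill a whole segment are dropped (an assumption).
    """
    n_rois, n_timepoints = signal.shape
    n_segments = n_timepoints // seg_len
    trimmed = signal[:, : n_segments * seg_len]
    # (N, n_segments, seg_len) -> (n_segments, N, seg_len)
    return trimmed.reshape(n_rois, n_segments, seg_len).transpose(1, 0, 2)

# Toy example: 3 ROIs, 10 time points, segments of 4 samples
signal = np.arange(30, dtype=float).reshape(3, 10)
segments = segment_timeseries(signal, seg_len=4)
```

Each `segments[k]` then plays the role of one token $\mathbf{X}_{k}$ in the sequence.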

Transient State Encoders (TSEs). We embed each fMRI segment denoted as $\mathbf{X}_{k}\in\mathbb{R}^{N\times\tau}$ using Transient State Encoders (TSEs) with detailed architecture illustrated in Figure 1C. We introduce two types of TSEs, namely Static-Graph TSE (SG-TSE) and Adaptive-Graph TSE (AG-TSE), indicating how node features are learned.

Both TSEs consist of three transformer blocks but differ in the attention mechanisms applied within these layers. For SG-TSE, all three transformer blocks exclusively employ region-aware graph attention mechanism, assuming “static” connectivity among brain regions. On the other hand, AG-TSE incorporates two self-attention blocks stacked on top of a region-aware attention block, allowing the attention to be “adaptive” to the node input signal, enabling the model to capture the transient reorganization of brain connectivity. The output from the transformer blocks forms a matrix ${\bm{X}}^{o}\in\mathbb{R}^{N\times d}$ , where each row represents the feature learned for each ROI. We employ a linear projection $g:\mathbb{R}^{N\times d}\longrightarrow\mathbb{R}^{d}$ to aggregate all of the ROI features into a single vector ${\bm{s}}_{k}\in\mathbb{R}^{d}$ . This vector, as the output of TSE, represents the transient state of fMRI segment $k$ .

Segment-wise ROI Masking. Different from the masked autoencoding commonly used in image or language modeling studies [19, 13], where tokens or image patches are typically masked out, we employ a segment-wise ROI masking approach. Specifically, for each fMRI segment, we randomly choose a subset of the ROIs, such as 70% of the ROIs, and then mask out selected ROI signals within that segment. The masked ROI segments are replaced with a masked token, a shared and learnable $d$ -dimensional vector to indicate the presence of missing ROI signals. This masking scheme introduces a nontrivial reconstruction task, guiding the model to learn functional relationships between ROIs.
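A minimal sketch of segment-wise ROI masking follows. It assumes that an independent random subset of ROIs is drawn for every segment and that masking is applied to the $d$ -dimensional projected features; the function names and the zero-valued stand-in for the learnable mask token are hypothetical.

```python
import numpy as np

def sample_roi_mask(n_segments, n_rois, mask_ratio, rng):
    """Boolean mask of shape (n_segments, n_rois); True marks a masked ROI.

    A separate random subset of ROIs is drawn for each segment (assumption).
    """
    n_masked = int(round(mask_ratio * n_rois))
    mask = np.zeros((n_segments, n_rois), dtype=bool)
    for k in range(n_segments):
        chosen = rng.choice(n_rois, size=n_masked, replace=False)
        mask[k, chosen] = True
    return mask

def apply_mask_token(features, mask, mask_token):
    """Replace masked ROI features with the shared mask token.

    features:   (n_segments, n_rois, d) projected ROI features
    mask:       (n_segments, n_rois) boolean
    mask_token: (d,) shared vector (learnable in a real model)
    """
    out = features.copy()
    out[mask] = mask_token   # broadcasts over all masked (segment, ROI) pairs
    return out

rng = np.random.default_rng(0)
n_segments, n_rois, d = 20, 100, 8
features = rng.normal(size=(n_segments, n_rois, d))
mask_token = np.zeros(d)     # placeholder for a learnable parameter
mask = sample_roi_mask(n_segments, n_rois, mask_ratio=0.7, rng=rng)
masked_features = apply_mask_token(features, mask, mask_token)
```

Since a masked ROI can only be recovered from the unmasked ROIs of the same segment and from neighbouring segments, the reconstruction task pushes the model toward learning relationships between ROIs.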

Autoencoder. We employ a transformer-based autoencoder both to capture the temporal relationships between fMRI segments and to extract the overall fMRI representation. The encoder maps the input sequence of transient state embeddings generated by the TSE ( ${\bm{s}}_{1},{\bm{s}}_{2},\ldots,{\bm{s}}_{n}$ ) to a sequence of hidden representations ( ${\bm{h}}_{1},{\bm{h}}_{2},\ldots,{\bm{h}}_{n}$ ). Based on these hidden representations, the decoder reconstructs the fMRI segments ( $\hat{{\bm{X}}}_{1},\hat{{\bm{X}}}_{2},\ldots,\hat{{\bm{X}}}_{n}$ ). Both the encoder and decoder consist of two standard transformer blocks, and position embeddings are added for all tokens in both the encoder and decoder. The decoder is used only in the pre-training phase and is omitted from downstream task fine-tuning.

Reconstruction Loss. We compute Mean Squared Error (MSE) loss to evaluate the reconstruction error for masked ROI segments and unmasked ROI segments separately:

$$\mathcal{L}_{\text{mask}}=\sum^{n}_{k=1}\frac{1}{n\tau|\Omega_{k}|}\sum_{i\in\Omega_{k}}\|\hat{{\bm{X}}}_{k,i}-{\bm{X}}_{k,i}\|^{2} \tag{4}$$
$$\mathcal{L}_{\text{unmask}}=\sum^{n}_{k=1}\frac{1}{n\tau|\mathcal{V}\backslash\Omega_{k}|}\sum_{i\in\mathcal{V}\backslash\Omega_{k}}\|\hat{{\bm{X}}}_{k,i}-{\bm{X}}_{k,i}\|^{2} \tag{5}$$

where $n$ is the total number of fMRI segments, $\Omega_{k}$ is the set of masked ROIs in the $k$ -th segment, ${\bm{X}}_{k,i}$ is the $k$ -th fMRI segment of the $i$ -th ROI, and $\hat{{\bm{X}}}_{k,i}$ is its reconstruction. The total reconstruction loss for pretraining the model is the weighted sum of the two:

$$\mathcal{L}=\lambda\mathcal{L}_{\text{mask}}+(1-\lambda)\mathcal{L}_{\text{unmask}} \tag{6}$$

where $\lambda$ is a hyperparameter. In our study, we fix $\lambda$ at 0.75 so that the reconstruction error on masked ROI segments is weighted more heavily than the error on unmasked ones.
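The three loss equations can be written out as a short NumPy sketch. It assumes the segments are stacked into arrays of shape $(n, N, \tau)$ and that the masked set is given as a boolean array; the function name `reconstruction_loss` and the guard against empty masked or unmasked sets are additions for illustration only.

```python
import numpy as np

def reconstruction_loss(X_hat, X, mask, lam=0.75):
    """Weighted masked/unmasked MSE following Eqs. 4-6.

    X_hat, X: (n, N, tau) reconstructed and original segments
    mask:     (n, N) boolean, True where the ROI segment was masked
    Returns (total, loss_mask, loss_unmask).
    """
    n, n_rois, tau = X.shape
    sq_err = ((X_hat - X) ** 2).sum(axis=2)        # squared L2 norm per ROI segment
    n_masked = mask.sum(axis=1)                    # |Omega_k|
    n_unmasked = n_rois - n_masked                 # |V \ Omega_k|
    # np.maximum avoids division by zero when a set is empty (assumption)
    masked_term = (sq_err * mask).sum(axis=1) / np.maximum(n_masked, 1)
    unmasked_term = (sq_err * ~mask).sum(axis=1) / np.maximum(n_unmasked, 1)
    loss_mask = masked_term.sum() / (n * tau)      # Eq. 4
    loss_unmask = unmasked_term.sum() / (n * tau)  # Eq. 5
    total = lam * loss_mask + (1.0 - lam) * loss_unmask  # Eq. 6
    return total, loss_mask, loss_unmask

# Toy check: every entry is off by exactly 1, so both terms equal 1
X = np.zeros((2, 4, 3))
X_hat = np.ones((2, 4, 3))
mask = np.zeros((2, 4), dtype=bool)
mask[:, :2] = True
total, loss_mask, loss_unmask = reconstruction_loss(X_hat, X, mask)
```

With the normalization by $n\tau|\Omega_{k}|$, each term is a per-element mean squared error, so the two terms stay on the same scale regardless of the mask ratio.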

BrainMAEs. Based on the choice of TSE, we introduce two variants of BrainMAE: SG-BrainMAE and AG-BrainMAE, incorporating SG-TSE and AG-TSE for transient state encoding, respectively.

3 Experiments

3.1 Model Validation with Synthetic Data

We posit that our model design and tailored self-supervised pre-training scheme facilitate the learning of region-specific information within the ROI embeddings. To test this claim, we conduct a simulation study where SG-BrainMAE is pre-trained using a synthetic dataset. We pre-define a connectivity matrix among ROIs (Figure 2A) and generate synthetic signals according to the criterion that signals between two ROIs with high connectivity tend to exhibit similar fluctuation (Figure 2B; see Appendix G for detailed simulation setup). As shown in Figure 2C and 2D, the similarity matrix of the ROI embeddings trained with this synthetic dataset accurately captures the ground-truth connectivity. Furthermore, the t-SNE plot of ROI embeddings shows that ROIs with strong connectivity are clustered closely together, thereby providing validation for our claim and design principles.

3.2 fMRI Datasets

We mainly use the following fMRI datasets to evaluate our approach. HCP-3T dataset [49] is a large-scale publicly available dataset that includes 3T fMRI data from 897 healthy adults aged between 22 and 35. We use both the resting-state and task sessions as well as the behavior measurements in our study. HCP-7T dataset is a subset of the HCP S1200 release, consisting of 7T fMRI data from 184 subjects within the age range of 22 to 35. Our analysis focused on the resting-state sessions of this dataset. HCP-Aging dataset [18], designed for aging studies, contains 725 subjects aged 36 to 100+. Age, gender information as well as the resting-state fMRI data are used in our study. NSD dataset [2] is a massive 7T fMRI dataset, featuring 8 human participants, each with 30-40 scan sessions conducted over the course of a year. For our study, we incorporated both task fMRI data and task performance metrics, including task scores and response times (RT). Detailed information regarding each of the datasets can be found in Appendix 8.

3.3 Pre-training Evaluation

3.3.1 Implementation Details

During the pretraining phase, at each iteration we randomly select a fixed-length window of 300 seconds of fMRI signal from the original sample and divide it into 20 segments, each containing 15 seconds of fMRI data. We use a variable mask ratio for each mini-batch during training. The mask ratio for each mini-batch is drawn from the range (0, 0.8), where 0 indicates that no masking is applied. For each pretraining dataset, we train the model using all available samples for 1000 epochs. Additional training settings are available in Appendix A.1.
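The sampling procedure can be sketched as follows. Several details are assumptions rather than statements about the actual training code: the mask ratio is drawn uniformly, the repetition time `tr_seconds` converts seconds to samples by rounding, and the function name `sample_pretraining_input` is hypothetical.

```python
import numpy as np

def sample_pretraining_input(signal, tr_seconds, rng, window_seconds=300.0,
                             n_segments=20, max_mask_ratio=0.8):
    """Randomly crop a fixed-length window and draw a mask ratio.

    signal: (N, T) ROI time series sampled every `tr_seconds` seconds.
    Returns (segments, mask_ratio), segments of shape (n_segments, N, seg_len).
    """
    n_rois, n_timepoints = signal.shape
    seg_len = int(round(window_seconds / n_segments / tr_seconds))
    window_len = seg_len * n_segments
    if n_timepoints < window_len:
        raise ValueError("recording is shorter than the requested window")
    start = int(rng.integers(0, n_timepoints - window_len + 1))
    window = signal[:, start:start + window_len]
    segments = window.reshape(n_rois, n_segments, seg_len).transpose(1, 0, 2)
    # Uniform sampling is an assumption; the text only gives the range.
    mask_ratio = float(rng.uniform(0.0, max_mask_ratio))
    return segments, mask_ratio

rng = np.random.default_rng(0)
signal = rng.normal(size=(100, 1200))   # synthetic stand-in: 100 ROIs, 1200 samples
segments, mask_ratio = sample_pretraining_input(signal, tr_seconds=1.0, rng=rng)
```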

3.3.2 Masked Signal Reconstruction

We assess the reconstruction performance of the HCP-3T pretrained model on unseen HCP-7T dataset. Example reconstruction results are shown in Figure 3A and Appendix I. Figure 3B presents comparative analysis of the reconstruction performance, quantified by $R^{2}$ values, between the proposed models and variants employing only self-attention within the TSE module. These variants are denoted as vanilla-BrainMAE (pretrained using the proposed masked autoencoding, see B.1) and vanilla-BrainAE (pretrained with standard autoencoding, see B.2). Both SG-BrainMAE and AG-BrainMAE achieve better reconstruction performance across all mask ratios, suggesting the incorporation of region-aware graph attention is advantageous for learning generalized representations.

3.3.3 ROI Embeddings

It is crucial to validate whether the ROI embeddings pretrained with the proposed approach truly satisfy the aforementioned three ROI characteristics in section 2.1.

Functional Specificity. We visualize t-SNE transformed ROI embeddings in Figure 3C [47]. In the projected 2D space, the ROIs exhibit discernible clustering that aligns with the Yeo-17 networks’ definitions in neuroscience studies [57]. This alignment suggests that ROIs with similar functional roles (e.g., visual processing) display similar embeddings. In other words, specific brain functions can be inferred from these embeddings.

Table 1: Results for behavior prediction.

Model	Gender Accuracy (%)	Gender AUROC (%)	PicSeq (MAE)	PMT_CR (MAE)	PMT_SI (MAE)	PicVocab (MAE)	IWRD (MAE)	ListSort (MAE)	LifeSatisf (MAE)	PSQI (MAE)
Fixed-FC
BrainNetTF-OCR	94.11±0.98	94.39±0.90	7.11±0.22	2.28±0.11	1.72±0.12	4.70±0.10	1.56±0.03	5.96±0.07	4.96±0.22	1.49±0.05
BrainNetTF-Vanilla	90.00±1.05	89.93±0.94	8.19±0.46	2.73±0.07	2.13±0.07	5.93±0.14	1.81±0.08	6.91±0.21	5.71±0.08	1.68±0.05
BrainNetCNN	90.68±1.80	90.89±1.55	10.21±0.22	3.25±0.13	2.64±0.14	6.65±0.27	2.26±0.05	8.51±0.20	7.12±0.24	1.68±0.05
Dynamic-FC
STAGIN-SERO	88.73±1.36	88.69±1.41	10.22±0.15	3.49±0.05	2.70±0.06	6.78±0.20	2.26±0.05	8.51±0.20	7.12±0.24	2.12±0.04
STAGIN-GARO	88.34±0.94	88.33±0.91	10.26±0.18	3.44±0.10	2.69±0.09	6.92±0.30	2.25±0.04	8.52±0.26	7.09±0.35	2.08±0.04
FBNetGen	88.05±0.15	87.93±0.97	8.62±0.21	2.93±0.11	2.34±0.11	5.83±0.15	1.96±0.04	7.31±0.10	6.09±0.10	1.81±0.03
Ours
vanilla-BrainAE	94.11±1.02	94.07±1.09	7.63±0.27	2.50±0.12	1.92±0.11	5.01±0.17	1.67±0.04	6.45±0.28	5.40±0.13	1.59±0.05
vanilla-BrainMAE	95.80±1.23	96.13±1.09	5.11±0.15	1.69±0.06	1.30±0.07	3.40±0.11	1.12±0.05	4.33±0.16	3.60±0.15	1.05±0.06
SG-BrainMAE	97.49±0.15	97.46±0.18	5.06±0.21	1.63±0.08	1.24±0.04	3.40±0.14	1.11±0.04	4.35±0.12	3.64±0.27	1.05±0.06
AG-BrainMAE	97.13±0.56	97.17±0.61	5.09±0.05	1.67±0.10	1.28±0.06	3.34±0.11	1.13±0.03	4.37±0.06	3.58±0.17	1.07±0.05

Table 2: Results for age prediction.

Model	Gender Accuracy (%)	Gender AUROC (%)	Age (MAE)
Fixed-FC
BrainNetTF-OCR	90.21±3.81	90.73±2.85	6.15±0.71
BrainNetTF-Vanilla	88.96±2.16	88.76±2.21	6.78±0.56
BrainNetCNN	88.83±1.52	88.74±1.58	8.71±0.62
Dynamic-FC
STAGIN-SERO	82.37±1.66	82.57±1.36	8.96±0.47
STAGIN-GARO	80.67±0.81	80.58±1.03	8.65±0.28
FBNetGen	89.50±3.58	89.34±3.49	6.68±1.00
Ours
vanilla-BrainAE	80.92±2.40	81.03±2.52	8.33±0.49
vanilla-BrainMAE	88.54±2.50	88.53±2.37	7.26±0.95
SG-BrainMAE	92.67±1.07	92.51±1.07	5.78±0.44
AG-BrainMAE	91.12±1.99	91.15±2.03	6.49±1.00

Table 3: Results for task performance prediction.

Model	Task score (MAE)	RT (MAE in ms)
Fixed-FC
BrainNetTF-OCR	0.070±0.003	92.344±2.343
BrainNetTF-Vanilla	0.075±0.004	96.252±2.133
BrainNetCNN	0.078±0.004	102.911±2.225
Dynamic-FC
STAGIN-SERO	0.089±0.003	116.635±2.197
STAGIN-GARO	0.091±0.002	116.130±2.099
FBNetGen	0.074±0.005	95.349±2.320
Ours
vanilla-BrainAE	0.091±0.004	118.965±3.047
vanilla-BrainMAE	0.083±0.004	108.215±3.458
SG-BrainMAE	0.069±0.004	90.678±1.767
AG-BrainMAE	0.070±0.003	92.154±2.265

Functional Connectivity. Interestingly, the arrangement of the ROIs in the projected 2D space also reflects the cortical hierarchy, as indicated by principal gradient (PG) values (Figure 3D-E) [33, 17, 39]. Low PG values correspond to cortical low-order regions, such as visual and somatosensory regions, while high PG values correspond to cortical high-order regions, including the default mode network and limbic system. Therefore, the interconnectivity between different brain networks, as captured by the PG, is also reflected in the ROI embeddings.

Inter-Individual Consistency. We separately pretrain the SG-BrainMAE on the HCP-3T, HCP-7T, and NSD task datasets. Both the HCP-3T and HCP-7T datasets are available with two different preprocessing pipelines, namely minimal preprocessing and ICA-FIX, and we pretrain a model for each combination. In total, we obtain five independently pretrained models. For each pretrained model, we generate an embedding similarity matrix by computing pairwise cosine similarities between ROI embeddings, as shown in Figure 3F-G. Importantly, these similarity matrices exhibit consistent patterns across different datasets, regardless of preprocessing pipeline or fMRI session type (resting or task), suggesting that the ROI representations converge.
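The embedding similarity matrix described here is a plain pairwise cosine similarity, which can be sketched as below. The embedding matrix in the example is random and purely illustrative, and the small lower bound on the norms is an added safeguard against division by zero.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarity between rows of an (N, d) embedding matrix."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    E_unit = E / np.clip(norms, 1e-12, None)   # guard against zero-norm rows
    return E_unit @ E_unit.T                    # (N, N), symmetric

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 32))   # hypothetical embeddings: 100 ROIs, d = 32
S = cosine_similarity_matrix(E)
```

Computing `S` separately for each pretrained model yields matrices that can be compared directly, since all of them are indexed by the same set of ROIs.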

Despite the general consistency of ROI embeddings across datasets, the modular structure of the network constructed from the embedding similarity matrix, quantified by modularity, decreases with aging (Figure 9), in line with established findings in neuroscience research [45, 54]. More ROI analysis is shown in Appendix D.

3.3.4 Representation Analysis

To study the pretrained fMRI representation (the output of the CLS token, $h_{[CLS]}$ ), we evaluate the HCP-3T pretrained model on the unseen NSD dataset. Notably, as shown in Figure 3H and 3I, the t-SNE plot of the fMRI representation demonstrates clear separation between fMRI sessions from different subjects. Within individuals, resting-state fMRI sessions are distinctly separated from task-based fMRI sessions. These results suggest that the pre-trained fMRI representation carries valuable information for distinguishing individuals and reflecting their respective brain arousal states.

3.4 Transfer Learning Evaluation

3.4.1 Implementation Details

We pretrain the model using HCP-3T ICA-FIX preprocessed data and fine-tune the whole network except for the ROI embeddings for each downstream task. Similar to previous studies [19, 13], we use the CLS token from the encoder and append a task-specific linear head for prediction. Cross-entropy loss and mean squared error (MSE) loss are used for classification and regression fine-tuning tasks respectively. For fair comparisons, we use 5-fold cross-validation for model evaluation. Detailed information regarding the architecture and training settings can be found in Appendix A.2.

3.4.2 Steady-state Variables Prediction

For the prediction of steady-state variables, models typically employ entire fMRI signals spanning hundreds of seconds to make predictions regarding an individual’s traits, age, and gender information. We benchmark our approach against two categories of baseline neural network models specifically designed for these tasks. (1) Baseline models based on fixed brain network (Fixed-FC). BrainNetTF-OCR [22] is a transformer-based model with Orthonormal Clustering Readout (OCR), representing the state-of-the-art method for brain network analysis. BrainNetTF-Vanilla is a variant of BrainNetTF with CONCAT-based readout. BrainNetCNN [23] follows the CNN paradigm by treating the functional connectivity matrices similarly to 2D images. (2) Baseline models based on dynamic brain network (Dynamic-FC). STAGIN [24] is a transformer-based model that learns dynamic graph representations with spatial-temporal attention. Two variants, STAGIN-GARO and STAGIN-SERO, which use different readout methods, are included for comparison. FBNetGen [21] is a GNN-based model that learns the brain network from the fMRI time-series signals.

We compare BrainMAEs against baseline methods across three distinct downstream tasks. (1) Behaviors prediction. In this task, the models are required to simultaneously perform gender classification as well as predict 8 cognitive behaviors (See Table 18 for details) using HCP-3T dataset. (2) Age prediction. For this task, the models are required to simultaneously perform gender classification and age prediction on HCP-Aging dataset. (3) Task performance prediction. For this task, the models are required to predict the averaged memory task score as well as the averaged response time (RT) for each fMRI run using NSD dataset.

The results for the three downstream tasks are shown in Tables 1, 2, and 3 and reported as mean $\pm$ std. from 5-fold cross-validation, with regression variables measured in mean absolute error (MAE). Our proposed methods consistently demonstrate superior performance across all these tasks.
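The reporting convention (per-fold MAE summarized as mean $\pm$ std.) can be illustrated with the sketch below. The random fold assignment, the population standard deviation, the synthetic targets, and the trivial mean predictor standing in for a fine-tuned model are all assumptions made for the example.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, rng):
    """Shuffle sample indices and split them into n_folds disjoint test folds."""
    return np.array_split(rng.permutation(n_samples), n_folds)

def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def summarize(fold_scores):
    """Mean and (population) standard deviation across folds."""
    scores = np.asarray(fold_scores, dtype=float)
    return float(scores.mean()), float(scores.std())

rng = np.random.default_rng(0)
y = rng.normal(loc=50.0, scale=10.0, size=200)   # synthetic regression target
folds = kfold_indices(len(y), n_folds=5, rng=rng)

fold_mae = []
for test_idx in folds:
    train_mask = np.ones(len(y), dtype=bool)
    train_mask[test_idx] = False
    # Placeholder model: predict the training-set mean for every test sample
    prediction = np.full(len(test_idx), y[train_mask].mean())
    fold_mae.append(mean_absolute_error(y[test_idx], prediction))

mae_mean, mae_std = summarize(fold_mae)
print(f"MAE: {mae_mean:.2f}±{mae_std:.2f}")
```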

Notably, despite the informative nature of temporal features, the three baseline models that leverage dynamic FC or learn the FC from the fMRI time series consistently underperform compared to models that utilize fixed FC. One plausible explanation could be attributed to the inherent noise present in fMRI data [24]. In contrast, even though BrainMAEs encode the complete temporal information of fMRI, they still achieve the highest level of performance. This achievement can be attributed to the region-aware graph attention module employed in our model design.

3.4.3 Transient Mental State Decoding

We evaluate the capability of BrainMAEs to decode transient brain states lasting only tens of seconds, a task that poses challenges for FC-based models because spurious connectivity emerges with such short time windows [41]. Hence, we compare our methods with current state-of-the-art transformer-based self-supervised learning approaches tailored for fMRI transient state modeling.

Baseline self-supervised approaches [46]. All of the baseline methods consider each time point as word and leverage recent successes in NLP pre-training techniques to model fMRI transient states. Causal Sequence Modeling (CSM) is pretrained based on the principle of causal language modeling, predicting the next signal using historical context. Sequence-BERT performs self-supervised learning by solving masked-language-modeling and next-sentence-prediction tasks. Network-BERT, a variant to Sequence-BERT, is designed to infer the entire timeseries of a masked network of ROIs.

Following the same experimental setup as in [46], we use publicly available HCP-3T task datasets [49] and identify 20 mental states across experimental tasks. As shown in Table 4, BrainMAEs outperform other self-supervised approaches. It’s important to note that we only use 100 cortical ROIs, in contrast to baseline methods utilizing 1024 Dictionary Learning Functional Modes (DiFuMo) ROIs including subcortical areas [11]. This comparison leads to two insights: firstly, cortical region activity alone might suffice for decoding mental states; secondly, given the highly correlated nature of fMRI signals, a small set of ROIs can effectively represent brain activity, potentially enabling the development of more efficient models for future research. Furthermore, AG-BrainMAE exhibits enhanced performance relative to SG-BrainMAE, indicating that integrating an adaptive component is beneficial for capturing transient state changes, while SG-BrainMAE is more suitable for steady-state variables prediction. Table 9 presents detailed results for each mental state from the multi-class decoding task, demonstrating consistently high decoding accuracy that appears to be insensitive to the duration of each distinct mental state.

3.4.4 Ablation Study

We conduct ablation studies on the aforementioned four downstream tasks to elucidate the advantage of incorporating the region-aware graph attention module and masked autoencoding in our approach. We compare four model variants: (1) SG-BrainMAE, (2) AG-BrainMAE, (3) vanilla-BrainMAE (BrainMAE without the region-aware attention modules, see B.1 for details), and (4) vanilla-BrainAE (sharing the same architecture as vanilla-BrainMAE but pretrained with standard autoencoding, see B.2 for details). The differences between these variants are detailed in Table 7. The results, which are presented in Tables 1, 2, 3, and 4, indicate a degradation in performance when the region-aware attention is removed, and a further significant decrease in performance when masked pretraining is excluded. This underscores the advantage of incorporating both components in the BrainMAE framework.

3.4.5 Interpretation Analysis

We interpret BrainMAE fine-tuned on the NSD task performance dataset. We evaluate the self-attention scores used to generate the CLS representation in the final layer of the encoder transformer. These attention scores provide insights into which time segments are crucial for the model to make predictions. As illustrated in Figure 4, we average the attention scores across different fMRI runs. Notably, the attention score reveals the NSD task block structure, indicating that the model inherently places more importance on task blocks to infer overall task performance. Interestingly, the attention score correlates with behavioral arousal measurements (inverse response time [32]), suggesting the model is aware of changes in the transient brain state. Indeed, the learned fMRI representation also correlates highly with the brain arousal index (Appendix E.1). Overall, these results underline the interpretability of BrainMAE, hinting at its potential to explore brain mechanisms for neuroscience research.

4 Related Work

Masked autoencoding. The incorporation of masked autoencoding for self-supervised representation learning has seen significant interest across various domains. In the realm of NLP, models like BERT [13] and GPT [7, 37, 38] employ masked autoencoding to pretrain language models by predicting the missing components of input sequences. In computer vision, masked modeling has been integrated with vision transformers, yielding successful approaches such as MAE [19], BEiT [4], and BEVT [53] for feature learning. Few studies explore masked autoencoding in the context of fMRI. [46] adapt a BERT-like framework by treating each time point as a "word", which is suitable for modeling transient states but limited in scaling to fMRI signals that span hundreds of seconds. Our approach differs by treating the transient state as the basic unit for sequence learning, allowing it to scale effectively and extract representations that reflect individual traits and behaviors.

Brain Network Analysis. GNN-based models have been widely used in the field of brain network analysis [28, 40, 56, 30, 58, 1, 27, 29, 10]. Models like GroupINN [55] introduce the concept of node grouping to enhance interpretability and reduce model size. BrainNetCNN [23] capitalizes on the topological locality of structural brain networks for graph-based feature learning. BrainNetTF [22], on the other hand, is a transformer-based model that employs orthonormal clustering readout to facilitate cluster-aware graph embedding learning. STAGIN [24] focuses on learning dynamic graph representations through spatial-temporal attention mechanisms, while FBNetGen [21] directly learns the brain network from fMRI time-series data. In contrast to these methods, which predominantly learn representation from functional connectivity (FC), our approach can effectively incorporate valuable temporal information while mitigating the impact of fMRI’s intrinsic noise through specifically designed modules.

5 Discussion and Conclusion

Here we propose BrainMAE for effectively learning the representation from fMRI time series data. Our approach integrates two essential components: a region-aware graph attention module and a masked self-supervised pretraining framework. These components are designed to capture temporal dynamics while mitigating the inherent noise in fMRI data. The alignment of the learned ROI embeddings with existing neuroscience knowledge, along with the improvement in transfer learning tasks, confirms the effectiveness of our design.

By providing a task-agnostic representation, BrainMAE exhibits promise for applications in the field of neuroscience. Its interpretability and ability to capture transient representations make it a valuable tool for uncovering the mechanisms and dynamics of transient state changes within the brain.

Furthermore, our approach can be extended beyond brain fMRI data. It can be applied to various domains that can be modeled as networks of functionally meaningful nodes. For instance, it could be applied to traffic network analysis, where the nodes represent either roads or spatially defined regions.

References

[1] David Ahmedt-Aristizabal, Mohammad Ali Armin, Simon Denman, Clinton Fookes, and Lars Petersson. Graph-based deep learning for medical diagnosis and analysis: past, present and future. Sensors, 21(14):4758, 2021.
[2] Emily J Allen, Ghislain St-Yves, Yihan Wu, Jesse L Breedlove, Jacob S Prince, Logan T Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, et al. A massive 7t fmri dataset to bridge cognitive neuroscience and artificial intelligence. Nature neuroscience, 25(1):116–126, 2022.
[3] Sonsoles Alonso and Diego Vidaurre. Toward stability of dynamic fc estimates in neuroimaging and electrophysiology: Solutions and limits. Network Neuroscience, 7(4):1389–1403, 2023.
[4] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
[5] Danielle S Bassett and Olaf Sporns. Network neuroscience. Nature neuroscience, 20(3):353–364, 2017.
[6] Taylor Bolt, Jason S Nomi, Danilo Bzdok, Jorge A Salas, Catie Chang, BT Thomas Yeo, Lucina Q Uddin, and Shella D Keilholz. A parsimonious description of global functional brain organization in three spatiotemporal patterns. Nature neuroscience, 25(8):1093–1103, 2022.
[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[8] Ed Bullmore and Olaf Sporns. Complex brain networks: graph theoretical analysis of structural and functional systems. Nature reviews neuroscience, 10(3):186–198, 2009.
[9] Catie Chang, David A Leopold, Marieke Louise Schölvinck, Hendrik Mandelkow, Dante Picchioni, Xiao Liu, Frank Q Ye, Janita N Turchi, and Jeff H Duyn. Tracking brain arousal fluctuations with fmri. Proceedings of the National Academy of Sciences, 113(16):4518–4523, 2016.
[10] Hejie Cui, Wei Dai, Yanqiao Zhu, Xiaoxiao Li, Lifang He, and Carl Yang. Interpretable graph neural networks for connectome-based brain disorder analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 375–385. Springer, 2022.
[11] Kamalaker Dadi, Gaël Varoquaux, Antonia Machlouzarides-Shalit, Krzysztof J Gorgolewski, Demian Wassermann, Bertrand Thirion, and Arthur Mensch. Fine-grain atlases of functional modes for fmri analysis. NeuroImage, 221:117126, 2020.
[12] Emily L Dennis and Paul M Thompson. Functional brain connectivity using fmri in aging and alzheimer’s disease. Neuropsychology review, 24:49–62, 2014.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[14] Luiz Kobuti Ferreira and Geraldo F Busatto. Resting-state functional connectivity in normal brain aging. Neuroscience & Biobehavioral Reviews, 37(3):384–400, 2013.
[15] Michael Greicius. Resting-state functional connectivity in neuropsychiatric disorders. Current opinion in neurology, 21(4):424–430, 2008.
[16] Yameng Gu, Feng Han, Lucas E Sainburg, and Xiao Liu. Transient arousal modulations contribute to resting-state functional connectivity changes associated with head motion parameters. Cerebral Cortex, 30(10):5242–5256, 2020.
[17] Yameng Gu, Lucas E Sainburg, Sizhe Kuang, Feng Han, Jack W Williams, Yikang Liu, Nanyin Zhang, Xiang Zhang, David A Leopold, and Xiao Liu. Brain activity fluctuations propagate as waves traversing the cortical hierarchy. Cerebral cortex, 31(9):3986–4005, 2021.
[18] Michael P Harms, Leah H Somerville, Beau M Ances, Jesper Andersson, Deanna M Barch, Matteo Bastiani, Susan Y Bookheimer, Timothy B Brown, Randy L Buckner, Gregory C Burgess, et al. Extending the human connectome project across ages: Imaging protocols for the lifespan development and aging projects. Neuroimage, 183:972–984, 2018.
[19] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
[20] R Matthew Hutchison, Thilo Womelsdorf, Elena A Allen, Peter A Bandettini, Vince D Calhoun, Maurizio Corbetta, Stefania Della Penna, Jeff H Duyn, Gary H Glover, Javier Gonzalez-Castillo, et al. Dynamic functional connectivity: promise, issues, and interpretations. Neuroimage, 80:360–378, 2013.
[21] Xuan Kan, Hejie Cui, Joshua Lukemire, Ying Guo, and Carl Yang. Fbnetgen: Task-aware gnn-based fmri analysis via functional brain network generation. In International Conference on Medical Imaging with Deep Learning, pages 618–637. PMLR, 2022.
[22] Xuan Kan, Wei Dai, Hejie Cui, Zilong Zhang, Ying Guo, and Carl Yang. Brain network transformer. Advances in Neural Information Processing Systems, 35:25586–25599, 2022.
[23] Jeremy Kawahara, Colin J Brown, Steven P Miller, Brian G Booth, Vann Chau, Ruth E Grunau, Jill G Zwicker, and Ghassan Hamarneh. Brainnetcnn: Convolutional neural networks for brain networks; towards predicting neurodevelopment. NeuroImage, 146:1038–1049, 2017.
[24] Byung-Hoon Kim, Jong Chul Ye, and Jae-Jin Kim. Learning dynamic graph representation of brain connectome with spatio-temporal attention. Advances in Neural Information Processing Systems, 34:4314–4327, 2021.
[25] Dorit Kliemann, Ralph Adolphs, J Michael Tyszka, Bruce Fischl, BT Thomas Yeo, Remya Nair, Julien Dubois, and Lynn K Paul. Intrinsic functional connectivity of the brain in adults with a single cerebral hemisphere. Cell reports, 29(8):2398–2407, 2019.
[26] Jingwei Li, Ru Kong, Raphaël Liégeois, Csaba Orban, Yanrui Tan, Nanbo Sun, Avram J Holmes, Mert R Sabuncu, Tian Ge, and BT Thomas Yeo. Global signal regression strengthens association between resting-state functional connectivity and behavior. NeuroImage, 196:126–141, 2019.
[27] Xiaoxiao Li, Nicha C Dvornek, Yuan Zhou, Juntang Zhuang, Pamela Ventola, and James S Duncan. Graph neural network for interpreting task-fmri biomarkers. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part V 22, pages 485–493. Springer, 2019.
[28] Xiaoxiao Li, Yuan Zhou, Nicha Dvornek, Muhan Zhang, Siyuan Gao, Juntang Zhuang, Dustin Scheinost, Lawrence H Staib, Pamela Ventola, and James S Duncan. Braingnn: Interpretable brain graph neural network for fmri analysis. Medical Image Analysis, 74:102233, 2021.
[29] Xiaoxiao Li, Yuan Zhou, Nicha C Dvornek, Muhan Zhang, Juntang Zhuang, Pamela Ventola, and James S Duncan. Pooling regularized graph neural network for fmri biomarker analysis. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part VII 23, pages 625–635. Springer, 2020.
[30] Lingwen Liu, Guangqi Wen, Peng Cao, Tianshun Hong, Jinzhu Yang, Xizhe Zhang, and Osmar R Zaiane. Braintgl: A dynamic graph representation learning model for brain network analysis. Computers in Biology and Medicine, 153:106521, 2023.
[31] Xiao Liu, Jacco A De Zwart, Marieke L Schölvinck, Catie Chang, Frank Q Ye, David A Leopold, and Jeff H Duyn. Subcortical evidence for a contribution of arousal to fmri studies of brain activity. Nature communications, 9(1):395, 2018.
[32] Elena Makovac, Sabrina Fagioli, David R Watson, Frances Meeten, Jonathan Smallwood, Hugo D Critchley, and Cristina Ottaviani. Response time as a proxy of ongoing mental state: A combined fmri and pupillometry study in generalized anxiety disorder. Neuroimage, 191:380–391, 2019.
[33] Daniel S Margulies, Satrajit S Ghosh, Alexandros Goulas, Marcel Falkiewicz, Julia M Huntenburg, Georg Langs, Gleb Bezgin, Simon B Eickhoff, F Xavier Castellanos, Michael Petrides, et al. Situating the default-mode network along a principal gradient of macroscale cortical organization. Proceedings of the National Academy of Sciences, 113(44):12574–12579, 2016.
[34] Linden Parkes, Ben Fulcher, Murat Yücel, and Alex Fornito. An evaluation of the efficacy, reliability, and sensitivity of motion correction strategies for resting-state functional mri. Neuroimage, 171:415–436, 2018.
[35] Jonathan D Power, Alexander L Cohen, Steven M Nelson, Gagan S Wig, Kelly Anne Barnes, Jessica A Church, Alecia C Vogel, Timothy O Laumann, Fran M Miezin, Bradley L Schlaggar, et al. Functional network organization of the human brain. Neuron, 72(4):665–678, 2011.
[36] Maria Giulia Preti, Thomas AW Bolton, and Dimitri Van De Ville. The dynamic functional connectome: State-of-the-art and perspectives. Neuroimage, 160:41–54, 2017.
[37] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
[38] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[39] Ryan V Raut, Abraham Z Snyder, Anish Mitra, Dov Yellin, Naotaka Fujii, Rafael Malach, and Marcus E Raichle. Global waves synchronize the brain’s functional systems with fluctuating arousal. Science advances, 7(30):eabf2709, 2021.
[40] Anwar Said, Roza Bayrak, Tyler Derr, Mudassir Shabbir, Daniel Moyer, Catie Chang, and Xenofon Koutsoukos. Neurograph: Benchmarks for graph machine learning in brain connectomics. Advances in Neural Information Processing Systems, 36:6509–6531, 2023.
[41] Antonis D Savva, Georgios D Mitsis, and George K Matsopoulos. Assessment of dynamic functional connectivity in resting-state fmri using the sliding window technique. Brain and behavior, 9(4):e01255, 2019.
[42] Alexander Schaefer, Ru Kong, Evan M Gordon, Timothy O Laumann, Xi-Nian Zuo, Avram J Holmes, Simon B Eickhoff, and BT Thomas Yeo. Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity mri. Cerebral cortex, 28(9):3095–3114, 2018.
[43] James M Shine, Patrick G Bissett, Peter T Bell, Oluwasanmi Koyejo, Joshua H Balsters, Krzysztof J Gorgolewski, Craig A Moodie, and Russell A Poldrack. The dynamics of functional brain networks: integrated network states during cognitive task performance. Neuron, 92(2):544–554, 2016.
[44] Stephen M Smith, Thomas E Nichols, Diego Vidaurre, Anderson M Winkler, Timothy EJ Behrens, Matthew F Glasser, Kamil Ugurbil, Deanna M Barch, David C Van Essen, and Karla L Miller. A positive-negative mode of population covariation links brain connectivity, demographics and behavior. Nature neuroscience, 18(11):1565–1567, 2015.
[45] Olaf Sporns and Richard F Betzel. Modular brain networks. Annual review of psychology, 67:613–640, 2016.
[46] Armin Thomas, Christopher Ré, and Russell Poldrack. Self-supervised learning of brain dynamics from broad neuroimaging data. Advances in Neural Information Processing Systems, 35:21255–21269, 2022.
[47] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
[48] Koene RA Van Dijk, Mert R Sabuncu, and Randy L Buckner. The influence of head motion on intrinsic functional connectivity mri. Neuroimage, 59(1):431–438, 2012.
[49] David C Van Essen, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub, Kamil Ugurbil, Wu-Minn HCP Consortium, et al. The wu-minn human connectome project: an overview. Neuroimage, 80:62–79, 2013.
[50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[51] Diego Vidaurre, Mark W Woolrich, Anderson M Winkler, Theodoros Karapanagiotidis, Jonathan Smallwood, and Thomas E Nichols. Stable between-subject statistical inference from unstable within-subject functional connectivity estimates. Human brain mapping, 40(4):1234–1243, 2019.
[52] Kun Wang, Meng Liang, Liang Wang, Lixia Tian, Xinqing Zhang, Kuncheng Li, and Tianzi Jiang. Altered functional connectivity in early alzheimer’s disease: A resting-state fmri study. Human brain mapping, 28(10):967–978, 2007.
[53] Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang, Luowei Zhou, and Lu Yuan. Bevt: Bert pretraining of video transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14733–14743, 2022.
[54] Gagan S Wig. Segregated systems of human brain networks. Trends in cognitive sciences, 21(12):981–996, 2017.
[55] Yujun Yan, Jiong Zhu, Marlena Duda, Eric Solarz, Chandra Sripada, and Danai Koutra. Groupinn: Grouping-based interpretable neural network for classification of limited, noisy brain data. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 772–782, 2019.
[56] Yi Yang, Yanqiao Zhu, Hejie Cui, Xuan Kan, Lifang He, Ying Guo, and Carl Yang. Data-efficient brain connectome analysis via multi-task meta-learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4743–4751, 2022.
[57] BT Thomas Yeo, Fenna M Krienen, Jorge Sepulcre, Mert R Sabuncu, Danial Lashkari, Marisa Hollinshead, Joshua L Roffman, Jordan W Smoller, Lilla Zöllei, Jonathan R Polimeni, et al. The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of neurophysiology, 2011.
[58] Kanhao Zhao, Boris Duka, Hua Xie, Desmond J Oathes, Vince Calhoun, and Yu Zhang. A dynamic graph convolutional neural network framework reveals new insights into connectome dysfunctions in adhd. Neuroimage, 246:118774, 2022.

Appendix A Experiment Details

A.1 Pretraining

Table 5 provides a summary of our pretraining configurations, which are used for training BrainMAE on various datasets. Our ROIs are extracted based on Schaefer2018_100ROIs Parcels, which include 100 cortical ROIs [42]. During training, we randomly select a continuous segment of fMRI signals lasting 300 seconds. For instance, we randomly select 300 consecutive seconds from the original 864 seconds of data. This ensures that the signals used for masking and subsequent reconstruction are of equal size, allowing for the construction of mini-batches for parallelized training.

Additionally, this approach offers the advantage of efficient GPU memory utilization and scalability. It also introduces a degree of randomness, which, to some extent, serves as data augmentation and benefits representation learning.
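The random segment selection described above can be sketched as follows. This is only an illustration under stated assumptions: the data layout `signal[roi][t]` at 1 s resolution and the helper name `random_crop` are hypothetical and are not taken from the released code.

```python
import random

def random_crop(signal, crop_len=300, rng=random):
    """Return a random contiguous window of `crop_len` time points.

    `signal` is assumed to be a list of ROI time series (signal[roi][t]),
    already resampled to 1 s resolution.
    """
    n_time = len(signal[0])
    if n_time < crop_len:
        raise ValueError("run is shorter than the requested crop")
    # Valid start indices are 0 .. n_time - crop_len (inclusive).
    start = rng.randrange(n_time - crop_len + 1)
    return [series[start:start + crop_len] for series in signal]

# Toy run: 100 ROIs x 864 s, where each sample stores its own time index.
run = [list(range(864)) for _ in range(100)]
crop = random_crop(run, crop_len=300, rng=random.Random(0))
```

Because every crop has the same length, crops from different runs can be stacked into a mini-batch without padding.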

A.2 Transfer Learning

We fine-tune all BrainMAE models following the parameters outlined in Table 6. During both the training and testing phases, we utilize the original length of the fMRI data. For example, in the case of HCP3T with 864 seconds of data, we use the first 855 seconds, dividing it into 57 time segments that are fed into the model. The decoder is omitted during transfer learning. A task-specific linear head is appended to the CLS token representation generated by the encoder transformer for task-specific predictions, as shown in Figure 5A.
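The segment arithmetic in this example (855 seconds split into 57 segments) implies a segment length of 15 seconds. The following sketch makes the truncation explicit; the helper name `split_into_segments` is hypothetical.

```python
def split_into_segments(series, seg_len=15):
    """Truncate a 1-D series to a multiple of `seg_len` and split it."""
    n_seg = len(series) // seg_len       # number of complete segments
    usable = n_seg * seg_len             # samples kept after truncation
    return [series[i:i + seg_len] for i in range(0, usable, seg_len)]

segments = split_into_segments(list(range(864)), seg_len=15)
print(len(segments), len(segments) * 15)  # 57 855
```

The trailing 9 seconds of the 864-second run do not fill a complete segment and are dropped.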

Table 5: Pretraining Settings

config	value
optimizer	AdamW
Training epochs	1000
weight decay	0.05
optim momentum	$\beta_1, \beta_2 = 0.9, 0.95$
Base learning rate	0.001
$\lambda$	0.75
batch size	32
batch accumulation	4
learning rate schedule	cosine decay
warmup epochs	100
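
To make the schedule in Table 5 concrete, the sketch below computes a per-epoch learning rate from the listed base learning rate, warmup epochs, and total epochs. The exact shape is an assumption (linear warmup followed by cosine decay to zero); the table only states "cosine decay" and the number of warmup epochs. The effective batch size assumes that batch accumulation multiplies the per-step batch size.

```python
import math

def learning_rate(epoch, base_lr=1e-3, warmup_epochs=100, total_epochs=1000):
    """Linear warmup followed by cosine decay to zero (assumed shape)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Batch size x accumulation steps: samples contributing to one optimizer update.
effective_batch = 32 * 4
```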

Table 6: Fine-tuning settings

config	value
optimizer	AdamW
Training epochs	150
Train:Val:Test (each fold)	0.64:0.2:0.16
weight decay	0.05
optim momentum	$\beta_1, \beta_2 = 0.9, 0.95$
Base learning rate	0.001
batch size	64
batch accumulation	2
learning rate schedule	cosine decay
clip_grad	5

Appendix B Model Variants

B.1 vanilla-BrainMAE

The vanilla-BrainMAE shares the exact same architecture as BrainMAE with the only exception being the use of vanilla-TSE to extract transient state embeddings from the fMRI segments. The vanilla-TSE, as shown in Figure 5, incorporates three standard transformer blocks that exclusively utilize self-attention. The vanilla-BrainMAE serves as a model for comparison with both AG-BrainMAE and SG-BrainMAE, allowing us to evaluate the proposed embedding-based static graph module.

B.2 vanilla-BrainAE

The vanilla-BrainAE employs the exact same architecture as vanilla-BrainMAE, with the only distinction being the use of traditional autoencoding for fMRI signal reconstruction without signal masking. vanilla-BrainAE is included as a comparative model to assess the proposed masked autoencoding approach in comparison to the other models.

Table 7: The distinction between different model variants

Model variants	Transient state encoder (TSE)	Pretraining
Primary
SG-BrainMAE	Three region-aware graph attention blocks	Masked-Autoencoding
AG-BrainMAE	Two self-attention blocks stacked on top of a region-aware graph attention block	Masked-Autoencoding
Other
vanilla-BrainMAE	Three self-attention blocks	Masked-Autoencoding
vanilla-BrainAE	Three self-attention blocks	Autoencoding
PosSG-BrainMAE	Three absolute-position-embedding-informed attention blocks (see Appendix H.1)	Masked-Autoencoding
SG-BrainMAE (SL)	Three region-aware graph attention blocks where attentions are computed without self-loop removal (see Appendix H.2)	Masked-Autoencoding

Appendix C Datasets

Table 8: Dataset Statistics

	HCP-3T	HCP-7T	HCP-Aging	NSD
Number of subjects	897	184	725	8
Number of sessions	3422	720	2400	3120
Number of TRs	1200	1200	478	301
Original TR (s)	0.72	1.00	0.80	1.00
Number of TRs after interpolation to 1 s	864	-	382	-
Type	Resting-state	Resting-state	Resting-state	TASK
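
The interpolated lengths in Table 8 follow from the scan duration: 1200 TRs at 0.72 s cover 864 s, and 478 TRs at 0.80 s cover 382.4 s, of which 382 whole seconds are kept. The sketch below reproduces these counts with linear interpolation. The interpolation method is an assumption, since the text does not specify it, and `resample_to_1s` is a hypothetical helper.

```python
def resample_to_1s(series, tr):
    """Linearly interpolate a series sampled every `tr` seconds onto a 1 s grid."""
    duration = len(series) * tr          # total scan time in seconds
    n_out = int(duration + 1e-9)         # whole seconds covered (guards float error)
    out = []
    for t in range(n_out):
        pos = t / tr                     # fractional index into the original series
        lo = min(int(pos), len(series) - 1)
        hi = min(lo + 1, len(series) - 1)
        frac = pos - lo
        out.append((1 - frac) * series[lo] + frac * series[hi])
    return out

hcp3t = resample_to_1s([0.0] * 1200, tr=0.72)
aging = resample_to_1s([0.0] * 478, tr=0.80)
```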

HCP-3T/HCP-7T Datasets.

The Human Connectome Project (HCP) is a freely shared dataset from 1200 young adult (ages 22-35) subjects, using a protocol that includes structural images (T1w and T2w), functional magnetic resonance imaging (resting-state fMRI, task fMRI), and high angular resolution diffusion imaging (dMRI) at 3 Tesla (3T) and behavioral and genetic testing. Moreover, 184 subjects also have 7T MR scan data available (in addition to 3T MR scans), which includes resting-state fMRI, retinotopy fMRI, movie-watching fMRI, and dMRI. In our study, we focused on both the resting-state and task sessions of the dataset as well as 8 behavior measurements (see Table 18 for more information).

HCP-Aging Dataset.

The Human Connectome Project Aging (HCP-Aging) dataset is an extensive and longitudinally designed neuroimaging resource aimed at comprehensively investigating the aging process within the human brain. It comprises a wide array of multimodal neuroimaging data, such as structural MRI (sMRI), resting-state functional MRI (rs-fMRI), task-based fMRI (tfMRI), and diffusion MRI (dMRI), alongside rich cognitive and behavioral assessments. In our study, we focus on the resting-state session of the dataset as well as age and gender information.

NSD Dataset.

The Natural Scenes Dataset comprises whole-brain 7T functional magnetic resonance imaging (fMRI) scans at a high resolution of 1.8 mm. These scans were conducted on eight meticulously selected human participants, each of whom observed between 9,000 and 10,000 colorful natural scenes, totaling 22,000 to 30,000 trials, over the span of one year. While viewing these images, subjects were engaged in a continuous recognition task in which they reported whether they had seen each given image at any point in the experiment.

Table 8 provides an overview of the statistical information for each of the datasets employed in our study.

Appendix D ROI Embedding Analysis

D.1 Relationship to Principal Gradient

In neuroscience research, the principal gradient characterizes the topographical organization of brain regions, reflecting the between-network functional organization [33]. Along this principal direction, one end is associated with regions serving primary sensory/motor functions, while the other end corresponds to transmodal regions, often referred to as the default-mode network (DMN).

We identify a significant relationship between the first principal component of the pretrained ROI embeddings and the principal gradient. This finding suggests that the functional connectivity between brain networks is inherently encoded in these embeddings. Furthermore, this relationship is highly reproducible across models pretrained on different datasets, as shown in Figures 6 and 7.

D.2 Consistency Across Pretrained Models

We analyze the cross-region embedding similarity, i.e., the embedding similarity matrix, for each of the models pretrained on various datasets. We use cosine similarity to measure the similarity between two ROI embedding vectors. As shown in Figure 8, the embedding similarity matrices show converging results across differently pretrained models, suggesting that highly similar embedding profiles can be identified in different datasets, thereby supporting our hypothesis and the proposed approach.
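
As an illustration of the embedding similarity matrix, the sketch below computes pairwise cosine similarity between ROI embedding vectors. The function name and the toy embeddings are hypothetical; real ROI embeddings would come from a pretrained model.

```python
import math

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between rows of `embeddings` (list of vectors)."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in embeddings]
    n = len(embeddings)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim

# Toy embeddings: the first two ROIs point in the same direction, the third is orthogonal.
toy = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
S = cosine_similarity_matrix(toy)
```

Since cosine similarity ignores vector length, matrices from independently pretrained models can be compared even when their embedding norms differ.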

D.3 Age Effects

The ROI embedding faithfully captures the characteristics of brain ROIs within the pre-training fMRI dataset. To investigate the impact of aging on the learned ROI embeddings, we partition the HCP-Aging dataset into three distinct, non-overlapping age groups (Young: 36-52, Middle: 52-68, Old: 68-100; refer to Figure 9A for the age distribution in each group). Subsequently, we independently pre-train SG-BrainMAE for each age group. To discern variations in ROI embeddings, we study the modular structure of the network constructed from the embedding similarity matrix specific to each age group. Modularity, a standard metric in network analysis, quantifies how well the network can be divided into non-overlapping modules, that is, the degree of segregation among brain ROIs. The results in Figure 9B indicate a reduction in the modularity of ROI embeddings from the young age group to the old age group. This trend suggests a decline in the segregation of functional networks during aging, aligning with established findings in the neuroscience literature [45, 54].
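
For reference, a common definition of modularity is Newman's Q, which compares the edge weight inside each community with the weight expected under a degree-preserving null model. The sketch below computes Q for a given partition. It is an assumption that this definition matches the one used here, and how the graph is built from the similarity matrix (e.g., thresholding) and how communities are detected are not specified in the text.

```python
def modularity(adj, labels):
    """Newman modularity Q of an undirected weighted graph.

    adj    : symmetric matrix (list of lists) of non-negative edge weights
    labels : community label for each node
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)                  # equals 2m for an undirected graph
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m

# Toy graph: two disconnected triangles, which form two perfectly segregated modules.
tri = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
adj = [row + [0, 0, 0] for row in tri] + [[0, 0, 0] + row for row in tri]
q_split = modularity(adj, [0, 0, 0, 1, 1, 1])
q_merged = modularity(adj, [0, 0, 0, 0, 0, 0])
```

A lower Q for the same kind of partition indicates weaker segregation between modules, which is the direction of the age effect reported above.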

Appendix E fMRI Representation Analysis

E.1 Brain States

E.1.1 Task vs. Rest

Taking a step further, we analyze the representations of fMRI scans within each subject. In the NSD dataset, each subject performed multiple task sessions (image-memory task) as well as a few resting-state sessions (spontaneous rest). Figure 10B shows the within-subject t-SNE analysis of the representations extracted by the SG-BrainMAE pretrained on HCP-3T. Remarkably, in most cases, the resting-state fMRI runs are well separated from the task fMRI runs and exhibit distinct representations. This result further suggests that, within individuals, the pre-trained representation carries meaningful information reflecting one’s brain state.

E.1.2 Brain Arousal State

To interpret the representations extracted by the model fine-tuned on the NSD dataset for downstream task performance prediction, we conduct PCA on the fMRI representations (the output of the CLS token). Intriguingly, as shown in Figure 11, we observe a close relationship between the second principal component and the drowsiness index, a metric for measuring the brain arousal level [9, 16]. This finding suggests a convergence between our data-driven approach and traditional neuroscience studies that quantify brain states with multimodal equipment. It implies that the proposed method could serve as a valuable tool for identifying brain states using fMRI alone, without additional modalities such as EEG.

Appendix F HCP Transient Mental State Decoding

Table 9 presents detailed results for each mental state from the multi-class decoding task, demonstrating consistently high decoding accuracy that appears to be insensitive to variations in state duration.

Table 9: Results for transient mental state decoding.

Task	Mental States		Model F1-score(%)
Task	Name	Duration(second)	SG-BrainMAE	AG-BrainMAE	vanilla-BrainMAE	vanilla-BrainAE
Working Memory	body	27.5	94.98±1.78	95.02±1.19	93.58±3.19	89.42±1.88
	faces	27.5	95.80±2.28	96.11±1.65	95.52±0.99	92.50±1.98
	places	27.5	97.47±1.25	98.20±1.75	96.94±1.59	96.12±2.72
	tools	27.5	94.84±2.62	95.15±1.33	93.19±2.85	88.45±4.44
Gambling	win	28.0	86.86±2.40	89.63±3.16	86.52±2.46	82.07±3.04
Gambling	loss	28.0	87.27±1.09	89.21±2.88	87.57±1.98	80.90±3.77
Motor	left finger	12.0	98.95±1.46	99.25±1.31	98.67±1.23	98.40±1.71
	right finger	12.0	98.03±1.77	98.04±1.76	97.19±2.08	97.45±2.29
	left toe	12.0	98.20±1.58	98.35±1.63	96.97±2.46	97.16±2.58
	right toe	12.0	97.62±2.06	98.21±1.87	96.99±2.56	97.08±2.71
	tongue	12.0	98.81±1.87	98.66±1.45	97.96±2.03	98.54±1.72
Language	story	25.9	97.52±1.93	97.73±2.14	97.58±1.68	97.73±1.24
Language	math	16.0	98.86±0.99	98.96±0.93	98.65±0.56	98.57±0.67
Social	interaction	23.0	98.22±1.38	97.84±1.02	97.95±0.93	97.59±0.88
Social	no interaction	23.0	98.55±1.38	97.61±1.57	97.64±1.38	96.92±1.14
Relational	relational	16.0	93.74±1.89	94.20±2.43	92.48±3.47	90.83±3.45
Relational	matching	16.0	93.83±2.25	93.42±1.62	92.03±2.32	90.58±2.02
Emotion	fear	18.0	97.08±1.35	97.69±1.71	96.20±2.15	96.51±1.77
Emotion	neutral	18.0	97.39±1.78	97.89±1.41	97.74±1.20	97.24±1.23
Rest	Rest	864.0	80.82±9.76	80.62±1.05	87.24±5.46	82.05±6.12

Appendix G Simulation Analysis

We first define N networks (or communities), where nodes within each network are functionally connected and exhibit similar fluctuations. We then assign each Region of Interest (ROI) a vector that represents the probability of belonging to each network. For every synthetic fMRI run, ROIs are allocated to networks based on their probability (the higher the probability, the greater the likelihood of the ROI being assigned to that network). ROIs within the same network are assumed to exhibit identical fluctuation profiles, adhering to the principle of "wire together, fire together". We simulate the time series for each ROI using a time-varying sine wave. ROIs within the same network share a common phase with small random perturbations, whereas ROIs from different networks are characterized by distinct phase profiles. Across different synthetic fMRI runs, the same ROI may be classified into different networks.

For the experiment, we utilize N=10 networks and 100 ROIs. SG-BrainMAE is pre-trained on this synthetic dataset with the same hyper-parameters as described in Appendix A.1 and Table 5.
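
The simulation procedure above can be sketched as follows. This is a minimal illustration, not the generator used for the reported experiment: the frequency range, the linear frequency drift that makes the sine wave time-varying, the phase-noise scale, and the membership probabilities are all hypothetical choices.

```python
import math
import random

def simulate_run(membership, n_time=300, rng=random):
    """Generate one synthetic run.

    membership[r][k] is the probability that ROI r belongs to network k.
    Returns (assignment, signals) with signals[r][t].
    """
    n_net = len(membership[0])
    # Each network gets its own frequency, frequency drift, and phase offset.
    base_freq = [rng.uniform(0.01, 0.1) for _ in range(n_net)]
    drift = [rng.uniform(0.0, 1e-4) for _ in range(n_net)]
    base_phase = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_net)]

    assignment, signals = [], []
    for probs in membership:
        k = rng.choices(range(n_net), weights=probs)[0]  # sample the ROI's network
        phase = base_phase[k] + rng.gauss(0.0, 0.05)     # small random perturbation
        series = [
            math.sin(2.0 * math.pi * (base_freq[k] + drift[k] * t) * t + phase)
            for t in range(n_time)
        ]
        assignment.append(k)
        signals.append(series)
    return assignment, signals

# 100 ROIs, N = 10 networks; ROI r prefers network r % 10 with probability 0.91.
membership = [[0.91 if k == r % 10 else 0.01 for k in range(10)] for r in range(100)]
assignment, signals = simulate_run(membership, rng=random.Random(0))
```

Calling `simulate_run` repeatedly with the same `membership` yields runs in which an ROI usually, but not always, falls into its preferred network, mirroring the run-to-run variability described above.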

Appendix H Additional Results

H.1 Ablation Study on ROI Embedding

Given that the position of the ROIs is static, there exists a possibility that the learned ROI embedding may predominantly encode information about the absolute position rather than the functional characteristics of the ROI. This hypothesis is investigated with two analyses from different perspectives.

Analysis 1. We substitute the ROI embedding in the SG-BrainMAE model with a position embedding, which is then frozen throughout the pretraining phase on the HCP-3T dataset while keeping the other model components unchanged. This adapted model is named PosSG-BrainMAE. Evaluations involving both the reconstruction of masked signals on an independent HCP-7T dataset (see Figure 12) and performance in downstream tasks (see Tables 10, 11, and 12) reveal a decreased efficacy compared to SG-BrainMAE. This decline in performance supports the view that the learned ROI embeddings contain valuable ROI information that extends beyond mere positional information.

Analysis 2. We cyclically shift each ROI’s fMRI signal by a random number of time steps. The shifted fMRI signals then have two properties: (1) each ROI’s signal is itself barely altered; (2) the inter-relationships between pairs of ROIs are eliminated. We then follow the same pre-training procedure using this modified dataset. The ROI embeddings learned from this dataset did not exhibit functional specificity and inter-regional connectivity (see Figure 3), in contrast to those learned from the actual dataset (see Figure 13). These findings provide additional evidence that the proposed method learns meaningful ROI information from the dataset, including its relationship to other ROIs.
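
The cyclic-shift control in Analysis 2 can be sketched as below, assuming an independent uniformly random offset per ROI; the offset distribution and the helper name `cyclic_shift_rois` are hypothetical.

```python
import random

def cyclic_shift_rois(signal, rng=random):
    """Circularly shift each ROI series by an independent random offset."""
    shifted = []
    for series in signal:
        k = rng.randrange(len(series))   # offset in [0, T)
        # Rotate right by k samples; k == 0 leaves the series unchanged.
        shifted.append(series[-k:] + series[:-k] if k else list(series))
    return shifted

# Toy data: 5 ROIs sharing the same ramp, so any misalignment is easy to see.
toy = [list(range(10)) for _ in range(5)]
shifted = cyclic_shift_rois(toy, rng=random.Random(1))
```

Each shifted series keeps exactly the same samples and temporal structure apart from the wrap-around point, while the relative timing between ROIs is randomized.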

H.2 Ablation Study on Self-Loop Removal

Removing self-loops prevents the attention mechanism from favoring a node's own signal and encourages it to aggregate signals from other relevant nodes. This approach is akin to a group voting system and helps reduce sensitivity to input noise. To validate this design choice, we conduct a comparative analysis on downstream tasks between SG-BrainMAE and a similar model that includes self-loops, named SG-BrainMAE(SL). The results, presented in Tables 10, 11, and 12, show a slight decrease in performance for SG-BrainMAE(SL), indicating the effectiveness of excluding self-loops in our model.
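
One way to picture self-loop removal is to mask the diagonal of the attention score matrix before the softmax, so that each ROI's output is a weighted vote over the other ROIs only. The sketch below illustrates this idea. It is an assumption about the mechanism rather than the released implementation, and it does not show how the scores themselves are computed.

```python
import math

def attention_without_self_loops(scores):
    """Row-wise softmax over `scores` with the diagonal masked out.

    scores[i][j] is the unnormalized affinity of ROI i toward ROI j.
    Requires at least two ROIs.
    """
    weights = []
    for i, row in enumerate(scores):
        others = [s for j, s in enumerate(row) if j != i]
        peak = max(others)               # subtract the max for numerical stability
        exps = [0.0 if j == i else math.exp(s - peak) for j, s in enumerate(row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy scores where every ROI would otherwise attend mostly to itself.
scores = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
W = attention_without_self_loops(scores)
```

In this toy case the large diagonal scores are ignored and each row splits its attention equally between the two remaining ROIs.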

Table 10: More ablation study: behavior prediction.

Model	Gender		Behaviors (measured in MAE)
Model	Accuracy(%)	AUROC	PicSeq	PMT_CR	PMT_SI	PicVocab	IWRD	ListSort	LifeSatisf	PSQI
SG-BrainMAE	97.49±0.15	97.46±0.18	5.06±0.21	1.63±0.08	1.24±0.04	3.40±0.14	1.11±0.04	4.35±0.12	3.64±0.27	1.05±0.06
PosSG-BrainMAE	95.74±1.08	95.98±0.96	5.99±0.30	1.83±0.09	1.41±0.08	3.80±0.08	1.28±0.05	4.83±0.20	4.02±0.08	1.22±0.08
SG-BrainMAE(SL)	96.63±1.56	96.62±1.55	5.53±0.15	1.72±0.06	1.32±0.07	3.61±0.10	1.15±0.03	4.64±0.06	3.87±0.10	1.14±0.04

Table 11: More ablation study: age prediction.

Model	Gender		Aging(MAE)
Model	Accuracy(%)	AUROC	Aging(MAE)
SG-BrainMAE	92.67±1.07	92.51±1.07	5.75±0.44
PosSG-BrainMAE	88.38±2.93	88.28±3.19	6.66±0.71
SG-BrainMAE(SL)	91.29±1.34	91.45±1.33	5.68±0.31

Table 12: Task performance prediction.

Model	Task Accuracy	RT(ms)
SG-BrainMAE	0.069±0.004	90.678±1.767
PosSG-BrainMAE	0.080±0.004	100.064±4.439
SG-BrainMAE(SL)	0.078±0.003	98.469±1.675

H.3 Comparison with Self-supervised Learning on Gender Classification on HCP-3T dataset

We conducted a comparative analysis between our method and another recent self-supervised learning approach, named TFF, which employs 3D Convolutional Neural Networks (CNNs) and a transformer to extract volumetric representations of fMRI data at each time point, with pre-training via auto-encoding. The results, shown in Table 13, demonstrate that our model outperforms TFF in the HCP gender prediction downstream task.

Table 13: HCP: gender classification

Model	Accuracy(%)
TFF	94.09
Ours
SG-BrainMAE	97.49±0.15
AG-BrainMAE	97.13±0.56

H.4 Comparison with Traditional Machine Learning Models

Given the effectiveness and prevalence of traditional machine learning (ML) models in neuroimaging communities, this section focuses on assessing the added performance benefits of utilizing complex deep learning-based methods in comparison to these simpler ML models.

For regression tasks, we consider a suite of linear models, including ordinary linear regression, ridge regression, and elastic net. For classification tasks, we explore the use of logistic regression, linear Support Vector Machine (SVM), and Random Forest models. Each of these models is trained to make predictions based on the flattened upper triangle of the functional connectivity (FC) matrix.
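A minimal sketch of how such an input feature vector might be built is given below. It assumes that FC is computed as the Pearson correlation between ROI time series and that the diagonal is excluded; neither detail is stated in this section, and the function name is hypothetical.

```python
import numpy as np

def fc_upper_triangle(signals):
    """Flatten the upper triangle of a functional connectivity matrix.

    signals: array of shape (n_rois, n_timepoints).
    Returns a vector of length n_rois * (n_rois - 1) / 2.
    """
    # Pearson correlation between every pair of ROI time series (assumed FC definition).
    fc = np.corrcoef(signals)
    # k=1 skips the diagonal, which is always 1 (assumption: diagonal excluded).
    rows, cols = np.triu_indices(fc.shape[0], k=1)
    return fc[rows, cols]

# Toy example: six ROIs, 100 time points of random data.
rng = np.random.default_rng(0)
features = fc_upper_triangle(rng.standard_normal((6, 100)))
```

Because the FC matrix is symmetric, the upper triangle carries all of its distinct entries, so the feature dimension grows quadratically with the number of ROIs.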

We select the hyperparameters of each ML model with a cross-validated grid search, using the following model-specific ranges and increments:

• Support Vector Machine (SVM): We vary the L2 regularization coefficient from 0.1 to 10, with an increment of 0.5.
• Logistic Regression: The L2 regularization coefficient is tuned from 0.1 to 10, with an increment of 0.5.
• Random Forest: Three key parameters are tuned: a. Number of trees, ranging from 1 to 250 with increments of 50. b. Maximum depth of each tree, from 5 to 50 with increments of 10. c. Minimum samples required to split a node, from 5 to 100 with increments of 20.
• Ordinary Linear Regression: This model did not require hyperparameter tuning.
• Ridge Regression: The L2 regularization coefficient is tuned from 0 to 10, with an increment of 0.5.
• Elastic Net Regression: a. The coefficient of the L2 penalty (ridge regression component) is tuned from 0 to 10, with increments of 0.5. b. The coefficient of the L1 penalty (lasso regression component) is adjusted from 0 to 1, with increments of 0.2.

For models requiring multiple hyperparameters, we train for each possible combination. The best-performing model is selected based on its performance on the validation set, using Mean Squared Error (MSE) for regression tasks and accuracy for classification tasks.
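The selection procedure can be sketched for the ridge regression case as follows. This is an illustrative NumPy version under several assumptions: a single train/validation split stands in for cross-validation, the model has no intercept (the data are assumed centered), and the helper names `fit_ridge` and `grid_search` are hypothetical rather than taken from the authors' code.

```python
import itertools
import numpy as np

def fit_ridge(X, y, alpha):
    """Ridge weights via least squares on an augmented system (no intercept)."""
    n_features = X.shape[1]
    # Appending sqrt(alpha) * I rows adds the penalty alpha * ||w||^2 to the loss.
    X_aug = np.vstack([X, np.sqrt(alpha) * np.eye(n_features)])
    y_aug = np.concatenate([y, np.zeros(n_features)])
    w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return w

def grid_search(X_tr, y_tr, X_val, y_val, grid):
    """Fit one model per hyperparameter combination; keep the lowest validation MSE."""
    names = list(grid)
    best_params, best_mse = None, np.inf
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        w = fit_ridge(X_tr, y_tr, **params)
        mse = float(np.mean((X_val @ w - y_val) ** 2))
        if mse < best_mse:
            best_params, best_mse = params, mse
    return best_params, best_mse

# Toy regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((160, 10))
y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(160)
X_tr, y_tr, X_val, y_val = X[:120], y[:120], X[120:], y[120:]

# L2 coefficient from 0 to 10 in steps of 0.5, as listed for ridge regression.
grid = {"alpha": np.linspace(0.0, 10.0, 21)}
best_params, best_mse = grid_search(X_tr, y_tr, X_val, y_val, grid)
```

Using `itertools.product` means the same loop covers models with several hyperparameters: adding another key to `grid` enumerates every combination, provided the fitting function accepts the extra argument.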

The results of this comparative analysis across three different downstream tasks are shown in Tables 14, 15, 16, and 17. They reveal that for classification tasks, traditional ML methods perform comparably to the baseline deep learning methods. This observation can be attributed to the well-established understanding that the functional connectivity matrix contains significant information about human traits and age. However, for more complex regression tasks, such as task performance prediction, which require inferring intricate brain states from the dynamics of fMRI signals, ML models often exhibit less satisfactory performance. In such scenarios, deep learning methods, with their strong capacity for representation learning, achieve markedly better results.

Table 14: HCP-3T gender classification

Model	Gender
Model	Accuracy(%)	AUROC	specificity(%)	sensitivity(%)	F1 Score(%)
Fixed-FC
BrainNetTF	94.11±0.98	94.39±0.90	95.36±0.70	93.24±2.08	93.69±1.01
VanillaTF	90.00±1.05	89.93±0.94	91.36±1.25	88.32±2.51	88.71±1.10
BrainNetCNN	90.68±1.80	90.89±1.55	93.82±1.64	87.40±3.80	89.56±1.44
Dynamic-FC
STAGIN-SERO	88.73±1.36	88.69±1.41	89.99±1.71	86.29±1.74	86.80±0.90
STAGIN-GARO	88.34±0.94	88.33±0.91	89.50±2.90	86.76±4.73	97.18±0.13
FBNETGNN	88.05±0.15	87.93±0.97	89.60±1.21	86.11±1.32	86.51±0.89
Machine Learning Model
SVM	87.55±1.79	87.53±1.91	90.16±2.31	84.31±2.05	85.76±2.07
Logistic Regression	92.16±0.77	92.10±0.71	93.18±1.17	90.91±2.04	91.16±0.84
Random Forest	77.16±2.63	77.42±2.64	85.23±2.05	67.09±4.21	72.29±3.43
Ours
SG-BrainMAE	97.49±0.15	97.46±0.18	97.66±0.91	97.28±1.19	97.18±0.13
AG-BrainMAE	97.13±0.56	97.17±0.61	97.62±0.95	96.14±0.76	96.76±0.61

Table 15: HCP-3T behavior prediction

Model	Behaviors (measured in MAE)
Model	PicSeq	PMT_CR	PMT_SI	PicVocab	IWRD	ListSort	LifeSatisf	PSQI
Fixed-FC
BrainNetTF	7.11±0.22	2.28±0.11	1.72±0.12	4.70±0.10	1.56±0.03	5.96±0.07	4.96±0.22	1.49±0.05
VanillaTF	8.19±0.46	2.73±0.07	2.13±0.07	5.93±0.14	1.81±0.08	6.91±0.21	5.71±0.08	1.68±0.05
BrainNetCNN	10.21±0.22	3.25±0.13	2.64±0.14	6.65±0.27	2.26±0.05	8.51±0.20	7.12±0.24	1.68±0.05
Dynamic-FC
STAGIN-SERO	10.22±0.15	3.49±0.05	2.70±0.06	6.78±0.20	2.26±0.05	8.51±0.20	7.12±0.24	2.12±0.04
STAGIN-GARO	10.26±0.18	3.44±0.10	2.69±0.09	6.92±0.30	2.25±0.04	8.52±0.26	7.09±0.35	2.08±0.04
FBNETGNN	8.62±0.21	2.93±0.11	2.34±0.11	5.83±0.15	1.96±0.04	7.31±0.10	6.09±0.10	1.81±0.03
ML Model
Ordinary Regression	11.23±0.25	4.05±0.07	3.34±0.07	7.50±0.13	2.58±0.08	9.90±0.26	8.05±0.18	2.50±0.07
Ridge	8.91±0.23	3.18±0.10	2.63±0.10	6.05±0.10	2.05±0.06	7.70±0.23	6.42±0.02	2.01±0.09
ElasticNet	10.83±0.14	4.01±0.03	3.29±0.04	7.44±0.22	2.35±0.05	9.21±0.19	7.29±0.28	2.15±0.07
Ours
SG-BrainMAE	5.06±0.21	1.63±0.08	1.24±0.04	3.40±0.14	1.11±0.04	4.35±0.12	3.64±0.27	1.05±0.06
AG-BrainMAE	5.09±0.05	1.67±0.10	1.28±0.06	3.34±0.11	1.13±0.03	4.37±0.06	3.58±0.17	1.07±0.05

Table 16: HCP-Aging age prediction

Model	Gender					Aging(MAE)
Model	Accuracy(%)	AUROC(%)	specificity(%)	sensitivity(%)	F1 Score(%)	Aging(MAE)
Fixed-FC
BrainNetTF	90.21±3.81	90.73±2.85	87.97±9.26	91.84±6.56	89.35±3.49	6.15±0.71
VanillaTF	88.96±2.16	88.76±2.21	89.22±2.30	88.56±3.01	87.54±2.53	6.78±0.56
BrainNetCNN	88.83±1.52	88.74±1.58	89.87±2.85	87.92±3.18	87.48±1.20	8.71±0.62
Dynamic-FC
STAGIN-SERO	82.37±1.66	82.57±1.36	85.23±4.68	76.00±6.51	77.91±3.14	8.96±0.47
STAGIN-GARO	80.67±0.81	80.58±1.03	84.63±4.05	77.95±4.43	78.81±3.27	8.65±0.28
FBNETGNN	89.50±3.58	89.34±3.49	90.04±1.94	89.05±5.18	88.35±3.45	6.68±1.00
ML Model
SVM	86.04±0.97	86.23±0.98	90.87±0.66	79.84±1.76	83.36±1.48	-
Logistic Regression	89.49±1.27	89.41±1.23	90.92±0.81	87.66±2.17	88.20±1.50	-
Random Forest	73.75±1.44	74.42±0.94	85.82±1.46	58.35±4.11	66.05±2.47	-
Ordinary Regression	-	-	-	-	-	7.63±0.21
Ridge	-	-	-	-	-	7.08±0.20
ElasticNet	-	-	-	-	-	9.43±0.61
Ours
SG-BrainMAE	92.67±1.07	92.51±1.07	97.66±0.91	97.28±1.19	97.18±0.13	5.78±0.44
AG-BrainMAE	91.12±1.99	91.15±2.03	97.62±0.95	96.14±0.76	96.76±0.61	6.49±1.00

Table 17: NSD Task performance prediction

Model	Task Accuracy	RT(ms)
Fixed-FC
BrainNetTF	0.070±0.003	92.344±2.343
VanillaTF	0.075±0.004	96.252±2.133
BrainNetCNN	0.078±0.004	102.911±2.225
Dynamic-FC
STAGIN-SERO	0.089±0.003	116.635±2.197
STAGIN-GARO	0.091±0.002	116.130±2.099
FBNETGNN	0.074±0.005	95.349±2.320
ML Model
Ordinary Regression	0.117±0.003	157.460±3.788
Ridge	0.093±0.003	129.609±2.518
ElasticNet	0.110±0.004	174.946±2.585
Ours
SG-BrainMAE	0.069±0.004	90.678±1.767
AG-BrainMAE	0.070±0.004	92.154±2.265

Table 18: Behaviors description

Behavior	Display Name	Description
PicSeq	NIH Toolbox Picture Sequence Memory Test: Unadjusted Scale Score	The Picture Sequence Memory Test is a measure developed for the assessment of episodic memory for ages 3-85 years. Participants are given credit for each adjacent pair of pictures; the maximum score is 17 (because that is the number of adjacent pairs of pictures that exist).
PMT_CR	Penn Progressive Matrices: Number of Correct Responses (PMAT24_A_CR)	Penn Matrix Test: Number of Correct Responses. A measure of abstraction and mental flexibility. It is a multiple choice task in which the participant must conceptualize spatial, design and numerical relations that range in difficulty from very easy to increasingly complex.
PMT_SI	Penn Progressive Matrices: Total Skipped Items (PMAT24_A_SI)	Penn Matrix Test: Total Skipped Items (items not presented because maximum errors allowed reached).
PicVocab	NIH Toolbox Picture Vocabulary Test: Unadjusted Scale Score	This measure of receptive vocabulary is administered in a computerized adaptive format. The respondent is presented with an audio recording of a word and four photographic images on the computer screen and is asked to select the picture that most closely matches the meaning of the word.
IWRD	Penn Word Memory Test: Total Number of Correct Responses (IWRD_TOT)	Participants are shown 20 words and asked to remember them for a subsequent memory test. They are then shown 40 words (the 20 previously presented words and 20 new words matched on memory related characteristics).
ListSort	NIH Toolbox List Sorting Working Memory Test: Unadjusted Scale Score	This task assesses working memory and requires the participant to sequence different visually- and orally-presented stimuli. Pictures of different foods and animals are displayed with both a sound clip and written text that name the item. Participants are required to order a series of objects (either food or animals) in size order from smallest to largest.
LifeSatisf	NIH Toolbox General Life Satisfaction Survey: Unadjusted Scale Score	Life Satisfaction is a concept within the Psychological Well-Being subdomain of Emotion. Life Satisfaction is one’s cognitive evaluation of life experiences and is concerned with whether people like their lives or not. This self-report measure is a 10-item calibrated scale comprised of items from the Satisfaction with Life Scale.
PSQI	Sleep (Pittsburgh Sleep Questionnaire) Total Score	The Pittsburgh Sleep Quality Index (PSQI) is a self-rated questionnaire which assesses sleep quality and disturbances over a 1-month time interval. Scores for each question range from 0 to 3, with higher scores indicating more acute sleep disturbances.

Appendix I Additional fMRI Signal Reconstruction Results By BrainMAE

Additional reconstruction examples of applying the HCP-3T pre-trained SG-BrainMAE to various unseen fMRI datasets are shown in Figure 14 for the HCP-7T dataset, Figure 15 for the NSD Task dataset, and Figure 16 for the HCP-Aging dataset. These examples demonstrate the generalizable representation learned by BrainMAE.

Model	Accuracy(%)	macro F1-score
Self-supervised baselines
CSM	94.8±0.35	92.0
Seq-BERT	89.20±0.35	84.5
Net-BERT	89.80±0.48	85.2
Ours
vanilla-BrainAE	94.24±1.06	93.31±1.07
vanilla-BrainMAE	95.56±1.37	94.93±1.52
SG-BrainMAE	95.71±1.30	95.24±1.15
AG-BrainMAE	95.98±1.62	95.59±1.51