Self-supervised Learning for Electroencephalogram: A Systematic Survey

Weining Weng, Yang Gu, Shuai Guo, Yuan Ma, Zhaohua Yang, Yuchen Liu, and Yiqiang Chen (Institute of Computing Technology, Chinese Academy of Sciences, Haidian, Beijing, China, 100083)

(2023)

Abstract.

Electroencephalography (EEG) is a non-invasive technique for recording bioelectrical signals. Integrating supervised deep learning techniques with EEG signals has recently facilitated automatic analysis across diverse EEG-based tasks. However, labeling issues have constrained the development of EEG-based deep models. EEG annotations are difficult to obtain because domain experts are required to guide collection and labeling, and the variability of EEG signals among subjects causes significant label shifts. To address these challenges, self-supervised learning (SSL) has been proposed to extract representations from unlabeled samples through well-designed pretext tasks. This paper concentrates on integrating SSL frameworks with temporal EEG signals to achieve efficient representations and presents a systematic review of SSL for EEG signals. In this paper, 1) we introduce the concept and theory of self-supervised learning and typical SSL frameworks. 2) We provide a comprehensive review of SSL for EEG analysis, including the taxonomy, methodology, and technical details of existing EEG-based SSL frameworks, and discuss the differences between these methods. 3) We investigate the adaptation of SSL approaches to various downstream tasks, including task descriptions and related benchmark datasets. 4) Finally, we discuss potential directions for future SSL-EEG research.

Self-supervised learning, electroencephalogram, contrastive learning, representation learning


1. Introduction

Electroencephalography (EEG) is a neurophysiological technique that records and measures the brain’s electrical activity. EEG signals are collected non-invasively by placing electrodes on the scalp to measure and record the electrical impulses generated by the brain (Teplan et al., 2002). Because EEG signals are an external manifestation of internal neural activity and carry abundant information related to various brain stimuli, they have been widely studied for real-world tasks such as epilepsy recognition (Gotman, 1982), emotion recognition (Bos et al., 2006), sleep research (Rodenbeck et al., 2006), and brain-computer interface applications (Al-Quraishi et al., 2018). EEG is therefore a valuable tool in neuroscience with high clinical utility, and it has become a research focus among physiological signals.

Recently, with the rapid development of deep learning and artificial intelligence, machine learning and deep learning models have been trained on labeled samples to perform classification (Siuly et al., 2016), regression (Sabbagh et al., 2020), and generation (Fahimi et al., 2019) tasks. The combination of intelligent algorithms and labeled EEG datasets under supervised learning has emerged as a powerful tool to enhance the analysis and interpretation of EEG data. Traditional machine learning methods such as the Support Vector Machine (SVM), Random Forest, and Multi-Layer Perceptron (MLP) have demonstrated their efficiency in detecting significant patterns from hand-crafted EEG features (Hosseini et al., 2020). Some simple EEG-based tasks, such as event classification, emotion recognition, epilepsy detection, and motor imagery classification, can be performed automatically by machine learning models (Du et al., 2015; Palo et al., 2015; Zhang and Chen, 2016). End-to-end deep learning frameworks composed of the Convolutional Neural Network (CNN), Long Short-Term Memory network (LSTM), Transformer (Vaswani et al., 2017), and other networks are used to model the spatial correlation between electrodes and the temporal variation of EEG signals. Deep learning methods have more parameters and more complex network structures, which give them stronger learning and expressive abilities to extract physiological information and recognize complex patterns. Adequate labeled EEG data and powerful deep learning models are therefore critical elements of intelligent EEG analysis: given a large amount of high-quality labeled EEG data, deep models trained in a supervised manner can accomplish complex EEG tasks.

The most critical challenge facing intelligent EEG analysis is the scarcity of labeled samples. Training deep models demands extensive labeled data, yet obtaining large-scale labeled EEG signals for model training is impractical. Annotating EEG signals requires manual intervention from experts well-versed in neurophysiology who are deeply familiar with the distinctive features of interest embedded in EEG data. The high cost and the need for expert knowledge in the annotation process make constructing EEG datasets extremely challenging. In addition, the rarity of specific brain states significantly affects the acquisition of EEG signals (Rafiei et al., 2022). For example, abnormal emotional states and seizure states are relatively rare among subjects, which makes sample collection more difficult. Building annotated EEG datasets for training deep models is therefore constrained by several factors: it requires the involvement of domain experts (Ge et al., 2021) and demands substantial time and cost (Chen et al., 2022a), which poses a significant challenge for applying supervised learning to EEG analysis.

Figure 1. The general process of SSL-integrated EEG analysis. The black arrows represent forward propagation; the green and blue arrows denote backpropagation based on the pretext task loss and the downstream task loss, respectively.

Besides, supervised EEG analysis faces an inconsistency problem that severely impacts the effectiveness of supervised learning. Interpreting EEG signals often involves subjectivity and variability among subjects and evaluators (Cimtay and Ekmekcioglu, 2020). First, in some tasks the signals are annotated by the participants themselves; such labels have a strong subjective component and do not necessarily represent the actual states generated by their brains, which leads to inconsistent labels. Second, because each individual’s brain is different, brain signals vary substantially among subjects. This diversity may cause discrepancies in the labels annotated by domain experts, as different experts might assign different labels to the same EEG segment (Li et al., 2019). Such variations introduce inconsistency into the labeled samples. Mitigating the influence of this inconsistency during training and improving generalization ability have therefore become critical problems for EEG analysis.

Self-supervised learning (SSL), which leverages the intrinsic structure and information within data to train models without labels, has shown strong performance in addressing the challenges mentioned above. Self-supervised learning designs a series of pretext tasks, different from the final modeling target, that generate pseudo-labels directly from unlabeled samples to train the model (Ericsson et al., 2022). In Computer Vision (CV) and Natural Language Processing (NLP), self-supervised learning has achieved tremendous success. In CV, SSL helps the model learn effective image representations through pretext tasks such as image rotation (Gidaris et al., 2018), jigsaw (Noroozi and Favaro, 2016), and reconstruction (He et al., 2022), which significantly improve downstream task performance and sample efficiency and mitigate overfitting (Jaiswal et al., 2020). In NLP, the mask-reconstruction (MAE) and prompt-answering pretext tasks help the language model comprehensively understand textual context, enabling functions such as machine translation and conversation systems (Devlin et al., 2018). The strong representation ability and low labeled-sample requirements of the self-supervised learning paradigm therefore demonstrate its potential as an effective training method, offering new insights and tools for addressing complex problems in different domains.

Implementing SSL frameworks in the EEG field is gaining increasing attention among researchers (Rafiei et al., 2022). Figure 1 illustrates a typical SSL-integrated EEG analysis method. Several studies have investigated the combination of SSL with EEG analysis and conducted preliminary explorations of SSL for tasks based on temporal physiological signals. Accordingly, this paper comprehensively reviews the use of self-supervised learning for EEG analysis and provides an in-depth exploration of the taxonomy, the pros and cons, and the development potential of EEG-based SSL frameworks. The main contributions of this paper are as follows:

(1) Comprehensive review. This paper provides a comprehensive, up-to-date review of self-supervised learning methods for EEG analysis. We analyze the technical details of different SSL approaches for EEG signals, including the type of pretext task, the mathematical description, the performance of the SSL method, and a brief summary. By comparing different methods, we outline the general process and characteristics of EEG-based SSL methods.

(2) Systematic and reasonable taxonomy. Following the classical taxonomy of traditional self-supervised learning methods, we rigorously categorize existing studies on self-supervised learning in EEG into four major classes: the predictive-based method, the generative-based method, the contrastive-based method, and the hybrid method.

(3) Potential future directions. We also analyze the pros and cons of various methods, identify the limitations of current works, and take into account the inherent characteristics of EEG data to indicate potential directions for developing SSL-based EEG analysis.

2. Preliminary

This section provides a concise overview of traditional supervised EEG analysis methods. In addition, we outline the formal definition and mathematical description of the self-supervised learning frameworks proposed in other fields (CV, NLP), which serve as preliminaries for the EEG-based self-supervised methods.

2.1. Supervised EEG Analysis

EEG signals have been widely studied to decode brain activity for various real-world tasks. For instance, EEG has been used to recognize specific emotions (Zheng and Lu, 2015), detect seizures (Alotaiby et al., 2014), classify sleep stages (Boostani et al., 2017), recognize motor imagery (Altaheri et al., 2023), and decode visual or auditory information (Kalafatovich et al., 2020). Supervised machine learning and deep learning methods have been widely adopted to analyze EEG signals, extract features, and complete specific tasks. Existing studies can be classified into two categories (Weng et al., 2022): feature-driven and model-driven methods. Feature-driven methods combine handcrafted features with traditional machine learning classifiers to interpret EEG signals, whereas model-driven methods construct end-to-end deep learning models to automatically extract task-related EEG features.

Feature-driven methods. Feature-driven methods use specific features extracted from EEG signals to guide the analysis process. In general, they select handcrafted features that previous neuroscience research has shown to be effective for the task (Shoeibi et al., 2021). By feeding the selected features to traditional machine learning classifiers, the models can uncover patterns, relationships, and insights that help in understanding EEG signals and brain activity. Various handcrafted features have been applied across tasks: time-domain features such as the Hjorth parameters (Oh et al., 2014), higher-order crossings (Petrantonakis and Hadjileontiadis, 2009), and statistical features (Übeyli, 2009); frequency-domain features such as independent frequency bands (Cimtay and Ekmekcioglu, 2020) generated through the Fast Fourier Transform and differential entropy (Duan et al., 2013); and time-frequency features, which combine frequency features with a time window to capture how frequency features vary over time (Übeyli, 2009). Using these manually engineered features as input, machine learning models have demonstrated dependable performance in tasks such as emotion recognition, sleep stage classification, and motor imagery classification (Hosseini et al., 2020).
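
As a hedged illustration of such handcrafted features, the following minimal sketch computes the three Hjorth parameters and a differential entropy value for each channel of a synthetic signal. The function names, the random stand-in data, and the Gaussian assumption behind the closed-form differential entropy are assumptions made for this example and are not taken from the cited works.

```python
import numpy as np

def hjorth_parameters(x):
    """Return (activity, mobility, complexity) of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)    # first-order finite difference
    ddx = np.diff(dx)  # second-order finite difference
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / np.var(x))
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

def differential_entropy(x):
    """Differential entropy assuming a Gaussian signal: 0.5 * log(2*pi*e*var)."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x))

# Synthetic stand-in for an EEG segment: 4 channels, 512 samples (not real data).
rng = np.random.default_rng(0)
eeg = rng.standard_normal((4, 512))

# One feature row per channel: activity, mobility, complexity, differential entropy.
features = np.array(
    [[*hjorth_parameters(ch), differential_entropy(ch)] for ch in eeg]
)
```

In a feature-driven pipeline, a matrix like `features` would then be passed to a conventional classifier such as an SVM.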

Model-driven methods. Model-driven methods incorporate deep end-to-end models to interpret and analyze raw temporal EEG data or high-dimensional EEG features. Deep models can capture specific spatial-temporal information to infer underlying brain dynamics, quantify brain activity, and complete complex EEG-based classification or regression tasks (Craik et al., 2019). Existing model-driven methods are typical supervised deep learning approaches that rely heavily on a substantial volume of training samples. Owing to the powerful learning capabilities of deep learning and the availability of extensive labeled samples, model efficacy has been further improved across diverse complex EEG tasks.

2.2. Overview of Self-supervised Learning

Self-supervised learning extracts effective representations from unlabeled samples instead of directly training end-to-end models on labeled samples, and it has shown superior performance in learning image and sequential-context representations in CV and NLP. In this part, we outline the mathematical definition of SSL, explain essential terms, and briefly divide existing SSL frameworks into four distinct categories based on their pretext tasks.

Term explanation. We provide important definitions of terms to help further understand self-supervised learning.

• Pretext task. The pretext tasks $T=\{t_{1},t_{2},...,t_{n}\}$ are the learning objectives designed to leverage the content or structure within unlabeled data so that the model learns knowledge and effective representations. The learned representations can then be transferred to downstream tasks with limited labeled data.

• Pseudo-label. The pseudo-label $Y_{p}$ is an artificial label created from the pretext task to train the model. Pseudo-labels serve as a form of supervision that guides the self-supervised learning process to extract specific features from unlabeled samples.

• Downstream task. The downstream task $d_{t}$ is the target or final task performed using the features or representations learned in the previous training phase (via pretext tasks). It typically requires labeled samples to fine-tune the pre-trained model so that the representation becomes more specific to the downstream task.

• Human-label. The human-label refers to the label of a downstream sample annotated by human experts.

Mathematical definition. The objective of self-supervised learning is to learn a function $\mathcal{F}(x)\rightarrow\mathbb{R}^{d}$ that maps the input instances $X$ to a $d$-dimensional representation space $\mathbb{R}^{d}$ capturing essential features from unlabeled samples. SSL frameworks are generally regarded as an encoder-decoder structure comprising an encoder $f_{\theta}$ that generates the representation and several decoders $g$ that decode the representation for different tasks: the pretext task decoder $g_{\delta}^{p}$ is cascaded with the encoder to accomplish the pretext task and pre-train the model without external labels, and the downstream task decoder $g_{\xi}^{d}$ recognizes specific patterns in the representation and is used to fine-tune the model for the downstream task. In general, the training paradigms of SSL can be summarized into three categories: (1) the pre-train mode, (2) the joint-train mode, and (3) the unsupervised-train mode. The first mode, the pre-train mode, uses pretext tasks to pre-train the representation model and then uses the downstream task to fine-tune the encoder and the downstream task decoder, transferring the model to the specific task. The process can be expressed as follows:

$$\begin{aligned}\theta,\delta&=\mathop{\arg\min}\limits_{\theta,\delta}\mathcal{L}_{pt}(g_{\delta}^{p}(f_{\theta}^{p}(X)),Y_{p})\\ \theta,\xi&=\mathop{\arg\min}\limits_{\theta,\xi}\mathcal{L}_{ft}(g_{\xi}^{d}(f_{\theta}^{p}(X)),Y)\end{aligned}\tag{1}$$

where $\mathcal{L}_{pt}$ and $\mathcal{L}_{ft}$ represent the loss of pretext task and downstream task, respectively. The encoder is trained by the pretext task and fine-tuned by the downstream task to first generate effective representation and then transfer the learned knowledge into the specific task. The downstream task decoder is trained by the downstream task loss to fully leverage the representation for target task completion.

The second mode is the joint-train mode, where a joint loss function is constructed so that the pretext and downstream tasks train the model together. The pretext task collaboratively explores knowledge relevant to the downstream task and also serves as a regularization term that constrains the gradient during training, thereby mitigating overfitting. This mode can be expressed as follows:

$$\theta,\delta,\xi=\mathop{\arg\min}\limits_{\theta,\delta,\xi}\alpha\mathcal{L}_{pt}(g_{\delta}^{p}(f_{\theta}^{p}(X)),Y_{p})+\beta\mathcal{L}_{ft}(g_{\xi}^{d}(f_{\theta}^{p}(X)),Y)\tag{2}$$

where $\alpha$ and $\beta$ are hyper-parameters that balance the two losses.

The third mode is the unsupervised-train mode, which is similar to the pre-train mode, but the parameters of the encoder are frozen during the fine-tuning stage. This mode only fine-tunes the downstream task decoder to verify the generated representations’ effectiveness. The process can be formulated as follows:

$$\begin{aligned}\theta,\delta&=\mathop{\arg\min}\limits_{\theta,\delta}\mathcal{L}_{pt}(g_{\delta}^{p}(f_{\theta}^{p}(X)),Y_{p})\\ \xi&=\mathop{\arg\min}\limits_{\xi}\mathcal{L}_{ft}(g_{\xi}^{d}(f_{\theta}^{p}(X)),Y)\end{aligned}\tag{3}$$
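
The difference between the three training modes lies in which parameter groups are optimized at each stage. The sketch below is only a schematic summary of Eqs. (1) to (3) under that reading: the group names `theta`, `delta`, and `xi` mirror the symbols above, while the function names and the dictionary layout are hypothetical and chosen for illustration.

```python
def trainable_groups(mode):
    """Return, per training stage, the parameter groups that are updated.

    'theta' = encoder, 'delta' = pretext decoder, 'xi' = downstream decoder.
    """
    schedule = {
        # Eq. (1): pre-train encoder + pretext decoder, then fine-tune encoder + downstream decoder.
        "pre-train": [("pretext", {"theta", "delta"}), ("downstream", {"theta", "xi"})],
        # Eq. (2): a single stage that optimizes everything with a weighted joint loss.
        "joint-train": [("joint", {"theta", "delta", "xi"})],
        # Eq. (3): as pre-train, but the encoder is frozen during the downstream stage.
        "unsupervised-train": [("pretext", {"theta", "delta"}), ("downstream", {"xi"})],
    }
    return schedule[mode]

def joint_loss(loss_pretext, loss_downstream, alpha=1.0, beta=1.0):
    """Weighted sum of the pretext and downstream losses used in the joint mode."""
    return alpha * loss_pretext + beta * loss_downstream

for mode in ("pre-train", "joint-train", "unsupervised-train"):
    print(mode, trainable_groups(mode))
```

In a deep learning framework, the frozen encoder of the third mode would typically correspond to excluding the encoder parameters from the optimizer during the downstream stage.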

Based on the type of pretext task, SSL methods can be categorized into four types: predictive-based, generative-based, contrastive-based, and hybrid methods. Figure 2 demonstrates the general taxonomy of SSL, and the detailed explanations are as follows:

Predictive-based SSL method. The predictive-based SSL method creates classification pretext tasks that predict discrete pseudo-labels generated from unlabeled data in order to learn effective features. For instance, pretext tasks such as predicting image rotation angles (Gidaris et al., 2018) and pixel colors (Zhang et al., 2016) force the model to extract spatial features and object boundaries that benefit downstream tasks such as object detection, and pretext tasks such as next-sentence prediction (Devlin et al., 2018) help a language model understand contextual correlation. Because they are simple to implement, predictive-based SSL methods are easy to combine with traditional deep models, and proficient performance on the prediction task indicates that the model has acquired knowledge useful for downstream tasks.

Generative-based SSL method. The generative-based SSL method designs generation or reconstruction pretext tasks to capture contextual features and correlations to generate effective representations. The most widely used generative-based pretext task is the reconstruction task. This task begins by encoding the input sample into a distinctive representation, followed by a decoding process to reconstruct the original input. By making the input and output as similar as possible, the encoder can learn significant features to reconstruct the input, which are highly effective for target downstream tasks. For example, typical methods like autoencoder (Zhai et al., 2018) have been investigated to extract representations from image and textual data. Recently, the mask-reconstruction pretext task has supplanted traditional reconstruction tasks to extract the contextual information from unlabeled samples. This task masks part of the input samples and reconstructs the masked data through the contextual data, where the encoder is responsible for extracting features and generating representations, and the decoder is responsible for reconstructing the masked data. In vision tasks, the masked autoencoder (He et al., 2022) can extract spatial contextual features from unlabeled samples for downstream classification and segmentation. In language tasks, the BERT model captures token-level context correlation information, which greatly improves the performance of subsequent tasks such as machine translation and text generation.
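
The mask-reconstruction idea can be sketched in a few lines. The example below masks a random subset of time steps in a multi-channel array and evaluates a reconstruction error on the masked positions only. Masking whole time steps with zeros, the helper names, and restricting the loss to masked positions are assumptions made for this illustration; a real framework would place an encoder and decoder between the masking step and the loss.

```python
import numpy as np

def mask_input(x, mask_ratio, rng):
    """Zero out a random subset of time steps; return the masked copy and a boolean mask."""
    n_steps = x.shape[-1]
    n_masked = int(round(mask_ratio * n_steps))
    idx = rng.choice(n_steps, size=n_masked, replace=False)
    mask = np.zeros(n_steps, dtype=bool)
    mask[idx] = True
    x_masked = x.copy()
    x_masked[..., mask] = 0.0
    return x_masked, mask

def masked_mse(x, x_hat, mask):
    """Mean squared error restricted to the masked time steps."""
    return float(np.mean((x[..., mask] - x_hat[..., mask]) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 100))  # synthetic stand-in: 8 channels, 100 time steps
x_masked, mask = mask_input(x, mask_ratio=0.25, rng=rng)

# Baseline "reconstruction" that simply returns the masked input unchanged.
loss_trivial = masked_mse(x, x_masked, mask)
```

A trained decoder is expected to reach a lower masked error than this trivial baseline by exploiting the unmasked context.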

Contrastive-based SSL method. The contrastive-based SSL method adopts a "comparison" technique, which encourages similar data points to be closer in the representation space while pushing dissimilar data points apart. Augmentation methods are important in contrastive-based SSL: input samples are augmented to create positive and negative sample pairs, where positive pairs consist of similar samples and negative pairs consist of vastly dissimilar samples (Wang and Qi, 2022). By optimizing the designed contrastive loss, the model minimizes the distance between positive pairs and maximizes the distance between negative pairs to extract invariant features and transferable representations. Based on the theory of the information bottleneck (Tishby et al., 2000) and mutual information (Kraskov et al., 2004), the InfoNCE loss (Hjelm et al., 2018) was proposed to efficiently learn representations in which positive pairs are closer together in the feature space than negative pairs. Besides, SimCLR (Chen et al., 2020), MoCo (He et al., 2020), and other contrastive learning methods have become important frameworks driving the development of computer vision.
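
To make the contrastive objective concrete, the following sketch implements a common batch-wise form of the InfoNCE loss in NumPy. It assumes that row $i$ of the two embedding matrices forms the positive pair and that all other rows in the batch act as negatives; the temperature value and the function name are illustrative choices rather than settings from any cited framework.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Batch-wise InfoNCE: row i of z_a and row i of z_b are the positive pair."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                   # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # cross-entropy with the diagonal as target

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 32))  # synthetic embeddings: batch of 16, dimension 32

# Two views that nearly coincide versus two unrelated sets of embeddings.
loss_aligned = info_nce(z, z + 0.01 * rng.standard_normal((16, 32)))
loss_random = info_nce(z, rng.standard_normal((16, 32)))
```

The loss is small when each embedding is most similar to its own positive and grows when positives are no more similar than the negatives.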

Hybrid SSL method. The hybrid SSL method combines multiple SSL techniques or tasks to create a powerful framework for learning representations. The main idea is to leverage the strengths of different pretext tasks to capture diverse and informative features from unlabeled samples. The weighted fusion of losses from multiple pretext tasks enables the model to grasp multi-dimensional knowledge. It is particularly valuable when data is heterogeneous or a single pretext task may not capture all the relevant information in the unlabeled samples (Liu et al., 2022).

Following the taxonomy of typical SSL frameworks in the vision and language fields, this paper categorizes self-supervised EEG analysis methods into predictive, generative, contrastive, and hybrid frameworks. Comprehensive summaries of the different methods are provided in Sections 3 to 6. The structure of this survey is visualized in Figure 3.

3. Predictive-based SSL EEG Analysis Method

The predictive-based SSL EEG analysis method designs and solves classification pretext tasks to acquire domain-specific knowledge beneficial for various downstream tasks. Multi-channel EEG signals present distinctive characteristics, including high temporal density, pronounced temporal dependencies, and intricate inter-channel correlations, which indicates that critical features exist within the temporal, frequency, and spatial domains of EEG data. Accordingly, pretext tasks are designed to distinguish EEG samples that have been augmented through temporal, frequency, and spatial processing, so that the model acquires features from different domains. We therefore categorize existing studies into three sub-categories: (1) spatial predictive methods, (2) temporal predictive methods, and (3) transformation predictive methods. Typical frameworks of the three kinds of methods are shown in Figure 4 and Figure 5, and a summary of existing works is listed in Table 1.

3.1. Spatial Predictive Method

The spatial predictive method draws inspiration from SSL in the image domain, establishing local or global spatial-structure-related pretext tasks to help the model comprehend spatial contextual information. Figure 4a shows the typical spatial predictive framework for EEG analysis, and different methods have been investigated to extract channel correlation and brain structure, which are listed as follows:

EEG jigsaw task (Li et al., 2022a) is analogous to the image jigsaw pretext task in CV. It randomly shuffles the EEG channels and expects the model to reconstruct the original sequence of the scrambled channels or to predict the order in which the channels were shuffled. For example, assume the raw EEG data $X_{sp}\in{\mathbb{R}}^{c\times t}$, where $c$ is the number of channels and $t$ is the number of sampling points. The random shuffling operation then produces a permuted EEG matrix $X^{*}_{sp}\in{\mathbb{R}}^{c\times t}$, in which the temporal information remains unchanged but the channel order has been shuffled. The loss function of the jigsaw task can be described as follows:

$$\mathcal{L}(X^{*}_{sp},Y^{*})=-\sum_{i=1}^{N}Y^{*}_{i}\log(g_{\delta}^{p}(f_{\theta}(X^{*}_{sp})))\tag{4}$$

where $Y^{*}$ represents the one-hot pseudo-label (channel order), and $N$ represents the batch size. The loss function calculates the cross-entropy between the predicted order of the shuffled EEG sample and its corresponding label. By minimizing this pretext loss, researchers believed the model can capture spatial features related to the distribution of multi-channel EEG signals across the cortical regions of the brain, which are closely related to downstream tasks such as emotion recognition and seizure detection.
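
A minimal sketch of how a jigsaw sample and its pseudo-label might be constructed is given below. It assumes that the pseudo-label is the index of the applied permutation within a fixed permutation set, and it enumerates all permutations only because the toy example uses four channels; for realistic channel counts a fixed subset of permutations would be needed. The helper names and the synthetic data are hypothetical.

```python
import itertools
import numpy as np

def make_jigsaw_sample(x, permutations, rng):
    """Shuffle the channel axis of x (c, t); the permutation index is the pseudo-label."""
    label = int(rng.integers(len(permutations)))
    perm = np.asarray(permutations[label])
    return x[perm, :], label

def cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and a one-hot label, as in Eq. (4) for one sample."""
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[label])

rng = np.random.default_rng(0)
c, t = 4, 256
x = rng.standard_normal((c, t))                        # synthetic stand-in for one EEG sample
permutations = list(itertools.permutations(range(c)))  # 4! = 24 classes (toy setting)
x_shuffled, y = make_jigsaw_sample(x, permutations, rng)

# Loss of an untrained model that outputs uniform logits over the 24 classes.
loss_uniform = cross_entropy(np.zeros(len(permutations)), y)
```

Only the channel order changes in `x_shuffled`; every row still holds an unmodified time series, which matches the description of the permuted matrix above.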

Channel correlation prediction (Cai et al., 2023) is designed to capture the spatial correlation between different channels. The authors propose that a time delay exists in the propagation of EEG signals between channels in distinct brain regions. EEG signals experience delays when propagating from one region to a distant one, and exploring these features enables the model to understand the activation modes and information exchange patterns of brain activity. In this task, pseudo-labels are generated from the signal correlation between channels, which can be calculated as follows:

$$Y(i,j,t_{1},t_{2})=\begin{cases}1,&cossim(X_{i}(t_{1}),X_{j}(t_{2}))\geq\kappa\\0,&cossim(X_{i}(t_{1}),X_{j}(t_{2}))<\kappa\end{cases}\tag{5}$$

This function calculates the cosine similarity between the $i$-th channel at time slice $t_{1}$ and the $j$-th channel at time slice $t_{2}$. Here $cossim$ denotes the cosine similarity, and $\kappa$ is the predefined threshold that determines whether the two slices are correlated and thus assigns the pseudo-label. The binary cross-entropy loss can be used as the prediction loss to pre-train the model, where the predictions are generated by the encoder and pretext task decoder: $g_{\delta}^{p}(f_{\theta}(X_{1},X_{2}))$.
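
The pseudo-label rule of Eq. (5) can be sketched as follows. The slice length `width`, the threshold value, and the synthetic two-channel signal in which one channel is a delayed copy of the other are assumptions made for this example; they are not parameters reported by the cited work.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def correlation_label(x, i, j, t1, t2, width, kappa):
    """Pseudo-label per Eq. (5): 1 if the two slices reach similarity kappa, else 0."""
    seg_i = x[i, t1:t1 + width]
    seg_j = x[j, t2:t2 + width]
    return int(cosine_similarity(seg_i, seg_j) >= kappa)

# Synthetic two-channel signal: channel 1 is channel 0 delayed by 5 samples.
rng = np.random.default_rng(0)
base = rng.standard_normal(300)
delay = 5
x = np.stack([base[delay:delay + 200], base[:200]])

# Slices that compensate for the delay are identical; slices at the same index are not.
y_aligned = correlation_label(x, 0, 1, t1=50, t2=50 + delay, width=64, kappa=0.8)
y_unaligned = correlation_label(x, 0, 1, t1=50, t2=50, width=64, kappa=0.8)
```

In this toy setting the positive label appears only at the offset that matches the propagation delay, which is the kind of inter-channel timing information the pretext task is meant to expose.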

Replace discriminative task (Cai et al., 2023) is a binary classification task that extracts channel-specific differential features by identifying distinct components from different channels. In this task, a certain percentage $p_{r}\%$ of the original EEG signal $X_{i}$ is randomly replaced with a signal $\overline{X_{i}}$ sampled from arbitrary channels and time slices. The pseudo-labels indicate whether the current sample has been replaced and can be described as follows:

$$Y(X_{i})=\begin{cases}1,&f_{I}(X_{i},i)=0\\0,&f_{I}(X_{i},i)\neq 0\end{cases}\tag{6}$$

where $f_{I}(X_{i},i)$ is the function to judge whether the signal $X_{i}$ has been replaced or not. Subsequently, by minimizing the binary cross-entropy pretext task loss, the model learns the distinctive spatial features of different channels and retains essential information beneficial for various downstream tasks.
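
One possible way to build samples for this task is sketched below. It assumes that the signal has already been cut into fixed-length segments, that replacement happens at the segment level, and that label 1 marks a replaced segment; this label polarity, the helper name, and the synthetic data are assumptions for illustration and are not confirmed by the cited work.

```python
import numpy as np

def replace_segments(segments, replace_ratio, rng):
    """Replace a fraction of segments with segments from other (channel, time) positions.

    segments: array of shape (n_channels, n_segments, seg_len).
    Returns the modified array and a (n_channels, n_segments) array of 0/1 labels.
    """
    c, s, _ = segments.shape
    flat = segments.reshape(c * s, -1)
    n_total = c * s
    n_replace = int(round(replace_ratio * n_total))
    targets = rng.choice(n_total, size=n_replace, replace=False)
    out = flat.copy()
    labels = np.zeros(n_total, dtype=int)
    for tgt in targets:
        src = int(rng.integers(n_total - 1))
        if src >= tgt:  # skip the target index so the donor is always a different segment
            src += 1
        out[tgt] = flat[src]  # donor content is taken from the original, unmodified data
        labels[tgt] = 1
    return out.reshape(segments.shape), labels.reshape(c, s)

rng = np.random.default_rng(0)
segments = rng.standard_normal((8, 10, 64))  # synthetic: 8 channels, 10 segments of 64 samples
replaced, labels = replace_segments(segments, replace_ratio=0.2, rng=rng)
```

A binary classifier trained on `replaced` and `labels` would then have to detect which segments do not belong to their channel and time position.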

3.2. Temporal Predictive Method

Temporal predictive methods aim to capture the temporal correlation and sequential dependencies in EEG signals. Because EEG is a temporal physiological signal, temporal characteristics play an important role in various EEG-based tasks. Figure 4b shows a typical framework of temporal predictive SSL for EEG analysis, and different temporal predictive pretext tasks have been proposed to investigate potential temporal information, which are summarized as follows:

Relative positioning task is a temporal predictive method that distinguishes whether two EEG segments are close or distant in the time dimension (Banville et al., 2021). The task first constructs an EEG pair $X_{t_{i}},X_{t^{\prime}_{i}}\in{\mathbb{R}}^{c\times t}$ representing two sampled EEG segments. The representation of EEG signals should change slowly over time, which means that EEG segments close in time convey similar information, whereas those further apart exhibit significant dissimilarities (Banville et al., 2021). The parameter $\tau_{pos}$ controls the duration of the positive context. For two EEG segments $X_{t_{i}}$ and $X_{t^{\prime}_{i}}$, a temporal interval $|t_{i}-t^{\prime}_{i}|\leq\tau_{pos}$ indicates that the segments lie within the positive duration and share common underlying labels. Therefore, the pseudo-labels of the relative positioning task can be constructed as follows:

$$Y(X_{t_{i}},X_{t^{\prime}_{i}})=\begin{cases}1,&|t_{i}-t^{\prime}_{i}|\leq\tau_{pos}\\-1,&|t_{i}-t^{\prime}_{i}|>\tau_{pos}\end{cases}\tag{7}$$

where $1$ and $-1$ denote sample pairs within the positive and negative duration, respectively. The training set $S_{N}=\{(X_{t_{i}},X_{t^{\prime}_{i}}),Y(X_{t_{i}},X_{t^{\prime}_{i}})\}$ can be used to train the model with a binary classification loss to capture temporal information. This method has been widely used for continuous EEG classification tasks such as sleep stage classification (Banville et al., 2021).
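
The pair construction and the labeling rule of Eq. (7) can be sketched as follows. Measuring the temporal interval between window start indices, drawing both windows uniformly at random, and the specific window length and $\tau_{pos}$ values are simplifying assumptions for this example; a practical sampler would likely balance positive and negative pairs.

```python
import numpy as np

def sample_rp_pair(recording, win_len, tau_pos, rng):
    """Sample two windows from recording (c, T) and label the pair per Eq. (7)."""
    max_start = recording.shape[1] - win_len
    t_i, t_j = (int(t) for t in rng.integers(0, max_start + 1, size=2))
    label = 1 if abs(t_i - t_j) <= tau_pos else -1
    pair = (recording[:, t_i:t_i + win_len], recording[:, t_j:t_j + win_len])
    return pair, label, (t_i, t_j)

rng = np.random.default_rng(0)
recording = rng.standard_normal((2, 3000))  # synthetic stand-in: 2 channels, 3000 samples
pair, label, (t_i, t_j) = sample_rp_pair(recording, win_len=200, tau_pos=500, rng=rng)
```

Repeated calls to `sample_rp_pair` yield the labeled pairs that form the training set described above.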

Temporal shuffling is a variation of the relative positioning task (Banville et al., 2021). The task first samples two EEG segments $X_{t_{i}},X_{t^{\prime}_{i}}$ from the positive duration and then samples another EEG segment $X_{t^{\prime\prime}_{i}}$ either between the first two segments or in the negative duration. The three segments form the triplet $(X_{t_{i}},X_{t^{\prime}_{i}},X_{t^{\prime\prime}_{i}})$. A shuffling operation then randomly permutes the order of the segments in the triplet. The pseudo-labels indicate whether the triplet has been shuffled: 0 for a shuffled triplet and 1 for a normal triplet. The model then learns to distinguish whether the triplet has been shuffled from the concatenated differential features between segments, which can be calculated as follows:

(8)

D(X_{t_{i}},X_{t^{\prime}_{i}},X_{t^{\prime\prime}_{i}})=concat(|f_{\theta}(X_{t_{i}})-f_{\theta}(X_{t^{\prime}_{i}})|,|f_{\theta}(X_{t^{\prime}_{i}})-f_{\theta}(X_{t^{\prime\prime}_{i}})|)

where $concat$ is the vector concatenation operation. The model performs shuffling classification using the differences between the encoded segments as features, which helps it comprehend temporal dependencies within EEG signals. In addition, another temporal shuffling method proposed by Ou et al. (2022) divides the EEG slice into three equidistant sequences and then randomly shuffles their order to form the shuffled sample. The model is asked to predict the order of the shuffled input in order to capture temporal correlation. Therefore, for shuffled EEG signals, both binary classification (predicting whether they have been shuffled) and multi-class classification (predicting the order of the shuffled signals) can serve as the pretext task to extract temporal features of physiological signals at different granularities.
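The construction of the shuffled triplet and the differential feature of equation (8) can be sketched as follows. This is an illustrative example only: `maybe_shuffle`, `differential_features`, the 50% shuffling probability, and the toy `encoder` (a per-channel mean standing in for $f_{\theta}$) are assumptions, not part of the cited methods.

```python
import numpy as np


def maybe_shuffle(triplet, rng):
    """Return (triplet, label): label 1 = original order, 0 = shuffled."""
    if rng.random() < 0.5:
        return list(triplet), 1
    order = rng.permutation(3)
    while np.array_equal(order, [0, 1, 2]):  # force a real permutation
        order = rng.permutation(3)
    return [triplet[i] for i in order], 0


def differential_features(encoder, triplet):
    """Eq. (8): concatenate |h1 - h2| and |h2 - h3| of the encoded segments."""
    h1, h2, h3 = (encoder(x) for x in triplet)
    return np.concatenate([np.abs(h1 - h2), np.abs(h2 - h3)])


def encoder(x):
    """Hypothetical stand-in for f_theta: mean over time per channel."""
    return x.mean(axis=1)


rng = np.random.default_rng(0)
segments = [np.full((2, 4), v) for v in (0.0, 1.0, 3.0)]  # toy (channels, time)
shuffled, label = maybe_shuffle(segments, rng)
features = differential_features(encoder, shuffled)  # input to the classifier
```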

Temporal trend prediction (Ko and Suk, 2022) is a task that identifies the underlying trend of an EEG signal in order to capture short-term and long-term dynamic patterns. This task assigns the EEG signal to one of three categories according to its temporal characteristics: stationary, trend-stationary, and cyclostationary. By learning to identify temporal trends, the model can comprehend the temporal relationships within signals and capture both global and local waveform information, generating temporally enriched representations that can benefit a variety of downstream tasks such as sleep stage classification.

Time shift prediction (Accou et al., 2023) is a task that predicts the time shift applied to an EEG signal by contrasting the features of the raw signal and the shifted signal. In this task, the augmented EEG signal $X_{t_{i}+\rho}$ is obtained by applying a $\rho$-step time shift to the raw EEG signal $X_{t_{i}}$. The raw signal and the shifted signal are encoded into representations, and the pretext task uses a classifier to analyze the difference between the two representations and predict by how much the raw signal was shifted. By minimizing the classification loss, the encoder learns temporally aligned features and dependencies within EEG signals, generating a representation rich in temporal information that is beneficial for long-term EEG tasks such as clinical monitoring.

3.3. Transformation Predictive Method

Figure 5 shows the general process of the transformation predictive method. This task aims to predict specific transformations applied to the EEG signals to learn signal-related features in the time-frequency domain. Different EEG transformation techniques employed to augment EEG samples to be recognized in this task can be listed as follows:

Stopped band prediction randomly removes a specific frequency band from the EEG signal and forces the model to predict the index of the removed band in order to learn frequency-related features (Jo et al., 2023). EEG signals comprise information from multiple frequency bands, with essential information concentrated within the frequency range of 1 to 50 Hz, encompassing five frequency bands: $\delta$ (0.5-4Hz), $\theta$ (4-8Hz), $\alpha$ (8-12Hz), $\beta$ (12-30Hz) and $\gamma$ (30-50Hz). This task transforms the EEG signal from the time domain to the frequency domain, randomly removes one frequency band, and maps the signal back to the time domain. The pseudo labels $Y(x_{i})\in\{0,1,2,3,4\}$ represent the index of the removed band. By forcing the model to predict the stopped band from the encoded representation $f_{\theta}(X_{i})$, the encoder $f_{\theta}$ learns frequency correlations and features that form a temporal-frequency representation.
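The band-removal step can be sketched with a simple FFT mask, as shown below. This is a minimal illustration under stated assumptions: the function name `stop_band`, the hard (brick-wall) spectral mask, and the sampling rate are choices made for this example; a practical pipeline may instead use a properly designed band-stop filter.

```python
import numpy as np

# Band edges in Hz, following the five bands listed in the text.
BANDS_HZ = [(0.5, 4), (4, 8), (8, 12), (12, 30), (30, 50)]


def stop_band(x, fs, band_idx):
    """Zero one band in the frequency domain and return to the time domain.

    x has shape (..., time); fs is the sampling rate in Hz.
    """
    lo, hi = BANDS_HZ[band_idx]
    n = x.shape[-1]
    spec = np.fft.rfft(x, axis=-1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spec[..., (freqs >= lo) & (freqs < hi)] = 0
    return np.fft.irfft(spec, n=n, axis=-1)


fs = 100
t = np.arange(200) / fs
alpha_wave = np.sin(2 * np.pi * 10 * t)        # toy 10 Hz oscillation
rng = np.random.default_rng(0)
label = int(rng.integers(0, len(BANDS_HZ)))    # pseudo-label: removed band index
augmented = stop_band(alpha_wave, fs, label)   # classifier input
```

The pair `(augmented, label)` forms one training sample for the five-way classification pretext task.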

Multi-transformation recognition aims to predict the transformation technique used to augment EEG signals to extract fine-grained signal features and form effective representations (Wang et al., 2023). In this task, EEG signals are augmented through one transformation technique, and the model is asked to recognize the transformation methods. The common encoder $f_{\theta}$ extracts features from augmented EEG signals and encodes them into representation, with multiple binary classifiers to recognize different transformation methods. Each classifier corresponds to a specific transformation method to determine its occurrence. Six transformation methods are proposed to be recognized:

(1) Noise adding adds random noise drawn from a Gaussian distribution $\mathcal{N}(\mu,\sigma^{2})$. The noise $NS_{i}$ is added directly to the original signal $X_{i}$, resulting in the noise-augmented signal $X^{ns}_{i}$.

(2) Scale transformation alters the waveform of the EEG signals. The amplitude of the EEG signal is stretched or compressed through a scale factor $\alpha$, where the scale-augmented signal can be expressed as $X^{st}_{i}=\alpha*X_{i}$.

(3) Horizontal flipping flips the EEG signal horizontally by negating its amplitude. The horizontal-augmented signal can be expressed as $X^{h}_{i}=-X_{i}$.

(4) Vertical flipping flips the EEG signal (each sample) vertically. The vertical-augmented signal can be described as $X^{v}_{i}=flip(X_{i})$, where $flip$ represents the vertically symmetric flip of the segment.

(5) Temporal dislocation is consistent with the temporal shuffling in predictive methods. This method divides EEG segments into sub-segments and randomly shuffles these sub-segments to form the dislocation-augmented signal $X^{td}_{i}$.

(6) Time warping randomly stretches and compresses sub-segments to form the augmented samples. This method randomly selects sub-segments to stretch or compress and then reassembles all sub-segments to construct the warping-augmented signal $X^{tw}_{i}$ with the same dimension as the original EEG signal.

By recognizing different transformation techniques, the model can generate the representation that captures temporal dependencies, frequency correlation, and time-frequency correspondences within EEG signals from unlabeled samples.
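The six transformations can be sketched in NumPy as follows. This is an illustrative example, not the implementation of Wang et al.: all function names and default parameters are hypothetical, the vertical flip is assumed to be a reversal along the time axis, and time warping is realized here by piecewise-linear resampling, which is only one of several possible choices.

```python
import numpy as np


def add_noise(x, rng, mu=0.0, sigma=0.1):
    return x + rng.normal(mu, sigma, size=x.shape)


def scale(x, alpha):
    return alpha * x


def horizontal_flip(x):
    return -x  # as defined in the text: X^h = -X


def vertical_flip(x):
    return x[..., ::-1]  # assumption: mirror the segment along the time axis


def temporal_dislocation(x, n_seg, rng):
    segs = np.array_split(x, n_seg, axis=-1)
    return np.concatenate([segs[i] for i in rng.permutation(n_seg)], axis=-1)


def time_warp(x, n_seg, rng, low=0.5, high=2.0):
    """Stretch or compress each sub-segment, then resample to the input length.

    x has shape (channels, time).
    """
    t = x.shape[-1]
    bounds = np.linspace(0, t - 1, n_seg + 1)      # original segment edges
    factors = rng.uniform(low, high, size=n_seg)   # per-segment stretch factor
    warped = np.concatenate([[0.0], np.cumsum(np.diff(bounds) * factors)])
    query = np.interp(np.linspace(0, warped[-1], t), warped, bounds)
    return np.stack([np.interp(query, np.arange(t), ch) for ch in x])


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 64))  # toy EEG: 3 channels, 64 samples
augmented = {
    "noise": add_noise(x, rng),
    "scale": scale(x, 1.5),
    "h_flip": horizontal_flip(x),
    "v_flip": vertical_flip(x),
    "dislocation": temporal_dislocation(x, 4, rng),
    "warp": time_warp(x, 4, rng),
}
```

Each key of `augmented` corresponds to the positive class of one binary classifier in the multi-transformation recognition task.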

Table 1. Summary of predictive-based SSL methods for EEG analysis. In the “Training Mode” column, “PT” denotes the pre-training and fine-tuning mode, “UT” denotes the unsupervised training mode, and “CT” denotes the joint-training mode.

Approach	Type of pretext method	Detailed method	Backbone	Downstream Tasks	Training Mode
EEG scaling SSL (Xu et al., 2020)	transformation predictive	scaling prediction	SVM	epileptic classification	PT
Transformation SSL (Wang et al., 2023)	transformation predictive	multi-transformation	CNN	emotion recognition	UT
Task-agnostic SSL (Partovi et al., 2023)	transformation predictive	multi-transformation	CNN	seizure & motor imagery	UT
Temporal EEG SSL (Gramfort et al., 2021)	temporal predictive	relative position	-	sleep & pathology prediction	PT
SSL-EED AD (Zheng et al., 2022)	transformation predictive	noise classification	CNN,SVM	pathology prediction	UT
Speech-EEG SSL (Accou et al., 2023)	temporal predictive	temporal-shift	CNN	speech decoding	UT
SSL MI-EEG (Ou et al., 2022)	temporal predictive	temporal shuffling	CNN	motor imagery	PT
MtCLSS (Li et al., 2022b)	transformation predictive	multi-transformation	CNN	pediatric sleep classification	CT
MM Emotion (Montero Quispe et al., 2022)	transformation predictive	multi-transformation	CNN	emotion recognition	PT
SSTSC (Xi et al., 2022)	transformation predictive	relative position	CNN	seizure detection	PT
Clinical EEG SSL (Banville et al., 2021)	temporal predictive	relative position, temporal shuffling	CNN	sleep classification, pathology classification	PT
SSL for sleep EEG (Banville et al., 2019)	temporal predictive	relative position, temporal shuffling	CNN	sleep classification	PT
EEG-oriented SSL (Ko and Suk, 2022)	transformation predictive, temporal predictive	band-stop prediction, temporal-trend	CNN	sleep & pathology, motor imagery	PT
Robust EEG SSL (Jo et al., 2023)	transformation predictive, temporal predictive	band-stop prediction, temporal-trend	CNN	sleep & pathology	PT
MBrain (Cai et al., 2023)	spatial predictive	channel correlation, replace discriminative	CNN,LSTM	seizure detection	PT

3.4. Section Discussion

This section reviews the predictive-based EEG analysis methods, which we categorize into three sub-categories: spatial predictive, temporal predictive, and transformation predictive methods. The spatial predictive tasks focus on exploring channel-correlation features. In contrast, the temporal predictive tasks incorporate temporal dependency features, time-correlation features, and consistent temporal information into the representation. The transformation predictive tasks help the model extract temporal-frequency aligned features by recognizing typical signal transformation techniques. These pretext tasks are simple to implement: the encoder $f_{\theta}$ can be a CNN or an LSTM, and the pretext task decoder $g_{\delta}^{p}$ can be a simple feed-forward neural network or a traditional machine learning classifier. The predictive tasks require only a few parameters and no complex network architecture, but they may struggle to learn general representations for downstream tasks.

4. Generative-based SSL EEG analysis method

Different from predictive methods, generative-based SSL EEG analysis methods are more complex and challenging. The key terms of this method are “Reconstruction” and “Generation”, through which fine-grained correlations and features can be captured. “Reconstruction” means reconstructing masked or transformed samples to learn an effective representation, and “Generation” means generating specific content to train the model to learn specific knowledge. In EEG analysis, the generative-based SSL method adopts signal reconstruction and generative-adversarial tasks as the pretext tasks, which can be categorized into three sub-categories according to the task target: (1) Temporal reconstruction task, (2) Multi-domain reconstruction task, and (3) Generative adversarial task. The typical frameworks of the three kinds of methods are demonstrated in Figure 6 and Figure 7, and the summary of existing works is listed in Table 2.

4.1. Temporal Reconstruction Task

The framework of the temporal reconstruction task is shown in Figure 6a. It is inspired by the autoencoder method (Zhai et al., 2018), which reconstructs the input data to capture contextual features without the need for human-labeled samples. The reconstruction task enables the encoder to learn fine-grained correlations within the input and thus generate representations containing rich contextual information. The EEG signal is sequential temporal physiological data, which makes it suitable for the temporal reconstruction pretext task (Devlin et al., 2018): the task learns contextual correlation within the signal, enhances the understanding of temporal dependencies, and provides effective representations for various EEG-based downstream tasks. Different temporal reconstruction tasks are listed as follows:

EEG-based autoencoder (Huang et al., 2023) adapts the autoencoder (Hinton and Salakhutdinov, 2006) to EEG analysis. In this method, EEG signals are encoded into a low-dimensional representation by the encoder $f_{\theta}$, and the low-dimensional representation is then used to reconstruct the original signal through the pretext task decoder $g_{\delta}^{p}$, which is symmetrical to the encoder. The encoder is responsible for preserving critical EEG signal information, while the decoder is responsible for reconstructing the EEG signal from the generated representation. The reconstruction loss can be calculated as follows:

(9)

\mathcal{L}(X,X^{*})=\frac{1}{len(X)}\|X-X^{*}\|_{1}

where $len(X)$ represents the length of the input signal, $\|\cdot\|_{1}$ represents the L1-norm, $X$ is the EEG signal, and $X^{*}$ is the reconstructed signal. By minimizing the difference between the original and reconstructed signals, the encoder preserves the critical information necessary for signal recovery, which can be regarded as signal compression.
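For concreteness, the loss of equation (9) can be written in a few lines of NumPy. This is a sketch under assumptions: the function name is hypothetical, the L1-norm is taken entry-wise over all channels, and $len(X)$ is interpreted as the number of time samples.

```python
import numpy as np


def l1_reconstruction_loss(x, x_rec):
    """Eq. (9): entry-wise L1 distance divided by the signal length.

    x and x_rec have shape (channels, time).
    """
    return np.abs(x - x_rec).sum() / x.shape[-1]


x = np.ones((2, 4))       # toy signal: 2 channels, 4 samples
x_rec = np.zeros((2, 4))  # toy reconstruction
loss = l1_reconstruction_loss(x, x_rec)
```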

Signal-level mask-reconstruction (signal MAE) is a typical mask-reconstruction method that captures temporal signal correlation by reconstructing masked segments (Chien et al., 2022). In this framework, multi-channel EEG signals are first encoded into temporal embeddings through a 1D convolution block similar to that of the well-known wav2vec algorithm (Schneider et al., 2019; Baevski et al., 2020). The high-dimensional EEG signals $X$ are downsampled and compressed into low-dimensional feature embeddings $Z=\{z_{1},z_{2},z_{3},...,z_{k}\}$ arranged in temporal order, and the stride of the convolution block determines the number of time steps fed to the encoder. A mask $M_{i}$ is generated to randomly replace parts of the embedding $Z$, creating the masked embedding $Z^{m}$ with local information dropout. Transformer encoders are then applied to extract bidirectional temporal correlation between input slices and output the signal representation $Re=\{re_{1},re_{2},...,re_{k}\}$. The convolution block and the Transformer encoder are cascaded to form the encoder $f_{\theta}$ that generates the signal representation, followed by linear and convolution layers that serve as the pretext task decoder $g_{\delta}^{p}$ to reconstruct the raw signal $X$ from the masked signal representation $Re$. Training minimizes the cosine similarity loss shown as follows:

(10)

\mathcal{L}(X_{m},X)=1-\frac{X_{m}\cdot X}{|X_{m}||X|}

where $X_{m}$ is the reconstructed EEG signal. This method reconstructs the raw signal from the masked signal representation, with the original EEG signals serving as the pseudo-labels. In this architecture, the encoder $f_{\theta}$ is responsible for mining temporal correlation and preserving critical signal information, while the pretext decoder $g_{\delta}^{p}$ is responsible for reconstructing the original EEG signals. Therefore, this framework forces the encoder to generate representations containing fine-grained correlation and critical signal information, which exhibit strong expressiveness, generalization, and applicability across various EEG-based tasks.
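The cosine loss of equation (10) can be sketched as follows. This is an illustrative example: the function name, the flattening of the multi-channel signal into a single vector, and the small `eps` for numerical stability are assumptions of this sketch.

```python
import numpy as np


def cosine_reconstruction_loss(x_rec, x, eps=1e-12):
    """Eq. (10): one minus the cosine similarity of the flattened signals."""
    a, b = x_rec.ravel(), x.ravel()
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)


x = np.array([[1.0, 2.0], [3.0, 4.0]])           # toy signal
perfect = cosine_reconstruction_loss(x, x)        # close to 0
rescaled = cosine_reconstruction_loss(3 * x, x)   # also close to 0
inverted = cosine_reconstruction_loss(-x, x)      # close to 2
```

Note that the loss is invariant to the amplitude of the reconstruction, so it rewards recovering the waveform shape rather than the absolute scale.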

Embedding-level mask-reconstruction (embedding MAE) is another mask-reconstruction method, inspired by the BERT model (Devlin et al., 2018) in the language domain, that fuses contextual relationships into the representation (Kostas et al., 2021). Similar to the MAE framework mentioned above, EEG signals are first transformed into temporal embeddings through the convolution block and then encoded by the transformer encoder to generate signal representations, followed by the pretext task decoder that accomplishes the reconstruction task. However, different from the signal MAE, this task is based on embedding-level reconstruction: the transformed EEG embedding $Z$ is randomly masked by the generated mask vector, where $z^{*}$ represents the randomly selected embeddings to be masked. The pretext task decoder is required to reconstruct the masked embedding rather than the raw EEG signals. The contrastive loss function is designed to make the reconstructed embedding $z^{*}_{pre}$ as similar as possible to the original unmasked embedding $z^{*}$ while keeping it as dissimilar as possible to the remaining embeddings, which can be calculated as follows:

(11)

\mathcal{L}(z^{*},z^{*}_{pre})=-\log\frac{exp(sim(z^{*}_{pre},z^{*})/\eta)}{\sum_{z_{r_{i}}\in{Z}}exp(sim(z^{*}_{pre},z_{r_{i}})/\eta)}

where $z_{r_{i}}$ is a negative sample obtained by random sampling from the contextual embeddings, $sim$ represents the cosine similarity between the reconstructed and original embeddings, and $\eta$ is the temperature parameter that controls the contrastive loss. Compared with the MAE that reconstructs the original signals, embedding-level reconstruction is simpler and requires fewer parameters to capture critical contextual information and the temporal relationships between EEG embeddings. However, it may also lose some information contained in the original signal. Combined with the transformer encoder, this task can generate representations for various downstream tasks.
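A minimal NumPy sketch of the contrastive loss in equation (11) is given below. It is not the BENDR implementation: the function name is hypothetical, and the sketch assumes that the candidate set in the denominator contains the true embedding together with the sampled distractors.

```python
import numpy as np


def contrastive_reconstruction_loss(z_pred, z_true, distractors, eta=0.1):
    """Eq. (11): negative log-softmax of the similarity to the true embedding.

    z_pred, z_true: vectors of shape (d,); distractors: array of shape (m, d).
    """
    candidates = np.vstack([z_true[None, :], distractors])
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(z_pred)
    logits = (candidates @ z_pred) / norms / eta   # cosine similarity / eta
    logits = logits - logits.max()                 # numerical stability
    return float(np.log(np.exp(logits).sum()) - logits[0])


z_true = np.array([1.0, 0.0])
distractors = np.array([[0.0, 1.0]])
good = contrastive_reconstruction_loss(np.array([1.0, 0.0]), z_true, distractors, eta=1.0)
bad = contrastive_reconstruction_loss(np.array([0.0, 1.0]), z_true, distractors, eta=1.0)
```

A prediction aligned with the true embedding (`good`) yields a lower loss than one aligned with a distractor (`bad`).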

4.2. Multi-domain Reconstruction Method

The multi-domain reconstruction method extends the EEG-based MAE to multiple domains (signal, spatial, and frequency), as shown in Figure 6b. Different from the temporal reconstruction methods, this method performs collaborative and mutual reconstruction across different domains to extract aligned and complementary spatial-temporal-frequency features from the EEG signal, generating more powerful and general representations that adapt to different tasks. Detailed explanations of multi-domain reconstruction methods are as follows:

Spatial-temporal-frequency reconstruction (STF MAE) conducts a synergistic reconstruction task in the temporal-frequency-spatial domain to extract integrated EEG features (Chen et al., 2023). The idea of the synergistic reconstruction task is inspired by the time-frequency analysis method (Roach and Mathalon, 2008): temporal analysis (Balconi and Lucchiari, 2006; Singh and Malhotra, 2022) investigates the patterns of EEG amplitude changes over time, while frequency analysis (Harpale and Bairagi, 2016) studies the frequency energy distribution within EEG signals. The time-frequency analysis method uses a sliding time window to investigate temporal changes in frequency spectral features (Zhang et al., 2008). Based on the time-frequency analysis method, this task constructs a 3D matrix as the feature of the EEG signal. Through the continuous wavelet transform (CWT) (Aguiar-Conraria and Soares, 2014), EEG signals are transformed into a 3D frequency-spatial-temporal matrix $X\in\mathbb{R}^{c\times t_{n}\times f_{r}}$, where $c$ is the channel number, $t_{n}$ is the number of temporal slices, and $f_{r}$ represents the frequency feature resolution. This 3D matrix can be considered as time-frequency features (a 2D image) with multiple channels. Inspired by the image MAE, the EEG feature matrix is divided into patches and randomly masked with the mask patch $m_{p}$ to generate the masked matrix $X^{*}$. An encoder-decoder structure that uses the vision transformer (ViT) (Dosovitskiy et al., 2020) as the backbone is designed to reconstruct the EEG feature matrix. The mean squared error (MSE) can be used to train the model, which is defined as follows:

(12)

\mathcal{L}(X^{m},X^{pre})=E(X^{pre}-X^{m})^{2}=\frac{1}{n_{m}}\sum_{i=1}^{n_{m}}(X_{i}^{pre}-X_{i}^{m})^{2}

where $n_{m}$ is the dimension of the masked features $X^{m}$ , and $X^{pre}$ represents the reconstructed features generated by the encoder and pretext task decoder: $X^{pre}=g_{\delta}^{p}(f_{\theta}(X^{*}))$ . By minimizing the MSE loss, encoder $f_{\theta}$ fuses spatial-temporal-frequency contextual correlation into representation, and decoder $g_{\delta}^{p}$ is learned to reconstruct the original EEG feature matrix based on the representation. The generated representations contain multi-domain correlation and features, which exhibit greater expressive ability and a wider range of applications for downstream tasks.
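The masking step and the loss of equation (12) can be illustrated with the following NumPy sketch. The function names, the flat patch layout of shape (patches, features), and the 50% mask ratio are assumptions of this example; the encoder and decoder themselves are omitted and replaced by a toy prediction.

```python
import numpy as np


def random_patch_mask(n_patches, mask_ratio, rng):
    """Boolean mask with True marking the patches hidden from the encoder."""
    n_mask = int(round(mask_ratio * n_patches))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_mask, replace=False)] = True
    return mask


def masked_mse(pred, target, mask):
    """Eq. (12): mean squared error computed on the masked patches only."""
    return np.mean((pred[mask] - target[mask]) ** 2)


rng = np.random.default_rng(0)
target = np.zeros((20, 8))   # toy patches: 20 patches with 8 features each
pred = target + 2.0          # toy prediction standing in for the decoder output
mask = random_patch_mask(20, 0.5, rng)
loss = masked_mse(pred, target, mask)
```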

Frequency mask-reconstruction (frequency MAE) conducts mask-reconstruction task in different frequency bands to capture frequency features, long-term dependencies, and critical time-frequency correlated information (Peng et al., 2023). Initially, the EEG signal undergoes two distinct transformations: 1. The EEG signal is directly embedded into the patch sequence through division, linear projection, and flattening operation, representing the EEG temporal patch sequence. 2. The EEG signal is transformed into six independent frequency bands (0-4Hz, 4-8Hz, 8-18Hz, 16-32Hz, 32-64Hz, and other frequencies), representing the EEG frequency patch sequences. $10\%$ of the frequency patch sequences are randomly masked, followed by six independent ViT-based encoders to generate representations for all frequency bands (one encoder corresponds to one frequency band). Six independent ViT-based decoders are sequentially used to reconstruct the frequency patch sequences. Differently, the target of the pretext task is to minimize the difference between the summation of all reconstructed frequency sequences and the temporal patch sequence, which can be calculated similarly to equation (12). This task can reconstruct the temporal information by the masked frequency patch sequences, which can help the model to align temporal-frequency information in the EEG signal and understand its correlation, generating high-dimensional representations with rich time-frequency coherent features of the signals, and providing valuable EEG features for various EEG-based tasks.

Frequency-temporal reconstruction (FT MAE) is a framework that reconstructs masked EEG representations in the frequency and time domains (Wu et al., 2022). This framework transforms EEG signals into discrete patches through a non-overlapping 1D-CNN, and some patches are randomly masked with ratio $r$. A ViT-based encoder is subsequently employed to generate representations, followed by a symmetric decoder that reconstructs the masked patches. Two reconstruction methods are proposed in the framework. The first is spatiotemporal domain reconstruction, where the decoder reconstructs the masked patches directly and the MSE loss is used to train the model and capture the temporal correlations in the EEG signal. The second is Fourier domain reconstruction, which reconstructs the masked patches in the frequency domain. Through the Discrete Fourier Transform (DFT) (Bagchi and Mitra, 2012), EEG signals can be transformed from the time domain to the frequency domain:

(13)

x^{f}_{k}=\sum_{j=1}^{n}x_{j}\left[\cos(\frac{2\pi}{n}jk)-\mathbf{i}\sin(\frac{2\pi}{n}jk)\right]

where $k\in(0,n)$, $n$ is the number of sampling points of the EEG segment, $x_{j}$ is the temporal amplitude at sampling point $j$, and $\mathbf{i}$ represents the imaginary unit. $x^{f}_{k}$ represents the generated spectrum feature at frequency index $k$. The cosine term in this equation yields the real part of the result, and the sine term yields the imaginary part. Then, the magnitude and phase of the frequency signal can be calculated as follows:

(14)

\left\{\begin{aligned} magnitude_{k}&=\frac{1}{n}\sqrt{Re(x^{f}_{k})^{2}+Im(x^{f}_{k})^{2}}\\ phase_{k}&=atan2(Im(x^{f}_{k}),Re(x^{f}_{k}))\end{aligned}\right.

where $Re$ and $Im$ represent the real and imaginary parts of the spectrum feature, and $atan2$ represents the two-argument arctangent function. Wu et al. (2022) argue that both magnitude and phase are important. For EMG signals, the magnitude and phase are highly correlated with muscle movement: muscles move both longitudinally and transversely according to the direction of the fibers, so the biological impedance of the motion units changes, leading to variations in amplitude and phase responses. Analyzing magnitude and phase can therefore help the model capture muscle contraction patterns as part of the representation learning process (Wu et al., 2022). In the EEG signal, the magnitude and phase are closely related to the phase synchronization between neurons, which can help reveal the synchrony and information transmission between different brain regions. The Fourier domain reconstruction task predicts the magnitude and phase sequences of the masked EEG patches, which are then converted back to the time domain through the inverse Fourier transform. The mean squared error measures the difference between the original patches and those reconstructed from magnitude and phase. Through this task, the encoder $f_{\theta}$ learns the correlation between spectrum features and the temporal signal and captures critical knowledge about neural activity.
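The magnitude and phase computation, together with the inverse transform used for reconstruction, can be sketched with NumPy's FFT routines. This is an illustration rather than the Neuro2vec implementation: the function names are hypothetical, the phase uses the standard convention $atan2(Im, Re)$, and `np.fft.fft` indexes samples from 0 to $n-1$, which differs from the 1-based indexing of equation (13) only by a linear phase factor.

```python
import numpy as np


def magnitude_phase(x):
    """Return the normalized magnitude and the phase of the DFT along time."""
    n = x.shape[-1]
    spec = np.fft.fft(x, axis=-1)
    magnitude = np.abs(spec) / n                # sqrt(Re^2 + Im^2) / n
    phase = np.arctan2(spec.imag, spec.real)    # two-argument arctangent
    return magnitude, phase


def from_magnitude_phase(magnitude, phase):
    """Rebuild the time-domain patch from magnitude and phase (inverse DFT)."""
    n = magnitude.shape[-1]
    spec = n * magnitude * np.exp(1j * phase)
    return np.fft.ifft(spec, axis=-1).real


rng = np.random.default_rng(0)
patch = rng.standard_normal((2, 16))   # toy patches: 2 channels, 16 samples
mag, ph = magnitude_phase(patch)
rebuilt = from_magnitude_phase(mag, ph)
```

In the pretext task, `mag` and `ph` would be the regression targets for the masked patches, and `rebuilt` is compared with the original patch through the MSE.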

Spatial reconstruction (Spatial MAE) aims to learn the spatial correlation between different channels in EEG signal (Ho and Armanfard, 2023). In this framework, the correlation between EEG channels can be defined using a graph structure $\mathcal{G}=(\mathcal{A},\mathcal{X})$ , where $\mathcal{X}\in\mathbb{R}^{c\times n}$ represents the node feature matrix in the graph (each channel corresponds to a node on the graph) and $\mathcal{A}\in\mathbb{R}^{c\times c}$ is the adjacency matrix representing the connectivity between nodes. The graph structure can be calculated through channel spatial distance and correlation. In the framework, the sub-graph $\mathcal{G}^{s}$ is sampled containing $n_{s}$ nodes and their connectivity graph structure. For the sampled sub-graph, the feature of a random node is masked and then reconstructed by the model through adjacent node features and graph structure, which can train the model to capture the spatial correlations. The graph neural network (GNN) (Wu et al., 2020) is used as the backbone for the encoder and decoder to deal with topological graph data, and the MSE is the loss to measure the node reconstruction performance. By reconstructing the graph node, the generated representation enables a deeper exploration of spatial features and channel correlation, which is valuable for tasks that require high spatial resolution in EEG (such as visual decoding).

Transformation reconstruction aims to reconstruct EEG signals after different signal transformations to preserve critical signal-related information (Das et al., 2022). The model reconstructs the EEG signal after the following signal transformations: 1. signal jitter, where EEG samples are added with random noise: $x_{s}=x+s$ . 2. Random sample, where some points in the temporal EEG signal are replaced by the average value of neighbor points. This transformation can be considered as the smoothing operation. 3. Channel removal, where a specific channel in the EEG signal is removed to be reconstructed. 4. Window replace, where EEG signals in randomly selected time windows are replaced by dummy value zero. 5. Jitter in time windows, where the signal in the randomly selected time window is corrupted by noise. The model can fuse temporal correlation, spatial correlation, and transformation features into representation for downstream tasks by reconstructing raw signals from various transformations in the pre-training process.

4.3. Generative Adversarial Method

The generative adversarial method encompasses two pretext tasks: the generation task, which continually generates fake EEG samples, and the adversarial task, which strives to distinguish real samples from fake ones (shown in Figure 7) (Creswell et al., 2018; Goodfellow et al., 2014). Through the self-supervised adversarial training of the generator and discriminator in the framework, the model can generate enhanced EEG samples. In the field of EEG, two kinds of generative adversarial networks (GANs) have been investigated, which are listed as follows:

Sample generation method aims to produce new EEG samples through the generation and adversarial pretext tasks (Bhat and Hortal, 2021). This framework uses the generator $G$ and the discriminator $D$ to accomplish the generative adversarial task. The input of $G$ is either an augmented EEG signal (e.g., a masked signal) or random noise, and the output of $G$ is the generated fake EEG sample. The input of $D$ is the sample pair $(x^{n},x)$, where $x^{n}$ is the generated fake sample and $x$ is the true EEG signal. The generator aims to produce pseudo-samples highly similar to real EEG samples, while the discriminator attempts to distinguish accurately between real and fake samples. Through adversarial training, the generator can produce highly realistic EEG samples for training, which can help alleviate EEG collection and labeling issues.

Discriminator-based GAN is another generative adversarial method to extract discriminative representations from EEG signal (Fu et al., 2022). Discriminator-based GAN focuses on the discriminator to extract efficient features. By distinguishing real samples from fake ones, the discriminator can learn critical invariant and discriminative features of EEG signals. Through adversarial training, the discriminator can be considered as the encoder that can extract pre-trained EEG features and generate representations for downstream tasks.

4.4. Section Discussion

This section reviews the generative-based SSL EEG analysis methods, which conduct complex generative pretext tasks to train the encoder to capture effective signal features for downstream tasks. The existing methods are categorized into three sub-categories: 1) The temporal reconstruction task, which masks part of the temporal signal and requires the model to reconstruct it. 2) The multi-domain reconstruction task, which masks temporal-frequency features and requires the model to reconstruct them. 3) The adversarial generative task, in which a generator produces pseudo samples and a discriminator is required to distinguish real samples from fake ones. Compared with the predictive tasks, generative tasks are more challenging and need more trainable parameters and more complex structures, but they can learn more effective features in the representation. Emulating MAE, BERT, and other generative SSL methods from the vision and language fields, generative SSL methods for EEG signals have achieved significant success in various downstream tasks.

Table 2. Summary of generative-based SSL methods for EEG analysis. “PT” denotes the pre-training and fine-tuning mode, “UT” denotes the unsupervised training mode, “JT” denotes the joint-training mode, and “SA” denotes sample augmentation.

Approach	Sub-category	Detailed method	Backbone	Downstream Tasks	Training Mode
BENDR(Kostas et al., 2021)	Temporal reconstruction	Embedding MAE	CNN&Transformer	Multiple tasks	PT&UT
GANSER(Zhang et al., 2022d)	Generative adversarial	Sample generation	CNN(U-NET)	Emotion recognition	SA
EEG-CGS(Ho and Armanfard, 2023)	Multi-domain reconstruction	Spatial MAE	GNN	Seizure analysis	PT
Eeg2vec(Zhu et al., 2023)	Temporal reconstruction	Embedding MAE	CNN&Transformer	Speech decoding	UT
MAEEG(Chien et al., 2022)	Temporal reconstruction	Signal MAE	Transformer	Sleep classification	PT&UT
Cognitive MAE(Pulver et al., 2023)	Temporal reconstruction	Embedding reconstruction	CNN&Transformer	Cognitive-load classification	PT&UT
SSLAPP(Lee et al., 2022)	Generative adversarial	Sample augmentation	Transformer	Sleep classification	SA
MV-SSTMA(Li et al., 2022c)	Multi-domain reconstruction	STF MAE	CNN&Transformer	Emotion recognition	PT
EpilepsyNet(Huang et al., 2023)	Temporal reconstruction	EEG-based autoencoder	CNN	Epileptic classification	JT
EEGMAE(Chen et al., 2023)	Multi-domain reconstruction	STF MAE	ViT	ASD classification	UT&PT
brain2vec(Lesaja et al., 2022)	Temporal reconstruction	Embedding MAE	CNN&Transformer	Speech decoding	PT&UT
Wavelet2vec (Peng et al., 2023)	Multi-domain reconstruction	Frequency MAE	ViT	Seizure detection	PT
CRT (Zhang et al., 2022b)	Multi-domain reconstruction	STF MAE	Transformer	Sleep classification	UT
Neuro2vec (Wu et al., 2022)	Multi-domain reconstruction	FT MAE	Transformer	Seizure & sleep	PT&UT
SDCAN(Fu et al., 2022)	Generative adversarial	Discriminator-based	CNN	Stress classification	JT
WGAN-GP(Bhat and Hortal, 2021)	Generative adversarial	Sample generation	CNN	Emotion recognition	SA
CWGAN(Jiao et al., 2020)	Generative adversarial	Sample generation	LSTM	Sleep classification	SA
SAE-EEG(Liu et al., 2020)	Temporal reconstruction	EEG-based autoencoder	CNN	Emotion recognition	PT
AE-CDNN(Wen and Zhang, 2018)	Temporal reconstruction	EEG-based autoencoder	CNN	Seizure detection	UT
MI-AE(Mirzaei and Ghasemi, 2021)	Temporal reconstruction	EEG-based autoencoder	CNN	Motor imagery	UT

5. Contrastive-based SSL EEG analysis method

Contrastive learning is the most widely used SSL technique in EEG analysis. Contrastive learning frameworks combined with EEG augmentation methods have been investigated to generate representations that integrate the invariant features shared by positive pairs while eliminating the irrelevant features of negative pairs. The target of contrastive learning is to encourage the model to pull positive pairs (similar samples) closer together and push negative samples apart in the representation space, which is defined as follows:

(15)

\mathcal{L}_{con}\overset{\mathrm{def}}{=}max(d(x^{+},x)-d(x^{-},x)+\alpha,0)

This loss function is the triplet loss (Schroff et al., 2015), which trains the model to achieve $d(x^{-},x)>d(x^{+},x)+\alpha$ , where $d(\cdot,\cdot)$ is the distance in the representation space and $\alpha$ is a small positive margin to avoid clustering overfitting. Different augmentation methods are applied to EEG signals to form positive and negative pairs. According to the type of augmentation method used to generate positive and negative sample pairs, we categorize the contrastive-based SSL EEG analysis methods into five sub-categories: (1) contrastive predictive coding, (2) transformation contrastive learning, (3) spatial contrastive learning, (4) composite contrastive learning, and (5) task-oriented contrastive learning. The typical frameworks of the different kinds of methods are shown in Figure 8 to Figure 12, and existing works are summarized in Table 3.
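As a minimal sketch of Eq. (15), the snippet below evaluates the triplet loss on toy embeddings. The Euclidean distance for $d$, the margin value 0.2, and the function name `triplet_loss` are illustrative assumptions, not choices made by the surveyed works. The loss is zero once the negative sample is farther from the anchor than the positive sample by at least $\alpha$.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss of Eq. (15); Euclidean distance is assumed for d."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # d(x+, x)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # d(x-, x)
    return np.maximum(d_pos - d_neg + alpha, 0.0)

# Toy 2-D embeddings (illustrative values only).
x = np.array([0.0, 0.0])
x_pos = np.array([0.1, 0.0])
x_neg = np.array([1.0, 0.0])

loss_satisfied = triplet_loss(x, x_pos, x_neg)  # margin already met
loss_violated = triplet_loss(x, x_neg, x_pos)   # roles swapped on purpose
```

In the first call the margin constraint already holds, so no gradient would flow; in the second call the hinge is active and the loss grows with the violation.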

5.1. Contrastive Predictive Coding

Contrastive Predictive Coding (CPC) is a self-supervised learning technique used in NLP and CV for learning high-level representations (Oord et al., 2018; Henaff, 2020). In CPC, data are divided into overlapping context windows, which are used to generate positive and negative pairs. The main idea of CPC is to generate a representation of the context window that can accurately predict the representations of future windows, thereby extracting shared invariant features. In the EEG field, two different CPC methods have been investigated:

EEG-based CPC extends CPC to EEG analysis (Banville et al., 2021), as shown in Figure 8. This method divides EEG signals into time slices through sliding windows. The context window $X_{c}$ , containing $N_{c}$ samples, is defined as $X_{c}=\{x_{t_{i}-N_{c}+1},...,x_{t_{i}}\}$ , where $t_{i}$ is the temporal index. Similarly, the following prediction window $X_{p}$ is defined as $X_{p}=\{x_{t_{i}+1},...,x_{t_{i}+N_{p}}\}$ , where $N_{p}$ is the length of the prediction window. The encoder $g_{en}$ calculates the representation $z_{t}=g_{en}(x_{t})$ for each sample in the context and prediction windows, generating the context and prediction representation sequences $Z_{c}=\{z_{t_{i}-N_{c}+1},...,z_{t_{i}}\}$ and $Z_{p}=\{z_{t_{i}+1},...,z_{t_{i}+N_{p}}\}$ , respectively. The integrated feature $c_{t_{i}}$ is calculated by a GRU-based autoregressive encoder $g_{ar}$ that summarizes the representations within the context window. $c_{t_{i}}$ is used to predict future representations in the prediction window through the weights $W_{k},k\in[1,N_{p}]$ , where $W_{k}c_{t_{i}}$ is the prediction for $z_{t_{i}+k}$ from the contextual feature $c_{t_{i}}$ . Positive and negative pairs are then constructed: the predicted representation $W_{k}c_{t_{i}}$ forms a positive pair with the corresponding original representation $z_{t_{i}+k}$ , while forming negative pairs with the remaining representations. The loss function is described as follows:

(16)

\mathcal{L}_{CPC}=-\frac{1}{|\mathcal{B}|}\sum_{t_{i}\in\mathcal{B}}\sum_{k=1}^{N_{p}}\log\frac{\exp(s(z_{t_{i}+k},W_{k}c_{t_{i}}))}{\exp(s(z_{t_{i}+k},W_{k}c_{t_{i}}))+\sum_{j\in N_{e}}\exp(s(z_{j},W_{k}c_{t_{i}}))}

where $\mathcal{B}$ is the sample batch, $|\mathcal{B}|$ is the batch size, and $s(\cdot,\cdot)$ measures the similarity between a target and a predicted representation. $N_{e}$ indexes the negative samples of $W_{k}c_{t_{i}}$ , where $j\neq t_{i}+k$ . By minimizing the contrastive loss, the model can extract invariant temporal features from EEG signals and integrate long-term temporal dependencies into the representations, which maximizes the correlation between the representation and the raw EEG signal and preserves critical signal information for EEG-based downstream tasks.
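The CPC objective for a single context window can be sketched as follows. This is an illustrative NumPy version, not the implementation of Banville et al.: the dot product used for $s$, the array shapes, and the function name `cpc_loss` are assumptions, and the encoder outputs are replaced by random vectors.

```python
import numpy as np

def cpc_loss(z_future, c, W, z_neg):
    """InfoNCE-style CPC loss of Eq. (16) for one context window.

    z_future: (N_p, D)    targets of the prediction window
    c:        (H,)        context feature c_{t_i}
    W:        (N_p, D, H) one prediction matrix W_k per step k
    z_neg:    (N_e, D)    negative samples
    The similarity s is assumed to be a dot product.
    """
    loss = 0.0
    for k in range(z_future.shape[0]):
        pred = W[k] @ c                      # prediction W_k c_{t_i}
        pos = z_future[k] @ pred             # similarity with the true target
        neg = z_neg @ pred                   # similarities with the negatives
        logits = np.concatenate(([pos], neg))
        logits -= logits.max()               # numerical stability
        loss += -(logits[0] - np.log(np.exp(logits).sum()))
    return loss

rng = np.random.default_rng(0)
N_p, D, H, N_e = 3, 8, 4, 10                 # hypothetical sizes
z_future = rng.normal(size=(N_p, D))
z_neg = rng.normal(size=(N_e, D))
c = rng.normal(size=H)
loss = cpc_loss(z_future, c, rng.normal(size=(N_p, D, H)), z_neg)
# With an uninformative predictor every candidate is equally likely.
chance = cpc_loss(z_future, c, np.zeros((N_p, D, H)), z_neg)
```

The `chance` value equals $N_{p}\log(N_{e}+1)$, the loss of a predictor that cannot distinguish the true future window from the negatives; training is expected to push the loss below this level.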

EEG-based bidirectional contrastive predictive coding (BCPC) is the extension of CPC to extract bidirectional temporal correlation in EEG signals (Chen et al., 2022b). Unlike CPC, the BCPC method adds an additional backward prediction window in the framework, representing the EEG signal prior to the context window in the time dimension. The contextual feature $c_{t_{i}}$ is used to predict the representation in the prediction window and the backward prediction window to construct the positive and negative pairs. By adding the backward prediction window to introduce the reverse negative and positive sample pairs for contrastive learning, the bidirectional model can capture the contextual features with temporal semantic information from both directions in the EEG signal.

5.2. Transformation Contrastive Learning

The transformation contrastive learning method is inspired by typical contrastive learning frameworks in CV such as SimCLR (Chen et al., 2020) and MoCo (He et al., 2020). EEG signals are augmented into negative and positive sample pairs through signal transformation methods designed according to the characteristics of temporal physiological signals. The framework is shown in Figure 9. Multiple transformation contrastive learning methods have been studied to solve different downstream tasks, and the typical frameworks are listed as follows:

Signal-transformation contrastive emulates the typical framework SimCLR to conduct EEG contrastive learning (Mohsenvand et al., 2020). For a randomly selected EEG sample $x_{t}$ , different transformation methods are employed to generate the augmentations $T_{1}(x_{t})$ and $T_{2}(x_{t})$ . This method leverages the concept that augmentations applied to the same sample yield similar information, forming positive pairs, while augmentations from distinct samples exhibit significant dissimilarity, constituting negative pairs. The encoder $f_{\theta}$ generates representations, and the pretext task decoder $g_{\delta}^{p}$ maps the representations into the loss space to calculate the contrastive loss. For a batch of size $|\mathcal{B}|$ , the loss function is defined as follows:

(17)

\mathcal{L}=-\frac{1}{|\mathcal{B}|}\sum_{t=1}^{|\mathcal{B}|}\log\frac{\exp(sim(z_{t_{1}},z_{t_{2}})/\tau)}{\sum_{i=1}^{2|\mathcal{B}|}\mathbbm{1}_{[i\neq t]}\exp(sim(z_{t_{1}},z_{i})/\tau)}

where $z_{t_{1}}$ and $z_{t_{2}}$ are generated by $g_{\delta}^{p}(f_{\theta}(T_{1}(x_{t})))$ and $g_{\delta}^{p}(f_{\theta}(T_{2}(x_{t})))$ respectively, indicating the representations of augmentations from the same sample. $\mathbbm{1}_{[i\neq t]}\in\{0,1\}$ is the indicator function, and $\tau$ is the temperature parameter. By minimizing this loss function, the model can optimize the representation space to capture discriminative representations. In the EEG analysis domain, EEG augmentation can be performed using the signal transformation methods mentioned in Section 3.3, and other EEG signal augmentation methods are listed as follows:

(1) Cutout & resize divides EEG signals into different segments, and one segment is randomly discarded, representing the "cut out" operation. The remaining segments are then concatenated and resized to the length of the original sample. (2) Crop & resize divides EEG signals into different segments, and one segment is randomly chosen and resized to the length of the original sample. (3) Average filter, regarded as a smoothing operation, replaces some points in the signal with the average value of several neighboring points. (4) Amplitude scaling scales the temporal amplitude of the original EEG signal. The scale value should be between 0.5 and 2, as suggested by prior research (Mohsenvand et al., 2020). (5) Time shift shifts the EEG segments along the time dimension, representing a horizontal offset in temporal sampling. (6) Direct-current shift shifts the EEG segments along the voltage dimension, representing a magnitude offset in temporal sampling. By conducting contrastive learning on these transformation-augmented EEG signals, the model can learn invariant EEG features and understand their latent knowledge.
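The six transformations above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: signals are arrays of shape (channels, time), resizing uses linear interpolation, the average filter smooths every point rather than a subset, the time shift is circular, and all function names and parameter values other than the 0.5 to 2 scaling range are hypothetical.

```python
import numpy as np

def resize(x, length):
    """Linearly interpolate each channel of x (channels, time) to `length` points."""
    old = np.linspace(0.0, 1.0, x.shape[-1])
    new = np.linspace(0.0, 1.0, length)
    return np.stack([np.interp(new, old, ch) for ch in x])

def cutout_resize(x, n_seg, rng):
    segs = np.array_split(x, n_seg, axis=-1)
    drop = rng.integers(n_seg)                   # segment to discard
    kept = [s for i, s in enumerate(segs) if i != drop]
    return resize(np.concatenate(kept, axis=-1), x.shape[-1])

def crop_resize(x, n_seg, rng):
    segs = np.array_split(x, n_seg, axis=-1)
    return resize(segs[rng.integers(n_seg)], x.shape[-1])

def average_filter(x, width=5):
    kernel = np.ones(width) / width              # moving-average kernel
    return np.stack([np.convolve(ch, kernel, mode="same") for ch in x])

def amplitude_scale(x, rng, low=0.5, high=2.0):
    return x * rng.uniform(low, high)            # one scale factor per sample

def time_shift(x, shift):
    return np.roll(x, shift, axis=-1)            # circular shift (an assumption)

def dc_shift(x, offset):
    return x + offset                            # offset along the voltage axis

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 256))                    # 4 channels, 256 time points
views = [cutout_resize(x, 8, rng), crop_resize(x, 8, rng), average_filter(x),
         amplitude_scale(x, rng), time_shift(x, 10), dc_shift(x, 5.0)]
```

Every view keeps the shape of the original sample, so any two of them can be passed through the same encoder to form a positive pair.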

Non-negative EEG contrastive is a contrastive framework without negative samples (Yang et al., 2023). In traditional contrastive learning, the quantity and quality of negative samples play a crucial role in determining the effectiveness of the learned representations. In this framework, $z_{i}$ and $z_{j}$ represent the anchor and its positive sample generated through augmentation. To remove the dependence on negative pairs, this method proposes the world representation $z_{w}$ , which represents the average information of EEG signals, where $z_{w}=E_{k\sim p(\cdot)}[z_{k}]$ is the expectation of a random representation $z_{k}$ drawn from the distribution $p(\cdot)$ . Based on the idea that the similarity between positive pairs should be greater than the similarity between the anchor sample and the world representation, the loss function is designed as follows:

(18)

l(i,j)=s(z_{i},z_{w})+\epsilon-s(z_{i},z_{j})

where $\epsilon$ is the empirical margin, and $s(\cdot,\cdot)$ is the Gaussian kernel that measures the similarity between input representations. By minimizing the loss, the model makes the similarity between the anchor sample and its positive sample greater than the similarity between the anchor sample and the world representation, learning consistent EEG information between samples without human labels. Besides, some EEG analysis methods adopt non-negative contrastive frameworks from CV, such as Barlow Twins (Zbontar et al., 2021) and BYOL (Grill et al., 2020), to conduct EEG-based non-negative contrastive learning, where all the augmented samples form positive pairs with the anchor sample and a well-designed loss function extracts invariant features from positive pairs only. Those methods are also non-negative contrastive frameworks, but without the world representation.
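A small sketch of Eq. (18) follows. The world representation is estimated here as the mean over a bank of sampled representations, which is one possible Monte Carlo reading of the expectation; the kernel bandwidth `sigma`, the margin value, and the names `gaussian_kernel` and `world_loss` are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian-kernel similarity; the bandwidth sigma is an assumed value."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2.0 * sigma ** 2))

def world_loss(z_i, z_j, z_bank, eps=0.2):
    """l(i, j) = s(z_i, z_w) + eps - s(z_i, z_j), following Eq. (18)."""
    z_w = z_bank.mean(axis=0)            # sample estimate of E_{k~p}[z_k]
    return gaussian_kernel(z_i, z_w) + eps - gaussian_kernel(z_i, z_j)

z_i = np.array([1.0, 0.0])               # anchor
z_j = np.array([1.0, 0.0])               # positive identical to the anchor
z_bank = np.array([[-1.0, 0.0], [-1.0, 0.0]])
loss = world_loss(z_i, z_j, z_bank)
```

In this toy case the anchor matches its positive and lies far from the world representation, so the loss is well below the margin.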

5.3. Spatial Contrastive Learning

The spatial contrastive learning method shown in Figure 10 focuses on spatial information and utilizes channel-level spatial augmentation techniques (e.g., jigsaw, meiosis) on EEG signals to construct positive and negative sample pairs, from which the model can integrate efficient spatial features and channel correlation into representation. Typical methods are listed as follows:

Spatial shuffle contrastive method conducts the channel-shuffling technique to construct positive and negative pairs (Li et al., 2022a). In this method, EEG signal $x\in\mathbb{R}^{c\times t}$ is augmented through spatial shuffle: different EEG channels are categorized into different brain regions based on their spatial positions, generating $X^{B}=\{X^{1},X^{2},...,X^{M}\}$ , where $M$ is the number of brain regions and each region in $X^{B}$ contains features from multiple channels. $X^{B}$ is randomly shuffled and reassembled into the augmented EEG sample $X^{*}\in\mathbb{R}^{c\times t}$ . Each sample generates two shuffling augmentations to form positive sample pairs, and shuffling augmentations from different samples form negative pairs. InfoNCE loss described in equation (16) serves as the loss function for model training. The model can understand relationships between spatial channel location and signal features by contrasting the shuffling augmented EEG samples.
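The spatial shuffle augmentation can be sketched as a permutation of channel groups. The grouping of six channels into three regions below is purely hypothetical, as is the function name `region_shuffle`; a real montage would assign channels to brain regions by electrode position.

```python
import numpy as np

def region_shuffle(x, regions, rng):
    """Shuffle the brain regions of x (channels, time) and reassemble X*."""
    order = rng.permutation(len(regions))             # random region order
    idx = np.concatenate([regions[m] for m in order]) # new channel order
    return x[idx]

rng = np.random.default_rng(0)
x = np.arange(6)[:, None] * np.ones((6, 100))  # channel k holds the value k
regions = [np.array([0, 1]), np.array([2, 3, 4]), np.array([5])]  # hypothetical grouping
view_1 = region_shuffle(x, regions, rng)
view_2 = region_shuffle(x, regions, rng)       # second view of the same sample
```

`view_1` and `view_2` come from the same sample and would form a positive pair, while shuffled views of other samples in the batch would serve as negatives.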

Graph contrastive method mines the relationship between channels using the graph structure (Ho and Armanfard, 2023; Ye et al., 2023). In this framework, EEG signals are embedded into node features in the graph, and the edges between nodes are calculated by the channel correlation or the spatial distance. Assuming $\mathcal{G}$ is the generated graph, $\mathcal{V}$ is the node set, and $\mathcal{E}$ is the edge set, two augmentation methods are employed for contrastive learning: (1) Node dropping method. For a sample $\mathcal{G}_{t}$ , two augmented samples $\mathcal{G}^{1}_{t}$ and $\mathcal{G}^{2}_{t}$ are generated by randomly dropping nodes and their edges according to the dropping rate $r\%$ . The augmentations from the same sample form the positive pairs, and those from different samples form the negative pairs. (2) Sub-graph augmentation. For each node $v_{i}$ in the sample $\mathcal{G}_{t}$ , two positive samples and one negative sample are constructed: the random walk with restart algorithm is used to generate the positive sub-graphs $\mathcal{G}^{+}_{i,1}$ and $\mathcal{G}^{+}_{i,2}$ centered at the selected node $v_{i}$ , with the radius parameter $ra$ controlling their size, and to generate the negative sub-graph $\mathcal{G}^{-}_{i}$ centered at the node farthest from the selected node. In the positive sub-graphs, the features of the target nodes are masked with zero to avoid interference from target node information. For the different sub-graphs, the representations are encoded by the trainable weight $W_{e}$ through the GNN, where $re^{+}_{i,1}$ and $re^{+}_{i,2}$ are the representations of the positive samples, and $re^{-}_{i,1}$ is the representation of the negative sample. The embedding of the selected node is also calculated by $W_{e}$ , where $e_{i}=ReLU(v_{i}W_{e})$ . A trainable score matrix $W_{s}$ is then designed to quantify the similarity between the selected node and its sub-graphs, which is described as follows:

(19)

S^{+}_{i,j}=\sigma(e_{i}W_{s}{re}^{+}_{i,j})

where $\sigma$ represents the logistic function. The contrastive loss is designed to maximize the correlation between the embedding of nodes and positive samples, which makes the representation of channels in latent space closer to similar channels. The loss function is defined as:

(20)

\mathcal{L}=-\frac{1}{2c|\mathcal{B}|}\sum_{j=1}^{2}\sum_{i=1}^{c}\left(\log(S^{+}_{i,j})+\log(1-S^{-}_{i,1})\right)

where $c$ is the number of channels and $|\mathcal{B}|$ is the batch size. By minimizing this loss, the model focuses on channel-level spatial features, which confers the model with the robust ability to comprehend high spatial resolution in EEG, leading to superior performance in downstream tasks involving multiple channels and complex channel configurations.
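The score of Eq. (19) and the loss of Eq. (20) can be sketched for a single graph ($|\mathcal{B}|=1$) as below. The sub-graph representations are random placeholders instead of GNN outputs, the negative score is assumed to take the same bilinear form as the positive one, and the function names and tensor shapes are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def subgraph_contrast_loss(e, re_pos, re_neg, W_s):
    """Eq. (19)-(20) for one graph.

    e:      (c, D)    node embeddings e_i
    re_pos: (2, c, D) representations of the two positive sub-graphs
    re_neg: (c, D)    representation of the negative sub-graph
    W_s:    (D, D)    trainable score matrix
    """
    proj = e @ W_s                                           # e_i W_s
    s_pos = sigmoid(np.einsum("id,jid->ji", proj, re_pos))   # S+_{i,j}, shape (2, c)
    s_neg = sigmoid(np.sum(proj * re_neg, axis=-1))          # S-_{i,1}, shape (c,)
    c = e.shape[0]
    return -np.sum(np.log(s_pos) + np.log(1.0 - s_neg)) / (2 * c)

rng = np.random.default_rng(0)
c, D = 5, 16                                   # hypothetical channel count and width
e = rng.normal(size=(c, D))
re_pos = rng.normal(size=(2, c, D))
re_neg = rng.normal(size=(c, D))
loss = subgraph_contrast_loss(e, re_pos, re_neg, 0.1 * rng.normal(size=(D, D)))
untrained = subgraph_contrast_loss(e, re_pos, re_neg, np.zeros((D, D)))
```

With a zero score matrix every score equals 0.5, so `untrained` evaluates to $2\log 2$, the value of an uninformative scorer.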

EEG meiosis contrastive method combines the meiosis augmentation technique with a contrastive learning framework to integrate invariant channel features into the representation (Guo et al., 2023). The meiosis data augmentation technique is used to generate contrastive pairs: two EEG samples are randomly sampled into the group $X^{g}_{i}=\{A_{i},B_{i}\}$ , where the samples $A_{i}$ and $B_{i}$ are in $\mathbb{R}^{c\times t}$ , $c$ is the channel number and $t$ is the number of sampling points. For the group, $A_{i}=\{a_{1},a_{2},a_{3},...,a_{c}\}$ and $B_{i}=\{b_{1},b_{2},b_{3},...,b_{c}\}$ , written channel by channel, are engaged in meiosis with each other, which signifies channel-wise data exchange between $A_{i}$ and $B_{i}$ , generating the augmented samples $V^{1}_{i}=\{a_{1},a_{2},a_{3},...,a_{i},b_{i+1},b_{i+2},...,b_{c}\}$ and $V^{2}_{i}=\{b_{1},b_{2},b_{3},...,b_{i},a_{i+1},a_{i+2},...,a_{c}\}$ . EEG samples $A_{i}$ and $B_{i}$ are recorded under the same stimulus/event to increase the contrasting complexity. All training samples are augmented and transformed into $V^{1}_{i}$ and $V^{2}_{i}$ for contrast, and the sample feature representation is generated through the encoder $f_{\theta}$ and the projector (pretext task decoder) $g_{\delta}^{p}$ by $z^{1}_{i}=g^{p}_{\delta}(f_{\theta}(V^{1}_{i}))$ . In the framework, $z^{1}_{i}$ and $z^{2}_{i}$ form the positive pair, indicating samples that exchanged EEG signals with each other, while $z^{1}_{i}$ and $z^{2}_{j}$ form the negative pair ( $i\neq j$ ). The loss function is then defined as follows:

(21)

L=-\frac{1}{2}\left(\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\log\frac{\exp(s(z^{1}_{i},z^{2}_{i})/\tau)}{\sum_{j=1}^{|\mathcal{B}|}\mathbbm{1}_{[j\neq i]}\exp(s(z^{1}_{i},z^{1}_{j})/\tau)+\sum_{j=1}^{|\mathcal{B}|}\exp(s(z^{1}_{i},z^{2}_{j})/\tau)}+\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\log\frac{\exp(s(z^{1}_{i},z^{2}_{i})/\tau)}{\sum_{j=1}^{|\mathcal{B}|}\mathbbm{1}_{[j\neq i]}\exp(s(z^{2}_{i},z^{2}_{j})/\tau)+\sum_{j=1}^{|\mathcal{B}|}\exp(s(z^{1}_{j},z^{2}_{i})/\tau)}\right)

where $s(\cdot,\cdot)$ is the function to measure the similarity between representations, and $\mathbbm{1}_{[j\neq i]}\in\{0,1\}$ is the indicator function that equals 0 when $i=j$ . The proposed contrastive loss aims to minimize the distance between mutually coupled sample pairs $(V^{1}_{i},V^{2}_{i})$ , and maximize the distance between other sample pairs without mutual coupling: $(V^{1}_{i},V^{1}_{j})$ , $(V^{2}_{i},V^{2}_{j})$ , and $(V^{1}_{i},V^{2}_{j})$ , where $i\neq j$ . By minimizing the loss function, the model is trained to comprehend specific and coherent channel features and can discriminate homologous EEG channel data, which can be regarded as the model capturing the EEG channel distribution knowledge, proving highly beneficial for EEG-based tasks.
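The meiosis augmentation can be sketched as a channel-wise crossover between two samples. The sketch assumes that the exchange happens along the channel axis at a single split position, which is one reading of the definitions of $V^{1}_{i}$ and $V^{2}_{i}$; the function name `meiosis` and the toy sizes are hypothetical.

```python
import numpy as np

def meiosis(A, B, split):
    """Exchange the channels after index `split` between two (channels, time) samples."""
    V1 = np.concatenate([A[:split], B[split:]], axis=0)  # a_1..a_i, b_{i+1}..b_c
    V2 = np.concatenate([B[:split], A[split:]], axis=0)  # b_1..b_i, a_{i+1}..a_c
    return V1, V2

A = np.zeros((8, 128))        # toy sample A: 8 channels of zeros
B = np.ones((8, 128))         # toy sample B: 8 channels of ones
V1, V2 = meiosis(A, B, split=3)
```

`V1` and `V2` together contain every channel of both source samples exactly once, which is why the pair is treated as mutually coupled and used as the positive pair.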

5.4. Composite Contrastive Learning

Composite contrastive learning is a more complex framework that augments EEG signals in multiple views or domains and conducts cross-view and cross-domain contrastive learning to extract more expressive and complex representations integrating specific signal knowledge. Figure 11 shows an example of the typical framework, and existing composite EEG contrastive learning frameworks are listed as follows:

Frequency-temporal contrastive method conducts contrastive learning in the temporal and frequency domains. Two different frequency-temporal contrastive strategies have been investigated:

(1) Complementary strategy conducts cross-view contrastive learning to avoid ignoring the complementary information in different views (Kumar et al., 2022). EEG signal $x_{i}$ is augmented into $x_{i,1}$ and $x_{i,2}$ through signal transformations, which are then mapped into the temporal and spectral domains independently, generating temporal components $x^{t}_{i,1}$ , $x^{t}_{i,2}$ and spectrum components $x^{s}_{i,1}$ , $x^{s}_{i,2}$ . These components are subsequently processed through the temporal encoder $f_{\theta}^{t}$ and spectrum encoder $f_{\theta}^{s}$ to construct the representations $z^{s}_{i,1}$ and $z^{t}_{i,1}$ from augmentation $x_{i,1}$ , and $z^{s}_{i,2}$ and $z^{t}_{i,2}$ from augmentation $x_{i,2}$ . Four losses are combined to train the model: 1. Temporal contrastive loss, denoted as $\mathcal{L}_{tt}$ . The temporal representations generated from the same sample form positive pairs $\{z^{t}_{i,1},z^{t}_{i,2}\}$ , and those from different samples form negative pairs $\{z^{t}_{i,1},z^{t}_{j,2}\},i\neq j$ , with InfoNCE serving as the loss function. 2. Spectrum contrastive loss, denoted as $\mathcal{L}_{ss}$ , calculated from the spectral representations in the same way as the temporal contrastive loss. 3. Mixing contrastive loss, denoted as $\mathcal{L}_{gg}$ . The spectrum and temporal representations are concatenated to form the mixing representation, where $z^{g}_{i,1}=cat(z^{t}_{i,1},z^{s}_{i,1})$ and $cat$ represents the concatenation operation. This loss can be calculated similarly to the first two losses. 4. Complementary loss, denoted as $\mathcal{L}_{D}$ . The above losses may narrow the distance between representations, losing complementary features in each view. Therefore, the complementary loss is designed to pull corresponding augmented samples in the same view closer while pushing away the corresponding augmented samples in different views. Assuming $z_{i}=\{z^{t}_{i,1},z^{t}_{i,2},z^{s}_{i,1},z^{s}_{i,2}\}$ , the complementary loss is defined as:

(22)

\left\{\begin{aligned} &l_{d}(z_{i},j,k)=-\log\frac{\exp(s(z_{i}[j],z_{i}[k])/\tau)}{\sum_{q=1}^{4}\mathbbm{1}_{[q\neq j]}\exp(s(z_{i}[j],z_{i}[q])/\tau)}\\ &\mathcal{L}_{D}=\frac{1}{4|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\left[l_{d}(z_{i},1,2)+l_{d}(z_{i},2,1)+l_{d}(z_{i},3,4)+l_{d}(z_{i},4,3)\right]\end{aligned}\right.

where $s(\cdot,\cdot)$ is the similarity function. The loss functions are combined to train the model to extract multi-domain features while preserving the domain-specific and complementary features, which can be described as follows:

(23)

\mathcal{L}_{con}=\lambda_{1}(\mathcal{L}_{tt}+\mathcal{L}_{ss}+\mathcal{L}_{gg})+\lambda_{2}\mathcal{L}_{D}

where $\lambda_{1}$ and $\lambda_{2}$ are the hyperparameters to balance different contrastive losses.

(2) Consistent strategy focuses on extracting consistent information between temporal and frequency representations through contrastive learning (Zhang et al., 2022c). Different from the complementary strategy, this strategy aims to maximize the mutual information between temporal and frequency representation to align different representations in a latent feature space to extract multi-domain coherent features. The consistent loss function is described as follows:

(24)

\mathcal{L}_{c}=\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\sum_{Sim^{*}}(Sim^{t_{1},f_{1}}-Sim^{*}+\delta),Sim^{*}\in\{Sim^{t_{1},f_{2}},Sim^{t_{2},f_{1}},Sim^{t_{2},f_{2}}\}

where $\delta$ is the hyperparameter. In this loss function, the $Sim^{t_{1},f_{1}}_{i}=d(z^{t}_{i,1},z^{f}_{i,1})$ is defined to measure the representation similarity, where $z^{t}_{i,1}$ and $z^{f}_{i,1}$ are the temporal and frequency representations generated by sample $x_{i}$ , and $z^{t}_{i,2}$ and $z^{f}_{i,2}$ are generated by augmented sample $x^{*}_{i}$ . By minimizing this loss, the frequency and temporal representations can be pulled closer for a sample in the latent space to mine for multi-domain consistent features.
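Eq. (24) can be sketched in a few lines. Here $d$ is treated as a Euclidean distance between representations, so that minimizing the loss pulls the temporal and frequency representations of the same sample together; this choice, the value of $\delta$, and the function name `consistency_loss` are assumptions made for illustration.

```python
import numpy as np

def consistency_loss(z_t1, z_f1, z_t2, z_f2, delta=1.0):
    """Eq. (24) over a batch; every input has shape (B, D)."""
    d = lambda a, b: np.linalg.norm(a - b, axis=-1)   # assumed distance d
    sim_ref = d(z_t1, z_f1)                           # Sim^{t1,f1}
    others = [d(z_t1, z_f2), d(z_t2, z_f1), d(z_t2, z_f2)]
    return np.mean(sum(sim_ref - s + delta for s in others))

zeros = np.zeros((2, 3))
ones = np.ones((2, 3))
collapsed = consistency_loss(zeros, zeros, zeros, zeros)  # all distances are 0
aligned = consistency_loss(zeros, zeros, ones, ones)      # t1 and f1 coincide
```

When all representations coincide, each of the three terms contributes only the margin, so `collapsed` equals $3\delta$; the `aligned` case gives a lower value because the reference pair is closer than two of the three comparison pairs.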

Multi-view CPC extends CPC from the single view to multiple views for exploring complex EEG features (Eldele et al., 2021). In this method, the weak and strong augmentation methods are designed to construct two views: jitter-and-scale strategy is used to construct weak augmentations $x^{1}_{i}$ , while permutation-and-jitter generates complex strong augmentation $x^{2}_{i}$ . According to the definition of CPC, the integrated features $c^{1}_{i}$ and $c^{2}_{i}$ of context windows from two views are generated. Cross-view prediction strategy is implemented, where $c^{1}_{i}$ is used to predict future windows $z^{2}_{i+1}$ and $c^{2}_{i}$ is used to predict future windows $z^{1}_{i+1}$ , generating CPC losses $\mathcal{L}_{1,2}$ and $\mathcal{L}_{2,1}$ . Besides, the cross-view contextual contrastive strategy is designed to extract discriminative features: $c_{i}^{1}$ and $c_{i}^{2}$ generated from the same sample but in different views form the positive pairs, while other representations form negative pairs. For the samples in batch $\mathcal{B}$ , one given sample can construct 1 positive pair and $2|\mathcal{B}|-2$ negative pairs. The loss function is defined as follows:

(25)

\mathcal{L}_{CC}=-\sum_{i=1}^{|\mathcal{B}|}\log\frac{\exp(s(c_{i}^{1},c_{i}^{2})/\tau)}{\sum_{j=1}^{|\mathcal{B}|}\sum_{q=1}^{2}\mathbbm{1}_{[j\neq i]}\exp(s(c_{i}^{1},c_{j}^{q})/\tau)}

Different loss functions are combined as $\mathcal{L}=\lambda_{1}(\mathcal{L}_{1,2}+\mathcal{L}_{2,1})+\lambda_{2}\mathcal{L}_{CC}$ through the weights $\lambda_{1}$ and $\lambda_{2}$ to balance different losses. The multi-view CPC method can mine complex temporal features and understand the aligned representation.
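The cross-view contextual contrast can be sketched as below. The sketch assumes cosine similarity for $s$, applies the logarithm of the usual InfoNCE form to the ratio, and excludes the anchor's own index from the denominator so that each anchor sees $2|\mathcal{B}|-2$ negatives; the temperature value and the function name are also assumptions.

```python
import numpy as np

def contextual_contrast_loss(c1, c2, tau=0.2):
    """Cross-view contextual contrast; c1, c2 are (B, D) context features."""
    def normalize(a):
        return a / np.linalg.norm(a, axis=-1, keepdims=True)
    c1, c2 = normalize(c1), normalize(c2)          # cosine similarity (assumed)
    B = c1.shape[0]
    pos = np.exp(np.sum(c1 * c2, axis=-1) / tau)   # same sample, two views
    sim_11 = np.exp(c1 @ c1.T / tau)               # anchor vs. view-1 contexts
    sim_12 = np.exp(c1 @ c2.T / tau)               # anchor vs. view-2 contexts
    mask = 1.0 - np.eye(B)                         # indicator 1[j != i]
    neg = ((sim_11 + sim_12) * mask).sum(axis=1)   # 2B - 2 negatives per anchor
    return -np.sum(np.log(pos / neg))

# Four mutually orthogonal contexts that agree perfectly across the two views.
c = np.eye(4)
loss = contextual_contrast_loss(c, c)
```

Because the positive term is not part of the denominator in this form, the value can become negative when the two views agree well, as in the toy case above.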

Multi-level contrastive method conducts contrastive learning at multiple levels to capture complex signal features (Zhang et al., 2022a). In this framework, each EEG sample is divided into $n_{l}$ segments: $X_{i}=\{x_{1},x_{2},x_{3},...,x_{n_{l}}\}$ , with a CNN encoder $f_{\theta}^{1}$ to generate the local representation $Z_{i}=\{z_{1},z_{2},z_{3},...,z_{n_{l}}\}$ and a transformer encoder $f_{\theta}^{2}$ to generate the contextual representation $R_{i}=\{r_{1},r_{1+k},r_{1+2k}...\}$ , where each contextual representation $r_{j}$ is generated by integrating $k$ neighboring local representations. The positive and negative sample pairs are constructed according to a filter based on EEG rhythm rules. InfoNCE loss is used at the local and contextual levels to extract multi-granular features. The fusion of multi-level contrastive learning integrates different aspects of the temporal features of EEG signals into the representation, making it more efficient and expressive.

Scalp-dipole neural contrastive is a knowledge-based cross-view contrastive method to generate general neural representations (Weng et al., 2023). In this framework, two views are constructed according to the neural source (Jackson and Bolger, 2014) of EEG signal $x_{i}$ : the scalp view $sc_{i}$ is constructed as a spatial matrix, indicating the distribution of EEG voltage across the scalp; the dipole view $dp_{i}$ is constructed as an undirected graph, indicating the inner correlation of dipoles (activated pyramidal cells) that produce the EEG signals. EEG signals are augmented through mask and jigsaw operations, and the augmentations are also transformed into the two views. The CNN encoder $f_{\theta}^{c}$ and graph convolutional encoder $f_{\theta}^{g}$ are designed to generate scalp and dipole representations from the different views, and two contrastive strategies are proposed: (1) Inner-view contrastive aims to extract invariant features in each view. Augmented samples in a specific view are considered as positive pairs, and the Barlow Twins loss is implemented to minimize their distance in the representation space and capture invariant information between augmented samples; (2) Cross-view contrastive is based on the theory that EEG representations for different views are homogeneous and contain similar neural information. Therefore, augmentations in the two views generated by the same sample construct positive pairs, and augmentations in the two views generated by different samples construct negative pairs, with InfoNCE loss to train the model. Combining the inner-view and cross-view losses can extract both view-specific features and latent neural information, generating general representations effective for different EEG-based tasks.

5.5. Task-oriented EEG Contrastive Learning

Task-oriented contrastive learning is an idiosyncratic framework set up to solve specific tasks. Some specific contrastive frameworks are designed for particular tasks: (1) Image-EEG contrastive (Song et al., 2023), where an image and the corresponding EEG signals elicited by viewing this image form positive pairs, and the image with other EEG signals forms negative pairs. The process of image-EEG contrastive learning is shown in Figure 12. The mutual information between the image and its corresponding EEG signal is maximized by image-EEG contrastive learning, which can solve the task of EEG image decoding. (2) Speech-EEG contrastive learning is inspired by the contrastive language-image pre-training (CLIP) (Radford et al., 2021) method and forms EEG-speech sample pairs to extract their correlation and solve the EEG speech decoding task (Défossez et al., 2023). (3) Cross-subject contrastive learning aims to address the individual variability issue in EEG signals. For example, age contrastive learning selects anchor samples from different age groups; samples with a slight difference in age from the anchor sample are used as positive samples to construct positive pairs, while those with a large difference in age from the anchor are used to form negative pairs. By contrasting negative and positive pairs, the model can capture the age-related brain features to improve the generalizability of the generated representations (Wagh et al., 2021).

Table 3. Summary of contrastive-based SSL methods for EEG analysis. "CPC" represents contrastive predictive coding, "CL" represents contrastive learning, "Tfm" represents the transformer model, "PT" represents pre-training and fine-tuning mode, "UT" represents unsupervised training mode, and "JT" represents joint-training mode.

Approach	Sub-category	Detailed method	Backbone	Downstream Tasks	Training Mode
Clinical EEG SSL (Banville et al., 2021)	Contrastive predictive coding	EEG-based CPC	CNN	Sleep & pathology classification	PT
SSL for EEG (Banville et al., 2019)	Contrastive predictive coding	EEG-based CPC	CNN	Sleep classification	PT
ContrastWR(Yang et al., 2023)	Transformation-based contrastive	Non-negative CL	CNN	Sleep & pathology classification	UT
SSCL for EEG(Jiang et al., 2021)	Transformation-based contrastive	Signal transformation CL	CNN	Sleep classification	PT&UT
EEG-CGS(Ho and Armanfard, 2023)	Spatial contrastive	Graph-based CL	GNN	Seizure analysis	PT
GMSS (Li et al., 2022a)	Spatial contrastive	Spatial shuffle CL	GNN	Emotion recognition	PT&UT&JT
SleepDPC (Xiao et al., 2021)	Contrastive predictive coding	EEG-based CPC	CNN&LSTM	Sleep classification	UT
Seq-SimCLR (Mohsenvand et al., 2020)	Transformation-based contrastive	Signal transformation CL	CNN&GRU	Multiple tasks	PT&UT
Domain-guide CL(Wagh et al., 2021)	Task-oriented contrastive	Cross-subject CL	CNN	Multiple tasks	PT
Multivariate CL(Brüsch et al., 2023)	Spatial contrastive	Graph-based CL	GNN	Sleep classification	PT
DS-AGC(Ye et al., 2023)	Spatial contrastive	Graph-based CL	GNN	Emotion recognition	PT
ME-MHAC(Guo et al., 2023)	Spatial contrastive	Meiosis-based CL	CNN	Emotion recognition	PT
MBrain(Cai et al., 2023)	Contrastive predictive coding	EEG-based CPC	CNN&LSTM	Seizure detection	PT
BrainNet(Chen et al., 2022b)	Contrastive predictive coding	Bidirectional CPC	GNN	Seizure detection	JT
SleepECL(Zhang et al., 2022a)	Composite contrastive	Multi-level CL	Transformer	Sleep classification	UT
TS-TCC(Eldele et al., 2021)	Composite contrastive	Multi-view CPC	Transformer	Sleep & seizure detection	PT&UT
CoSleep(Ye et al., 2021)	Contrastive predictive coding	EEG-based CPC	CNN	Sleep classification	UT
SLAM-EEG(Xiao et al., 2024)	Transformation-based contrastive	Transformation-based CL	ViT	Seizure detection	PT
SPP-EEGNET(Li and Metsis, 2022)	Transformation-based contrastive	Transformation-based CL	CNN	Multiple tasks	PT
DSSNet(Chang et al., 2022)	Contrastive predictive coding	EEG-based CPC	CNN&RNN	Sleep classification	UT
TF-C(Zhang et al., 2022c)	Composite contrastive	Frequency-temporal CL	CNN&Tfm	Sleep classification	UT
TS-MoCo(Hallgarten et al., 2023)	Transformation-based contrastive	Transformation-based CL	Transformer	Emotion recognition	PT&UT
MV-EEG(Hojjati, 2023)	Composite contrastive	Frequency-temporal CL	Transformer	Pathology detection	UT
PSN-Sleep(You et al., 2023)	Transformation-based contrastive	Non-negative CL	CNN	Sleep classification	UT
MulEEG(Kumar et al., 2022)	Composite contrastive	Frequency-temporal CL	CNN	Sleep classification	UT
MI-SSLEEG(Han et al., 2021)	Transformation-based contrastive	Transformation-based CL	CNN	Motor imagery	JT
SA-EEG(Cheng et al., 2020)	Transformation-based contrastive	Transformation-based CL	CNN	Motor imagery	UT
MtCLSS(Wang et al., 2023)	Transformation-based contrastive	Transformation-based CL	CNN	Sleep classification	UT
Multi-channel CL(Gao et al., [n. d.])	Transformation-based contrastive	Transformation-based CL	CNN	Sleep & pathology classification	UT
SGMC (Kan et al., 2023)	Spatial contrastive	Meiosis-based CL	CNN	Emotion recognition	PT
CLISA (Shen et al., 2022)	Task-oriented contrastive	Cross-subject CL	CNN	Emotion recognition	UT
KDC(Weng et al., 2023)	Composite contrastive	Scalp-dipole neural CL	CNN&GNN	Multiple tasks	PT&UT
NICE-EEG(Song et al., 2023)	Task-oriented contrastive	Image-signal CL	ViT&GNN	Image-decoding	UT
AAD(Défossez et al., 2023)	Task-oriented contrastive	Speech-signal CL	CNN&LSTM	Speech-decoding	UT

5.6. Section Discussion

In this section, various contrastive-based EEG analysis methods are comprehensively reviewed. The contrastive-based frameworks are categorized into five sub-categories: 1. Contrastive predictive coding, which integrates the prediction and contrastive tasks to capture temporal information. 2. Transformation contrastive learning, which extracts signal-related invariant features. 3. Spatial contrastive learning, which captures spatial channel correlation. 4. Composite contrastive learning, which conducts multi-view contrastive learning to extract spatial-temporal-spectral features. 5. Task-oriented contrastive learning, which constructs specialized frameworks for specific tasks. Compared to other SSL methods for EEG analysis, contrastive-based methods are the most effective, using fewer parameters and simpler pretext tasks to generate representations with higher generalization and information density. Contrastive methods rely on the augmentation techniques: well-designed sample pairs help the model integrate critical neural knowledge, whereas arbitrarily chosen sample pairs may yield counterproductive results.

6. Hybrid SSL EEG analysis method

The hybrid SSL EEG analysis method combines several pretext tasks to jointly train the model so that it learns complex knowledge. The idea of multi-task learning (Zhang and Yang, 2021; Kendall et al., 2018) is applied in hybrid SSL methods: a common encoder $f_{\theta}$ extracts features and generates representations from the EEG signal, while different pretext task decoders $\{g_{\delta}^{p_{1}},g_{\delta}^{p_{2}},...\}$ solve the individual pretext tasks. The losses from the different tasks are fused to train the model, so that the shared encoder can leverage the advantages of each task and obtain representations that encompass more knowledge and exhibit stronger expressive capability. The combination of the $t_{n}$ task losses $\mathcal{L}_{i}$ with weights $\lambda_{i}$ can be described as follows:

(26)

\mathcal{L}_{mt}=\sum_{i=1}^{t_{n}}\lambda_{i}\mathcal{L}_{i}
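As a minimal sketch of Eq. (26), the weighted fusion of pretext-task losses can be written as below. The function name, the two example tasks, and the loss and weight values are hypothetical; in practice each loss would come from its own decoder on top of the shared encoder.

```python
from typing import Sequence


def multitask_loss(task_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of pretext-task losses: L_mt = sum_i lambda_i * L_i."""
    if len(task_losses) != len(weights):
        raise ValueError("one weight is needed per pretext task")
    return sum(w * loss for w, loss in zip(weights, task_losses))


# Hypothetical per-batch losses of a predictive and a contrastive pretext task.
losses = [0.5, 2.0]
lambdas = [1.0, 0.25]
total = multitask_loss(losses, lambdas)  # 1.0 * 0.5 + 0.25 * 2.0
```

The weights are treated here as fixed hyperparameters. They can also be learned, for example with uncertainty-based weighting (Kendall et al., 2018).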

Table 4 shows the existing hybrid EEG SSL methods. Existing studies use different combinations of pretext tasks to generate representations. Many methods combine predictive and contrastive tasks (Banville et al., 2021, 2019), where the decoders predict the transformations and construct negative and positive pairs for contrastive learning to capture critical discriminative information and invariant features. Another method combines generative and contrastive tasks (Ho and Armanfard, 2023) to explore the local correlations and globally coherent features of the EEG signal. Although hybrid SSL methods can capture complex features through multiple tasks, the gradient interference caused by training several tasks simultaneously may reduce the effectiveness of the generated representation. Therefore, these methods require careful selection of correlated pretext tasks to avoid interference between tasks.

Table 4. Summary of hybrid self-supervised learning methods for EEG analysis.

Approach	Pretext-category	Backbone	Downstream Tasks	Training Mode
Clinical EEG SSL(Banville et al., 2021)	Predictive task/contrastive task	CNN	Sleep and pathology classification	PT
SSL for sleep EEG (Banville et al., 2019)	Predictive task/contrastive task	CNN	Sleep and pathology classification	PT
EEG-CGS (Ho and Armanfard, 2023)	Generative task/contrastive task	CNN	Seizure analysis	PT
GMSS(Li et al., 2022a)	Predictive task/contrastive task	GNN	Emotion recognition	PT&UT
MBrain(Cai et al., 2023)	Predictive task/contrastive task	CNN&LSTM	Seizure detection	PT
MtCLSS(Li et al., 2022b)	Predictive task/contrastive task	CNN	Sleep classification	UT

7. Practical downstream tasks

SSL EEG analysis methods have been applied to various EEG-based tasks. Table 5 lists the EEG-based downstream tasks and related datasets. The practical downstream tasks are as follows:

Emotion recognition is the task of decoding emotional states from EEG signals collected by non-invasive electrodes. Traditional emotion recognition methods combine machine learning with hand-crafted features to predict discrete emotions from EEG, while recent methods use end-to-end deep models to predict continuous emotion scores (Weng et al., 2022). The labels of the training samples are derived from subjective rating scales or from the type of stimuli that elicited the signals, which may introduce significant bias into model training. Combining SSL with the emotion recognition task can mitigate the label shift issue, and the learned representation can improve task performance in low-label scenarios.

Motor imagery is the task of decoding the mental simulation of a movement that is not physically performed (Tangermann et al., 2012). This task involves mentally rehearsing or imagining a specific motor action, such as moving the left limb, moving the right limb, or executing a complex physical activity. The decoded imagined patterns can be used as control signals in a brain-computer interface (BCI), for example to control an exoskeleton for the disabled (Choi et al., 2020). EEG-based motor imagery recognition methods have been widely investigated. The challenges in motor imagery lie in difficult labeling and significant subject variability, which can be addressed by combining different pretext tasks in SSL.

Pathology detection is one of the most crucial clinical tasks for EEG-based applications. This task aims to recognize mental or neural diseases of the brain from EEG signals. Deep models are used to detect seizures, autism spectrum disorder, and other disorders from EEG signals (Chen et al., 2023). However, clinical applications of EEG demand high-density training data and expert knowledge to label the samples, which introduces substantial challenges in data collection. The SSL framework can reduce the number of labeled EEG samples required and can incorporate medical knowledge through pretext tasks, which is significant for the development and improvement of EEG-based clinical detection.

Table 5. Summary of the datasets that have been used in SSL EEG analysis, where the symbol ’-’ denotes information that is missing for the dataset.

Dataset	Subject number	Sampling rates	EEG channels	Task	Label	Auxiliary data
Physionet Challenge 2018 (Ghassemi et al., 2018; Goldberger et al., 2000)	1983	200 Hz	6	Sleep classification	Wake,N1,N2,N3,REM	EMG,EOG etc.
TUH abnormal (López et al., 2017)	2329	250,256,512 Hz	27 to 36	Abnormal detection	Normal,Abnormal	-
Sleep EDFx (PhysioBank, 2000; Kemp et al., 2000)	83	100 Hz	2	Sleep classification	Wake,N1,N2,N3,REM	Breathe,ERP
MASS (O’reilly et al., 2014)	62	256 Hz	20	Sleep classification	Wake,N1,N2,N3,REM	EOG,EMG,ECG
MMI (Schalk et al., 2004)	105	160 Hz	64	Motor imagery	Rest, MI(left), MI(right)	-
BCIC (Tangermann et al., 2012)	9	250 Hz	22	Motor imagery	MI(l),MI(r),MI(f),MI(t)	EOG
Mayo-UPenn Seizure Dataset (Temko et al., 2015)	4	400 Hz	16	Seizure detection	Normal,Abnormal	Dog signal
SHHS dataset (Quan et al., 1997)	-	125 Hz	14	Sleep classification	Wake,N1,N2,N3,REM	EOG,Heart
MGH Sleep (Biswal et al., 2018)	-	200 Hz	6	Sleep classification	Wake,N1,N2,N3,REM	-
Sleep EDF (PhysioBank, 2000; Kemp et al., 2000)	20	100 Hz	2	Sleep classification	Wake,N1,N2,N3,REM	EOG,EMG,ERP
Dreem Open Dataset (Goldberger et al., 2000)	80	250 Hz	8,12	Sleep classification	Wake,N1,N2,N3,REM	-
DEAP (Koelstra et al., 2011)	32	512 Hz	32	Emotion recognition	Arousal,Valence,Dominance	Video,EOG,EMG
TUSZ (Shah et al., 2018)	over 300	-	19	Seizure detection	Different seizure types	-
SEED (Zheng and Lu, 2015; Duan et al., 2013)	15	200 Hz	62	Emotion recognition	Negative,Neutral,Positive	-
SEED-IV (Zheng et al., 2018)	15	200 Hz	62	Emotion recognition	Happy,Neutral,Sad,Fear	-
MPED (Song et al., 2019)	23	1000 Hz	62	Emotion recognition	Different discrete emotions	ECG,ESR,RSP
KU-MI (Lee et al., 2019)	52	1000 Hz	62	Motor imagery	MI(left),MI(right)	EMG
ISRUC (Khalighi et al., 2016)	100,8,10	200 Hz	6	Sleep classification	Wake,N1,N2,N3,REM	Multiple signals
SparrKULee (Bollens et al., 2023)	85	8192 Hz	64	Speech decoding	Speech signal	-
CHB-MIT (Shoeb, 2009)	24	256 Hz	24-26	Seizure detection	Seizure,Non-seizure	-
MPI-LEMON (Babayan et al., 2018)	216	2500 Hz	62	None	Resting states	MRI,ECG etc.
Visual object (Gifford et al., 2022)	10	1000 Hz	64	Image decoding	Object label	-
MAHNOB-HCI (Soleymani et al., 2011)	27	256 Hz	32	Emotion recognition	Arousal,Valence,Dominance	Multiple signals
SEEG (Chen et al., 2022b)	-	1000 or 2000 Hz	52 to 124	Seizure detection	Seizure,Non-seizure	ECoG
Epilepsy Dataset (Andrzejak et al., 2001)	500	173 Hz	19	Seizure detection	Five seizure labels	-
AMIGOS (Miranda-Correa et al., 2018)	40	128 Hz	14	Emotion recognition	Arousal,Valence,Dominance	ECG,GSR
DREAMER (Katsigiannis and Ramzan, 2017)	23	128 Hz	14	Emotion recognition	Arousal,Valence,Dominance	ECG
ASD dataset (Chen et al., 2023)	4899	250 Hz	20 to 129	AS-Disorder	ASD labels	-
NMT scalp dataset (Khan et al., 2022)	-	250 Hz	19	Pathology detection	Normal,Abnormal	-
CUHZ (Peng et al., 2022)	25	500,698,1000 Hz	22	Seizure detection	Different seizure types	-

Sleep stage classification is the task of classifying sleep EEG signals into different stages. The criteria for sleep stage classification are proposed by the American Academy of Sleep Medicine (AASM) and divide sleep EEG signals into five stages (Dement and Kleitman, 1957): the W stage is the wake stage, the N1 and N2 stages (non-REM stages) are light sleep, the N3 stage is deep sleep, and the REM stage is Rapid Eye Movement sleep (Aserinsky and Kleitman, 1953). Temporal models are used to capture the temporal correlations and differences between sleep stages in order to accurately identify the stage to which an EEG sample belongs.

Speech/Image decoding is the complex task of decoding image or speech information from EEG signals. This task involves translating brain activity patterns recorded by EEG into meaningful visual and speech information, which can help to understand the neural mechanisms of vision and audition in the brain (Song et al., 2023). Inspired by the visual question answering approach (Radford et al., 2021) that aligns text and image patches through SSL to extract semantic information, SSL can align EEG signals with image or speech patches to capture their inner correlation and improve task performance.
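The alignment idea can be illustrated with a symmetric contrastive loss over paired embeddings. This is a simplified sketch under stated assumptions: the embeddings are taken as already computed by hypothetical EEG and image encoders, the function names and the temperature value are illustrative, and the code is not the exact objective of any method reviewed here.

```python
import math
from typing import List

Vector = List[float]


def _normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def alignment_loss(eeg: List[Vector], img: List[Vector], temperature: float = 0.1) -> float:
    """Symmetric contrastive loss: the i-th EEG and i-th image embedding are a positive pair."""
    eeg_n = [_normalize(v) for v in eeg]
    img_n = [_normalize(v) for v in img]
    # Cosine similarity between every EEG and every image embedding, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(e, g)) / temperature for g in img_n] for e in eeg_n]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


aligned = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is small when each EEG embedding matches its paired image embedding and large when the pairs are swapped, which is the signal that pulls the two modalities into a shared space.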

For the practical downstream tasks mentioned above, corresponding datasets have been proposed to train the models. For the emotion recognition and motor imagery tasks, existing datasets such as SEED (Zheng and Lu, 2015) and MMI (Schalk et al., 2004) contain EEG signals with more than 30 channels, from which fine-grained spatial correlations can be extracted. In contrast, the datasets for the sleep stage classification task contain fewer channels but longer time windows, so temporal information is critical for sleep classification. In addition, the pathology datasets contain more subjects to ensure that general EEG features can be extracted for clinical application.

8. Future directions

In the EEG analysis field, combining deep models with SSL frameworks can improve model performance on various EEG-based tasks through additional training on unlabeled EEG samples with well-designed pretext tasks. Beyond these advantages, we analyze the challenges in the existing EEG-based SSL studies and propose potential future directions to address them.

Signal-oriented pretext task. Most existing pretext tasks are straightforward extensions of pretext tasks in CV and NLP. They treat EEG signals as 2D matrices or temporal vectors, analogous to image or text patches, to capture spatial and contextual correlations, but ignore the intrinsic characteristics of the EEG signal. Therefore, designing EEG-oriented pretext tasks that extract spatial-temporal-frequency EEG features is a feasible approach worth further exploration.

Knowledge-driven SSL framework. Although SSL frameworks have achieved significant success in various EEG-based tasks, the lack of a theoretical foundation and of neural knowledge about EEG signals leaves the generated representations short of generalization and interpretability. Therefore, integrating EEG-based neural knowledge with the SSL framework to construct knowledge-driven, interpretable EEG models is another important direction. It requires designing specific pretext tasks and augmentation techniques that can fuse explainable neural knowledge into the representation. We believe that integrating knowledge of EEG into the self-supervised framework can bring generalization and interpretability to the representations.

Graph-based SSL. Deep learning models such as CNNs and RNNs have been widely used to extract spatial-temporal features from EEG signals for different tasks. However, most existing methods ignore the inherent topological connections among electrodes. EEG signals are generated by the activity of neurons that are topologically connected inside the brain. Graph neural networks can explore the inherent connectivity patterns among neurons, and we believe that GNN-based EEG SSL methods can integrate richer latent brain information into representations, offering a new perspective for information expression.
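As an illustration of how electrode topology could enter a model, the sketch below builds an adjacency matrix from electrode coordinates and applies one neighbourhood-averaging step, the basic operation behind many graph convolutions. The 2D coordinates, the distance threshold, and the function names are hypothetical choices; real montages use 3D positions, and connectivity may instead be learned or derived from functional measures.

```python
import math
from typing import Dict, List, Tuple


def build_adjacency(positions: Dict[str, Tuple[float, float]], radius: float) -> List[List[int]]:
    """Connect electrodes closer than `radius`; every node also gets a self-loop."""
    names = list(positions)
    n = len(names)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or math.dist(positions[names[i]], positions[names[j]]) <= radius:
                adj[i][j] = 1
    return adj


def aggregate(adj: List[List[int]], features: List[List[float]]) -> List[List[float]]:
    """Average each node's features over itself and its neighbours."""
    dim = len(features[0])
    out = []
    for row in adj:
        idx = [j for j, connected in enumerate(row) if connected]
        out.append([sum(features[j][d] for j in idx) / len(idx) for d in range(dim)])
    return out


# Hypothetical 2D coordinates for three central electrodes.
positions = {"C3": (-1.0, 0.0), "Cz": (0.0, 0.0), "C4": (1.0, 0.0)}
adj = build_adjacency(positions, radius=1.0)
smoothed = aggregate(adj, [[1.0], [2.0], [6.0]])  # one feature per electrode
```

A GNN layer would follow this aggregation with a learned linear map and a non-linearity; a graph-based pretext task could then, for instance, mask a node and predict it from its neighbours.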

SSL for Heterogeneous EEG. The ultimate goal of SSL for EEG analysis is to generate general representations for various downstream tasks. However, EEG signals are collected in many different scenarios, which vary in channel layout, device, sampling rate, task, subject, and distribution. The significant differences between EEG signals from different sources make collaborative self-supervised training challenging. Therefore, constructing SSL frameworks tailored to heterogeneous EEG data is an important direction for future development. Work in this direction can jointly pre-train a model on heterogeneous EEG samples from multiple sources, making full use of the existing, differing EEG datasets to mine universal representations for different downstream tasks.
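One simple baseline for handling such heterogeneity is to harmonize recordings before pre-training: resample every recording to a shared sampling rate and map it onto a shared channel list. The sketch below assumes linear interpolation (no anti-aliasing filter, which real pipelines would add before downsampling) and zero-filling of missing channels; both choices and all names are illustrative assumptions rather than a method from the reviewed literature.

```python
from typing import Dict, List


def resample_linear(signal: List[float], src_rate: float, dst_rate: float) -> List[float]:
    """Resample by linear interpolation between neighbouring samples."""
    n_out = int(round(len(signal) * dst_rate / src_rate))
    last = len(signal) - 1
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate        # fractional index into the source signal
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out


def harmonize(recording: Dict[str, List[float]], src_rate: float,
              dst_rate: float, montage: List[str]) -> Dict[str, List[float]]:
    """Map a recording onto a common montage and sampling rate.

    Channels absent from the recording are zero-filled (a hypothetical choice).
    """
    n_in = len(next(iter(recording.values())))
    n_out = int(round(n_in * dst_rate / src_rate))
    return {
        ch: resample_linear(recording[ch], src_rate, dst_rate) if ch in recording
        else [0.0] * n_out
        for ch in montage
    }


common = harmonize({"Fz": [0, 1, 2, 3]}, src_rate=4, dst_rate=2, montage=["Fz", "Cz"])
```

Such preprocessing only removes differences in format. Differences in device, subject, and task distribution remain and are what a dedicated heterogeneous SSL framework would need to address.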

Multimodal SSL. SSL for EEG signals is a unimodal approach that aims to extract neural information from unlabeled EEG samples. However, the features mined from EEG signals alone are difficult to adapt to some complex downstream tasks, which require other brain or physiological signals to provide richer information. Therefore, EEG-based multimodal self-supervised learning methods need further study to extract integrated and aligned features from unlabeled multimodal signals (ECG, EMG, EOG, etc.) for challenging downstream tasks.

9. Conclusion

This paper is a comprehensive review of self-supervised learning for EEG analysis, covering a taxonomy, the different kinds of existing EEG-based SSL methods, downstream EEG tasks, and the available training datasets, and it offers detailed guidelines for researchers interested in combining deep learning with EEG analysis. We first review typical SSL frameworks and pretext tasks in CV and NLP and introduce traditional supervised EEG analysis methods as preliminaries, to illustrate the drawbacks of supervised EEG analysis and underscore the necessity of introducing SSL for EEG analysis. We then provide a detailed exposition of four categories of SSL frameworks for EEG analysis, elucidating the technical details of representative methods that extract spatial-temporal-frequency features from EEG signals. Subsequently, we enumerate the EEG-based downstream tasks for which SSL frameworks are effective and present relevant EEG datasets suitable for pre-training or downstream fine-tuning. Finally, we discuss the challenges in the existing studies and propose new insights and potential future directions that warrant exploration, which can help generate more general and explainable representations for solving various complex downstream tasks.

References

Accou et al. (2023) Bernd Accou, Tom Francart, et al. 2023. Self-supervised enhancement of stimulus-evoked brain response data. arXiv preprint arXiv:2302.01924 (2023).
Aguiar-Conraria and Soares (2014) Luís Aguiar-Conraria and Maria Joana Soares. 2014. The continuous wavelet transform: Moving beyond uni-and bivariate analysis. Journal of Economic Surveys 28, 2 (2014), 344–375.
Al-Quraishi et al. (2018) Maged S Al-Quraishi, Irraivan Elamvazuthi, Siti Asmah Daud, S Parasuraman, and Alberto Borboni. 2018. EEG-based control for upper and lower limb exoskeletons and prostheses: A systematic review. Sensors 18, 10 (2018), 3342.
Alotaiby et al. (2014) Turkey N Alotaiby, Saleh A Alshebeili, Tariq Alshawi, Ishtiaq Ahmad, and Fathi E Abd El-Samie. 2014. EEG seizure detection and prediction algorithms: a survey. EURASIP Journal on Advances in Signal Processing 2014 (2014), 1–21.
Altaheri et al. (2023) Hamdi Altaheri, Ghulam Muhammad, Mansour Alsulaiman, Syed Umar Amin, Ghadir Ali Altuwaijri, Wadood Abdul, Mohamed A Bencherif, and Mohammed Faisal. 2023. Deep learning techniques for classification of electroencephalogram (EEG) motor imagery (MI) signals: A review. Neural Computing and Applications 35, 20 (2023), 14681–14722.
Andrzejak et al. (2001) Ralph G Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian E Elger. 2001. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E 64, 6 (2001), 061907.
Aserinsky and Kleitman (1953) Eugene Aserinsky and Nathaniel Kleitman. 1953. Regularly occurring periods of eye motility, and concomitant phenomena, during sleep. Science 118, 3062 (1953), 273–274.
Babayan et al. (2018) A Babayan, M Erbey, D Kumral, JD Reinelt, AMF Reiter, J Röbbig, H Lina Schaare, M Uhlig, A Anwander, PL Bazin, et al. 2018. Data descriptor: a mind-brain-body dataset of MRI, EEG, cognition, emotion, and peripheral physiology in young and old adults. Sci. Data 6, 180308.
Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33 (2020), 12449–12460.
Bagchi and Mitra (2012) Sonali Bagchi and Sanjit K Mitra. 2012. The nonuniform discrete Fourier transform and its applications in signal processing. Vol. 463. Springer Science & Business Media.
Balconi and Lucchiari (2006) Michela Balconi and Claudio Lucchiari. 2006. EEG correlates (event-related desynchronization) of emotional face elaboration: a temporal analysis. Neuroscience letters 392, 1-2 (2006), 118–123.
Banville et al. (2019) Hubert Banville, Isabela Albuquerque, Aapo Hyvärinen, Graeme Moffat, Denis-Alexander Engemann, and Alexandre Gramfort. 2019. Self-supervised representation learning from electroencephalography signals. In 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 1–6.
Banville et al. (2021) Hubert Banville, Omar Chehab, Aapo Hyvärinen, Denis-Alexander Engemann, and Alexandre Gramfort. 2021. Uncovering the structure of clinical EEG signals with self-supervised learning. Journal of Neural Engineering 18, 4 (2021), 046020.
Bhat and Hortal (2021) Sudhanva Bhat and Enrique Hortal. 2021. Gan-based data augmentation for improving the classification of eeg signals. In The 14th pervasive technologies related to assistive environments conference. 453–458.
Biswal et al. (2018) Siddharth Biswal, Haoqi Sun, Balaji Goparaju, M Brandon Westover, Jimeng Sun, and Matt T Bianchi. 2018. Expert-level sleep scoring with deep neural networks. Journal of the American Medical Informatics Association 25, 12 (2018), 1643–1650.
Bollens et al. (2023) Lies Bollens, Bernd Accou, Marlies Gillis, Wendy Verheijen, Tom Francart, et al. 2023. SparrKULee: A Speech-evoked Auditory Response Repository of the KU Leuven, containing EEG of 85 participants. (2023).
Boostani et al. (2017) Reza Boostani, Foroozan Karimzadeh, and Mohammad Nami. 2017. A comparative review on sleep stage classification methods in patients and healthy individuals. Computer methods and programs in biomedicine 140 (2017), 77–91.
Bos et al. (2006) Danny Oude Bos et al. 2006. EEG-based emotion recognition. The influence of visual and auditory stimuli 56, 3 (2006), 1–17.
Brüsch et al. (2023) Thea Brüsch, Mikkel N Schmidt, and Tommy S Alstrøm. 2023. Multi-view self-supervised learning for multivariate variable-channel time series. In 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 1–6.
Cai et al. (2023) Donghong Cai, Junru Chen, Yang Yang, Teng Liu, and Yafeng Li. 2023. MBrain: A Multi-channel Self-Supervised Learning Framework for Brain Signals. arXiv preprint arXiv:2306.13102 (2023).
Chang et al. (2022) Shuohua Chang, Zhihong Yang, Yuyang You, and Xiaoyu Guo. 2022. Dssnet: A deep sequential sleep network for self-supervised representation learning based on single-channel eeg. IEEE Signal Processing Letters 29 (2022), 2143–2147.
Chen et al. (2023) He Chen, Ouyang Gaoxiang, and Xiaoli Li. 2023. Extracting Temporal-Spectral-Spatial Representation of EEG Using Self-Supervised Learning for the Identification of Children with ASD. In 2023 IEEE 13th International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER). IEEE, 1263–1266.
Chen et al. (2022b) Junru Chen, Yang Yang, Tao Yu, Yingying Fan, Xiaolong Mo, and Carl Yang. 2022b. Brainnet: Epileptic wave detection from seeg with hierarchical graph diffusion learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2741–2751.
Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607.
Chen et al. (2022a) Xun Chen, Chang Li, Aiping Liu, Martin J McKeown, Ruobing Qian, and Z Jane Wang. 2022a. Toward open-world electroencephalogram decoding via deep learning: A comprehensive survey. IEEE Signal Processing Magazine 39, 2 (2022), 117–134.
Cheng et al. (2020) Joseph Y Cheng, Hanlin Goh, Kaan Dogrusoz, Oncel Tuzel, and Erdrin Azemi. 2020. Subject-aware contrastive learning for biosignals. arXiv preprint arXiv:2007.04871 (2020).
Chien et al. (2022) Hsiang-Yun Sherry Chien, Hanlin Goh, Christopher M Sandino, and Joseph Y Cheng. 2022. MAEEG: Masked Auto-encoder for EEG Representation Learning. arXiv preprint arXiv:2211.02625 (2022).
Choi et al. (2020) Junhyuk Choi, Keun Tae Kim, Ji Hyeok Jeong, Laehyun Kim, Song Joo Lee, and Hyungmin Kim. 2020. Developing a motor imagery-based real-time asynchronous hybrid BCI controller for a lower-limb exoskeleton. Sensors 20, 24 (2020), 7309.
Cimtay and Ekmekcioglu (2020) Yucel Cimtay and Erhan Ekmekcioglu. 2020. Investigating the use of pretrained convolutional neural network on cross-subject and cross-dataset EEG emotion recognition. Sensors 20, 7 (2020), 2034.
Craik et al. (2019) Alexander Craik, Yongtian He, and Jose L Contreras-Vidal. 2019. Deep learning for electroencephalogram (EEG) classification tasks: a review. Journal of neural engineering 16, 3 (2019), 031001.
Creswell et al. (2018) Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. 2018. Generative adversarial networks: An overview. IEEE signal processing magazine 35, 1 (2018), 53–65.
Das et al. (2022) Sudip Das, Pankaj Pandey, and Krishna Prasad Miyapuram. 2022. Improving self-supervised pretraining models for epileptic seizure detection from EEG data. arXiv preprint arXiv:2207.06911 (2022).
Défossez et al. (2023) Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, and Jean-Rémi King. 2023. Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence (2023), 1–11.
Dement and Kleitman (1957) William Dement and Nathaniel Kleitman. 1957. Cyclic variations in EEG during sleep and their relation to eye movements, body motility, and dreaming. Electroencephalography and clinical neurophysiology 9, 4 (1957), 673–690.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Du et al. (2015) Nguyen Duy Du, Nguyen Hoang Huy, and Nguyen Xuan Hoai. 2015. The impact of high dimensionality on SVM when classifying ERP data-a solution from LDA. In Proceedings of the 6th International Symposium on Information and Communication Technology. 32–37.
Duan et al. (2013) Ruo-Nan Duan, Jia-Yi Zhu, and Bao-Liang Lu. 2013. Differential entropy feature for EEG-based emotion classification. In 6th International IEEE/EMBS Conference on Neural Engineering (NER). IEEE, 81–84.
Eldele et al. (2021) Emadeldeen Eldele, Mohamed Ragab, Zhenghua Chen, Min Wu, Chee Keong Kwoh, Xiaoli Li, and Cuntai Guan. 2021. Time-series representation learning via temporal and contextual contrasting. arXiv preprint arXiv:2106.14112 (2021).
Ericsson et al. (2022) Linus Ericsson, Henry Gouk, Chen Change Loy, and Timothy M Hospedales. 2022. Self-supervised representation learning: Introduction, advances, and challenges. IEEE Signal Processing Magazine 39, 3 (2022), 42–62.
Fahimi et al. (2019) Fatemeh Fahimi, Zhuo Zhang, Wooi Boon Goh, Kai Keng Ang, and Cuntai Guan. 2019. Towards EEG generation using GANs for BCI applications. In 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI). IEEE, 1–4.
Fu et al. (2022) Ruiqi Fu, Yi-Feng Chen, Yongqi Huang, Shu** Chen, Feiyan Duan, Jiewei Li, Jianhui Wu, Dongmei Jiang, Junling Gao, Jason Gu, et al. 2022. Symmetric convolutional and adversarial neural network enables improved mental stress classification from EEG. IEEE Transactions on Neural Systems and Rehabilitation Engineering 30 (2022), 1384–1400.
Gao et al. ([n. d.]) Wei Gao, Zhengqing Hu, Yu Lei, Changming Wang, Fangbing Qiu, Yanqing Liu, and Lin Han. [n. d.]. A Multi-Channel Sleep Staging Method Based on Self-Supervised Learning. Available at SSRN 4580453 ([n. d.]).
Ge et al. (2021) Wendong Ge, Jin Jing, Sungtae An, Aline Herlopian, Marcus Ng, Aaron F Struck, Brian Appavu, Emily L Johnson, Gamaleldin Osman, Hiba A Haider, et al. 2021. Deep active learning for interictal ictal injury continuum EEG patterns. Journal of neuroscience methods 351 (2021), 108966.
Ghassemi et al. (2018) Mohammad M Ghassemi, Benjamin E Moody, Li-Wei H Lehman, Christopher Song, Qiao Li, Haoqi Sun, Roger G Mark, M Brandon Westover, and Gari D Clifford. 2018. You snooze, you win: the physionet/computing in cardiology challenge 2018. In 2018 Computing in Cardiology Conference (CinC), Vol. 45. IEEE, 1–4.
Gidaris et al. (2018) Spyros Gidaris, Praveer Singh, and Nikos Komodakis. 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018).
Gifford et al. (2022) Alessandro T Gifford, Kshitij Dwivedi, Gemma Roig, and Radoslaw M Cichy. 2022. A large and rich EEG dataset for modeling human visual object recognition. NeuroImage 264 (2022), 119754.
Goldberger et al. (2000) Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. 2000. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. circulation 101, 23 (2000), e215–e220.
Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in neural information processing systems 27 (2014).
Gotman (1982) Jean Gotman. 1982. Automatic recognition of epileptic seizures in the EEG. Electroencephalography and clinical Neurophysiology 54, 5 (1982), 530–540.
Gramfort et al. (2021) Alexandre Gramfort, Hubert Banville, Omar Chehab, Aapo Hyvärinen, and Denis Engemann. 2021. Learning with self-supervision on EEG data. In 2021 9th International Winter Conference on Brain-Computer Interface (BCI). IEEE, 1–2.
Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33 (2020), 21271–21284.
Guo et al. (2023) Yunfei Guo, Tao Zhang, and Wu Huang. 2023. Emotion recognition based on multi-modal electrophysiology multi-head attention Contrastive Learning. arXiv preprint arXiv:2308.01919 (2023).
Hallgarten et al. (2023) Philipp Hallgarten, David Bethge, Ozan Özdenizci, Tobias Grosse-Puppendahl, and Enkelejda Kasneci. 2023. TS-MoCo: Time-Series Momentum Contrast for Self-Supervised Physiological Representation Learning. In 2023 31st European Signal Processing Conference (EUSIPCO). IEEE, 1030–1034.
Han et al. (2021) Jinpei Han, Xiao Gu, and Benny Lo. 2021. Semi-supervised contrastive learning for generalizable motor imagery eeg classification. In 2021 IEEE 17th International Conference on Wearable and Implantable Body Sensor Networks (BSN). IEEE, 1–4.
Harpale and Bairagi (2016) Varsha K Harpale and Vinayak K Bairagi. 2016. Time and frequency domain analysis of EEG signals for seizure detection: A review. In 2016 International Conference on Microelectronics, Computing and Communications (MicroCom). IEEE, 1–6.
He et al. (2022) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16000–16009.
He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738.
Henaff (2020) Olivier Henaff. 2020. Data-efficient image recognition with contrastive predictive coding. In International conference on machine learning. PMLR, 4182–4192.
Hinton and Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. science 313, 5786 (2006), 504–507.
Hjelm et al. (2018) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018).
Ho and Armanfard (2023) Thi Kieu Khanh Ho and Narges Armanfard. 2023. Self-supervised learning for anomalous channel detection in EEG graphs: application to seizure analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 7866–7874.
Hojjati (2023) Amirabbas Hojjati. 2023. A Multi-View Self-Supervised Approach to Learn Representations of EEG Data for Downstream Prediction Tasks. Master’s thesis. NTNU.
Hosseini et al. (2020) Mohammad-Parsa Hosseini, Amin Hosseini, and Kiarash Ahi. 2020. A review on machine learning for EEG signal processing in bioengineering. IEEE reviews in biomedical engineering 14 (2020), 204–218.
Huang et al. (2023) Baichuan Huang, Renato Zanetti, Azra Abtahi, David Atienza, and Amir Aminifar. 2023. Epilepsynet: Interpretable self-supervised seizure detection for low-power wearable systems. In 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 1–5.
Jackson and Bolger (2014) Alice F Jackson and Donald J Bolger. 2014. The neurophysiological bases of EEG and EEG measurement: A review for the rest of us. Psychophysiology 51, 11 (2014), 1061–1071.
Jaiswal et al. (2020) Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. 2020. A survey on contrastive self-supervised learning. Technologies 9, 1 (2020), 2.
Jiang et al. (2021) Xue Jiang, Jianhui Zhao, Bo Du, and Zhiyong Yuan. 2021. Self-supervised contrastive learning for EEG-based sleep staging. In 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
Jiao et al. (2020) Yingying Jiao, Yini Deng, Yun Luo, and Bao-Liang Lu. 2020. Driver sleepiness detection from EEG and EOG signals using GAN and LSTM networks. Neurocomputing 408 (2020), 100–111.
Jo et al. (2023) Sangmin Jo, Jaehyun Jeon, Seungwoo Jeong, and Heung-Il Suk. 2023. Channel-Aware Self-Supervised Learning for EEG-based BCI. In 2023 11th International Winter Conference on Brain-Computer Interface (BCI). IEEE, 1–4.
Kalafatovich et al. (2020) Jenifer Kalafatovich, Minji Lee, and Seong-Whan Lee. 2020. Decoding visual recognition of objects from eeg signals based on attention-driven convolutional neural network. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2985–2990.
Kan et al. (2023) Haoning Kan, Jiale Yu, Jiajin Huang, Zihe Liu, Heqian Wang, and Haiyan Zhou. 2023. Self-supervised group meiosis contrastive learning for eeg-based emotion recognition. Applied Intelligence (2023), 1–19.
Katsigiannis and Ramzan (2017) Stamos Katsigiannis and Naeem Ramzan. 2017. DREAMER: A database for emotion recognition through EEG and ECG signals from wireless low-cost off-the-shelf devices. IEEE journal of biomedical and health informatics 22, 1 (2017), 98–107.
Kemp et al. (2000) Bob Kemp, Aeilko H Zwinderman, Bert Tuk, Hilbert AC Kamphuisen, and Josefien JL Oberye. 2000. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the EEG. IEEE Transactions on Biomedical Engineering 47, 9 (2000), 1185–1194.
Kendall et al. (2018) Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7482–7491.
Khalighi et al. (2016) Sirvan Khalighi, Teresa Sousa, José Moutinho Santos, and Urbano Nunes. 2016. ISRUC-Sleep: A comprehensive public dataset for sleep researchers. Computer methods and programs in biomedicine 124 (2016), 180–192.
Khan et al. (2022) Hassan Aqeel Khan, Rahat Ul Ain, Awais Mehmood Kamboh, Hammad Tanveer Butt, Saima Shafait, Wasim Alamgir, Didier Stricker, and Faisal Shafait. 2022. The NMT scalp EEG dataset: an open-source annotated dataset of healthy and pathological EEG recordings for predictive modeling. Frontiers in neuroscience 15 (2022), 755817.
Ko and Suk (2022) Wonjun Ko and Heung-Il Suk. 2022. Eeg-oriented self-supervised learning and cluster-aware adaptation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 4143–4147.
Koelstra et al. (2011) Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. 2011. Deap: A database for emotion analysis; using physiological signals. IEEE transactions on affective computing 3, 1 (2011), 18–31.
Kostas et al. (2021) Demetres Kostas, Stephane Aroca-Ouellette, and Frank Rudzicz. 2021. BENDR: using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data. Frontiers in Human Neuroscience 15 (2021), 653659.
Kraskov et al. (2004) Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. 2004. Estimating mutual information. Physical review E 69, 6 (2004), 066138.
Kumar et al. (2022) Vamsi Kumar, Likith Reddy, Shivam Kumar Sharma, Kamalaker Dadi, Chiranjeevi Yarra, Raju S Bapi, and Srijithesh Rajendran. 2022. mulEEG: a multi-view representation learning on EEG signals. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 398–407.
Lee et al. (2022) Harim Lee, Eunseon Seong, and Dong-Kyu Chae. 2022. Self-supervised learning with attention-based latent signal augmentation for sleep staging with limited labeled data. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, LD Raedt, Ed. International Joint Conferences on Artificial Intelligence Organization, Vol. 7. 3868–3876.
Lee et al. (2019) Min-Ho Lee, O-Yeon Kwon, Yong-Jeong Kim, Hong-Kyung Kim, Young-Eun Lee, John Williamson, Siamac Fazli, and Seong-Whan Lee. 2019. EEG dataset and OpenBMI toolbox for three BCI paradigms: An investigation into BCI illiteracy. GigaScience 8, 5 (2019), giz002.
Lesaja et al. (2022) Srdjan Lesaja, Morgan Stuart, Jerry J Shih, Pedram Z Soroush, Tanja Schultz, Milos Manic, and Dean J Krusienski. 2022. Self-Supervised Learning of Neural Speech Representations From Unlabeled Intracranial Signals. IEEE Access 10 (2022), 133526–133538.
Li et al. (2022c) Rui Li, Yiting Wang, Wei-Long Zheng, and Bao-Liang Lu. 2022c. A Multi-view Spectral-Spatial-Temporal Masked Autoencoder for Decoding Emotions with Self-supervised Learning. In Proceedings of the 30th ACM International Conference on Multimedia. 6–14.
Li and Metsis (2022) Xiaomin Li and Vangelis Metsis. 2022. Spp-eegnet: An input-agnostic self-supervised eeg representation model for inter-dataset transfer learning. In International Conference on Computing and Information Technology. Springer, 173–182.
Li et al. (2022a) Yang Li, Ji Chen, Fu Li, Boxun Fu, Hao Wu, Youshuo Ji, Yijun Zhou, Yi Niu, Guangming Shi, and Wenming Zheng. 2022a. GMSS: Graph-based multi-task self-supervised learning for EEG emotion recognition. IEEE Transactions on Affective Computing (2022).
Li et al. (2022b) Yamei Li, Shengqiong Luo, Haibo Zhang, Yinkai Zhang, Yuan Zhang, and Benny Lo. 2022b. MtCLSS: Multi-Task Contrastive Learning for Semi-Supervised Pediatric Sleep Staging. IEEE Journal of Biomedical and Health Informatics (2022).
Li et al. (2019) Yitong Li, Michael Murias, Samantha Major, Geraldine Dawson, and David Carlson. 2019. On target shift in adversarial domain adaptation. In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 616–625.
Liu et al. (2020) Junxiu Liu, Guopei Wu, Yuling Luo, Senhui Qiu, Su Yang, Wei Li, and Yifei Bi. 2020. EEG-based emotion classification using a deep neural network and sparse autoencoder. Frontiers in Systems Neuroscience 14 (2020), 43.
Liu et al. (2022) Yixin Liu, Ming Jin, Shirui Pan, Chuan Zhou, Yu Zheng, Feng Xia, and Philip S Yu. 2022. Graph self-supervised learning: A survey. IEEE Transactions on Knowledge and Data Engineering 35, 6 (2022), 5879–5900.
López et al. (2017) Silvia López, I Obeid, and J Picone. 2017. Automated interpretation of abnormal adult electroencephalograms. Ph. D. Dissertation.
Miranda-Correa et al. (2018) Juan Abdon Miranda-Correa, Mojtaba Khomami Abadi, Nicu Sebe, and Ioannis Patras. 2018. Amigos: A dataset for affect, personality and mood research on individuals and groups. IEEE Transactions on Affective Computing 12, 2 (2018), 479–493.
Mirzaei and Ghasemi (2021) Sayeh Mirzaei and Parisa Ghasemi. 2021. EEG motor imagery classification using dynamic connectivity patterns and convolutional autoencoder. Biomedical Signal Processing and Control 68 (2021), 102584.
Mohsenvand et al. (2020) Mostafa Neo Mohsenvand, Mohammad Rasool Izadi, and Pattie Maes. 2020. Contrastive representation learning for electroencephalogram classification. In Machine Learning for Health. PMLR, 238–253.
Montero Quispe et al. (2022) Kevin G Montero Quispe, Daniel MS Utyiama, Eulanda M Dos Santos, Horácio ABF Oliveira, and Eduardo JP Souto. 2022. Applying self-supervised representation learning for emotion recognition using physiological signals. Sensors 22, 23 (2022), 9102.
Noroozi and Favaro (2016) Mehdi Noroozi and Paolo Favaro. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision. Springer, 69–84.
Oh et al. (2014) Seung-Hyeon Oh, Yu-Ri Lee, and Hyoung-Nam Kim. 2014. A novel EEG feature extraction method using Hjorth parameter. International Journal of Electronics and Electrical Engineering 2, 2 (2014), 106–110.
Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
O’reilly et al. (2014) Christian O’reilly, Nadia Gosselin, Julie Carrier, and Tore Nielsen. 2014. Montreal Archive of Sleep Studies: an open-access resource for instrument benchmarking and exploratory research. Journal of sleep research 23, 6 (2014), 628–635.
Ou et al. (2022) Yanghan Ou, Siqin Sun, Haitao Gan, Ran Zhou, and Zhi Yang. 2022. An improved self-supervised learning for EEG classification. Math. Biosci. Eng 19 (2022), 6907–6922.
Palo et al. (2015) HK Palo, Mihir Narayana Mohanty, and Mahesh Chandra. 2015. Use of different features for emotion recognition using MLP network. In Computational Vision and Robotics: Proceedings of ICCVR 2014. Springer, 7–15.
Partovi et al. (2023) Andi Partovi, Anthony N Burkitt, and David Grayden. 2023. A Self-Supervised Task-Agnostic Embedding for EEG Signals. In 2023 11th International IEEE/EMBS Conference on Neural Engineering (NER). IEEE, 1–4.
Peng et al. (2022) Ruimin Peng, Changming Zhao, Jun Jiang, Guangtao Kuang, Yuqi Cui, Yifan Xu, Hao Du, Jianbo Shao, and Dongrui Wu. 2022. TIE-EEGNet: Temporal information enhanced EEGNet for seizure subtype classification. IEEE Transactions on Neural Systems and Rehabilitation Engineering 30 (2022), 2567–2576.
Peng et al. (2023) Ruimin Peng, Changming Zhao, Yifan Xu, Jun Jiang, Guangtao Kuang, Jianbo Shao, and Dongrui Wu. 2023. WAVELET2VEC: A Filter Bank Masked Autoencoder for EEG-Based Seizure Subtype Classification. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
Petrantonakis and Hadjileontiadis (2009) Panagiotis C Petrantonakis and Leontios J Hadjileontiadis. 2009. Emotion recognition from EEG using higher order crossings. IEEE Transactions on information Technology in Biomedicine 14, 2 (2009), 186–197.
PhysioBank (2000) PhysioToolkit PhysioBank. 2000. Physionet: components of a new research resource for complex physiologic signals. Circulation 101, 23 (2000), e215–e220.
Pulver et al. (2023) Dustin Pulver, Prithila Angkan, Paul Hungler, and Ali Etemad. 2023. EEG-based Cognitive Load Classification using Feature Masked Autoencoding and Emotion Transfer Learning. In Proceedings of the 25th International Conference on Multimodal Interaction. 190–197.
Quan et al. (1997) Stuart F Quan, Barbara V Howard, Conrad Iber, James P Kiley, F Javier Nieto, George T O’Connor, David M Rapoport, Susan Redline, John Robbins, Jonathan M Samet, et al. 1997. The sleep heart health study: design, rationale, and methods. Sleep 20, 12 (1997), 1077–1085.
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
Rafiei et al. (2022) Mohammad H Rafiei, Lynne V Gauthier, Hojjat Adeli, and Daniel Takabi. 2022. Self-supervised learning for electroencephalography. IEEE Transactions on Neural Networks and Learning Systems (2022).
Roach and Mathalon (2008) Brian J Roach and Daniel H Mathalon. 2008. Event-related EEG time-frequency analysis: an overview of measures and an analysis of early gamma band phase locking in schizophrenia. Schizophrenia bulletin 34, 5 (2008), 907–926.
Rodenbeck et al. (2006) Andrea Rodenbeck, Ralf Binder, Peter Geisler, Heidi Danker-Hopfe, Reimer Lund, Friedhart Raschke, Hans-Günther Weeß, and Hartmut Schulz. 2006. A review of sleep EEG patterns. Part I: A compilation of amended rules for their visual recognition according to Rechtschaffen and Kales. Somnologie 10, 4 (2006), 159–175.
Sabbagh et al. (2020) David Sabbagh, Pierre Ablin, Gaël Varoquaux, Alexandre Gramfort, and Denis A Engemann. 2020. Predictive regression modeling with MEG/EEG: from source power to signals and cognitive states. NeuroImage 222 (2020), 116893.
Schalk et al. (2004) Gerwin Schalk, Dennis J McFarland, Thilo Hinterberger, Niels Birbaumer, and Jonathan R Wolpaw. 2004. BCI2000: a general-purpose brain-computer interface (BCI) system. IEEE Transactions on biomedical engineering 51, 6 (2004), 1034–1043.
Schneider et al. (2019) Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862 (2019).
Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 815–823.
Shah et al. (2018) Vinit Shah, Eva Von Weltin, Silvia Lopez, James Riley McHugh, Lillian Veloso, Meysam Golmohammadi, Iyad Obeid, and Joseph Picone. 2018. The temple university hospital seizure detection corpus. Frontiers in neuroinformatics 12 (2018), 83.
Shen et al. (2022) Xinke Shen, Xianggen Liu, Xin Hu, Dan Zhang, and Sen Song. 2022. Contrastive learning of subject-invariant eeg representations for cross-subject emotion recognition. IEEE Transactions on Affective Computing (2022).
Shoeb (2009) Ali Hossam Shoeb. 2009. Application of machine learning to epileptic seizure onset detection and treatment. Ph. D. Dissertation. Massachusetts Institute of Technology.
Shoeibi et al. (2021) Afshin Shoeibi, Navid Ghassemi, Roohallah Alizadehsani, Modjtaba Rouhani, Hossein Hosseini-Nejad, Abbas Khosravi, Maryam Panahiazar, and Saeid Nahavandi. 2021. A comprehensive comparison of handcrafted features and convolutional autoencoders for epileptic seizures detection in EEG signals. Expert Systems with Applications 163 (2021), 113788.
Singh and Malhotra (2022) Kuldeep Singh and Jyoteesh Malhotra. 2022. Smart neurocare approach for detection of epileptic seizures using deep learning based temporal analysis of EEG patterns. Multimedia Tools and Applications 81, 20 (2022), 29555–29586.
Siuly et al. (2016) Siuly Siuly, Yan Li, and Yanchun Zhang. 2016. EEG signal analysis and classification. IEEE Trans Neural Syst Rehabilit Eng 11 (2016), 141–144.
Soleymani et al. (2011) Mohammad Soleymani, Jeroen Lichtenauer, Thierry Pun, and Maja Pantic. 2011. A multimodal database for affect recognition and implicit tagging. IEEE transactions on affective computing 3, 1 (2011), 42–55.
Song et al. (2019) Tengfei Song, Wenming Zheng, Cheng Lu, Yuan Zong, Xilei Zhang, and Zhen Cui. 2019. MPED: A multi-modal physiological emotion database for discrete emotion recognition. IEEE Access 7 (2019), 12177–12191.
Song et al. (2023) Yonghao Song, Bingchuan Liu, Xiang Li, Nanlin Shi, Yijun Wang, and Xiaorong Gao. 2023. Decoding Natural Images from EEG for Object Recognition. arXiv preprint arXiv:2308.13234 (2023).
Tangermann et al. (2012) Michael Tangermann, Klaus-Robert Müller, Ad Aertsen, Niels Birbaumer, Christoph Braun, Clemens Brunner, Robert Leeb, Carsten Mehring, Kai J Miller, Gernot Mueller-Putz, et al. 2012. Review of the BCI competition IV. Frontiers in neuroscience (2012), 55.
Temko et al. (2015) Andriy Temko, Achintya Sarkar, and Gordon Lightbody. 2015. Detection of seizures in intracranial EEG: UPenn and Mayo Clinic’s seizure detection challenge. In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 6582–6585.
Teplan et al. (2002) Michal Teplan et al. 2002. Fundamentals of EEG measurement. Measurement science review 2, 2 (2002), 1–11.
Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. 2000. The information bottleneck method. arXiv preprint physics/0004057 (2000).
Übeyli (2009) Elif Derya Übeyli. 2009. Statistics over features: EEG signals analysis. Computers in Biology and Medicine 39, 8 (2009), 733–741.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
Wagh et al. (2021) Neeraj Wagh, Jionghao Wei, Samarth Rawal, Brent Berry, Leland Barnard, Benjamin Brinkmann, Gregory Worrell, David Jones, and Yogatheesan Varatharajah. 2021. Domain-guided self-supervision of eeg data improves downstream classification performance and generalizability. In Machine Learning for Health. PMLR, 130–142.
Wang et al. (2023) Xingyi Wang, Yuliang Ma, Jared Cammon, Feng Fang, Yunyuan Gao, and Yingchun Zhang. 2023. Self-Supervised EEG Emotion Recognition Models Based on CNN. IEEE Transactions on Neural Systems and Rehabilitation Engineering 31 (2023), 1952–1962.
Wang and Qi (2022) Xiao Wang and Guo-Jun Qi. 2022. Contrastive learning with stronger augmentations. IEEE transactions on pattern analysis and machine intelligence 45, 5 (2022), 5549–5560.
Wen and Zhang (2018) Tingxi Wen and Zhongnan Zhang. 2018. Deep convolution neural network and autoencoders-based unsupervised feature learning of EEG signals. IEEE Access 6 (2018), 25399–25410.
Weng et al. (2022) Weining Weng, Yang Gu, Yiqiang Chen, Guoqiang Wang, and Nianfeng Shi. 2022. An Efficient Spatial-Temporal Representation Method for EEG Emotion Recognition. In 2022 IEEE Smartworld, Ubiquitous Intelligence & Computing, Scalable Computing & Communications, Digital Twin, Privacy Computing, Metaverse, Autonomous & Trusted Vehicles (SmartWorld/UIC/ScalCom/DigitalTwin/PriComp/Meta). IEEE, 458–467.
Weng et al. (2023) Weining Weng, Yang Gu, Qihui Zhang, Yingying Huang, Chunyan Miao, and Yiqiang Chen. 2023. A Knowledge-Driven Cross-view Contrastive Learning for EEG Representation. arXiv preprint arXiv:2310.03747 (2023).
Wu et al. (2022) Di Wu, Siyuan Li, Jie Yang, and Mohamad Sawan. 2022. neuro2vec: Masked fourier spectrum prediction for neurophysiological representation learning. arXiv preprint arXiv:2204.12440 (2022).
Wu et al. (2020) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. 2020. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems 32, 1 (2020), 4–24.
Xi et al. (2022) Liang Xi, Zichao Yun, Han Liu, Ruidong Wang, Xunhua Huang, and Haoyi Fan. 2022. Semi-supervised time series classification model with self-supervised learning. Engineering Applications of Artificial Intelligence 116 (2022), 105331.
Xiao et al. (2021) Qinfeng Xiao, Jing Wang, Jianan Ye, Hongjun Zhang, Yuyan Bu, Yiqiong Zhang, and Hao Wu. 2021. Self-supervised learning for sleep stage classification with predictive and discriminative contrastive coding. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1290–1294.
Xiao et al. (2024) Tiantian Xiao, Ziwei Wang, Yongfeng Zhang, Shuai Wang, Hailing Feng, Yanna Zhao, et al. 2024. Self-supervised Learning with Attention Mechanism for EEG-based seizure detection. Biomedical Signal Processing and Control 87 (2024), 105464.
Xu et al. (2020) Junjie Xu, Yaojia Zheng, Yifan Mao, Ruixuan Wang, and Wei-Shi Zheng. 2020. Anomaly detection on electroencephalography with self-supervised learning. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 363–368.
Yang et al. (2023) Chaoqi Yang, Cao Xiao, M Brandon Westover, Jimeng Sun, et al. 2023. Self-supervised electroencephalogram representation learning for automatic sleep staging: model development and evaluation study. JMIR AI 2, 1 (2023), e46769.
Ye et al. (2021) Jianan Ye, Qinfeng Xiao, Jing Wang, Hongjun Zhang, Jiaoxue Deng, and Youfang Lin. 2021. Cosleep: A multi-view representation learning framework for self-supervised learning of sleep stage classification. IEEE Signal Processing Letters 29 (2021), 189–193.
Ye et al. (2023) Weishan Ye, Zhiguo Zhang, Min Zhang, Fei Teng, Li Zhang, Linling Li, Gan Huang, Jianhong Wang, Dong Ni, and Zhen Liang. 2023. Semi-Supervised Dual-Stream Self-Attentive Adversarial Graph Contrastive Learning for Cross-Subject EEG-based Emotion Recognition. arXiv preprint arXiv:2308.11635 (2023).
You et al. (2023) Yuyang You, Shuohua Chang, Zhihong Yang, and Qihang Sun. 2023. PSNSleep: a self-supervised learning method for sleep staging based on Siamese networks with only positive sample pairs. Frontiers in Neuroscience 17 (2023), 1167723.
Zbontar et al. (2021) Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. 2021. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning. PMLR, 12310–12320.
Zhai et al. (2018) Junhai Zhai, Sufang Zhang, Junfen Chen, and Qiang He. 2018. Autoencoder and its various variants. In 2018 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, 415–419.
Zhang et al. (2022a) Hongjun Zhang, Jing Wang, Jiahong Xiong, Yuxuan Ding, Zhenliang Gan, and Youfang Lin. 2022a. Expert knowledge inspired contrastive learning for sleep staging. In 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–6.
Zhang et al. (2008) Lin Zhang, Jonathan Samet, Brian Caffo, Isaac Bankman, and Naresh M Punjabi. 2008. Power spectral analysis of EEG activity during sleep in cigarette smokers. Chest 133, 2 (2008), 427–432.
Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. 2016. Colorful image colorization. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14. Springer, 649–666.
Zhang and Chen (2016) Tao Zhang and Wanzhong Chen. 2016. LMD based features for the automatic seizure detection of EEG signals using SVM. IEEE Transactions on Neural Systems and Rehabilitation Engineering 25, 8 (2016), 1100–1108.
Zhang et al. (2022b) Wenrui Zhang, Ling Yang, Shijia Geng, and Shenda Hong. 2022b. Self-Supervised Time Series Representation Learning via Cross Reconstruction Transformer. arXiv preprint arXiv:2205.09928 (2022).
Zhang et al. (2022c) Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. 2022c. Self-supervised contrastive pre-training for time series via time-frequency consistency. Advances in Neural Information Processing Systems 35 (2022), 3988–4003.
Zhang and Yang (2021) Yu Zhang and Qiang Yang. 2021. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering 34, 12 (2021), 5586–5609.
Zhang et al. (2022d) Zhi Zhang, Sheng-hua Zhong, and Yan Liu. 2022d. GANSER: A self-supervised data augmentation framework for EEG-based emotion recognition. IEEE Transactions on Affective Computing (2022).
Zheng et al. (2018) Wei-Long Zheng, Wei Liu, Yifei Lu, Bao-Liang Lu, and Andrzej Cichocki. 2018. Emotionmeter: A multimodal framework for recognizing human emotions. IEEE transactions on cybernetics 49, 3 (2018), 1110–1122.
Zheng and Lu (2015) Wei-Long Zheng and Bao-Liang Lu. 2015. Investigating Critical Frequency Bands and Channels for EEG-based Emotion Recognition with Deep Neural Networks. IEEE Transactions on Autonomous Mental Development 7, 3 (2015), 162–175. https://doi.org/10.1109/TAMD.2015.2431497
Zheng et al. (2022) Yaojia Zheng, Zhouwu Liu, Rong Mo, Ziyi Chen, Wei-shi Zheng, and Ruixuan Wang. 2022. Task-oriented self-supervised learning for anomaly detection in electroencephalography. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 193–203.
Zhu et al. (2023) Qiushi Zhu, Xiaoying Zhao, Jie Zhang, Yu Gu, Chao Weng, and Yuchen Hu. 2023. Eeg2vec: Self-Supervised Electroencephalographic Representation Learning. arXiv preprint arXiv:2305.13957 (2023).