License: arXiv.org perpetual non-exclusive license
arXiv:2401.05446v1 [eess.SP] 09 Jan 2024

Self-supervised Learning for Electroencephalogram: A Systematic Survey

Weining Weng, Yang Gu, Shuai Guo, Yuan Ma, Zhaohua Yang, Yuchen Liu, and Yiqiang Chen
Institute of Computing Technology, Chinese Academy of Sciences, Haidian, Beijing, China 100083
(2023)
Abstract.

Electroencephalogram (EEG) is a non-invasive technique to record bioelectrical signals. Integrating supervised deep learning techniques with EEG signals has recently facilitated automatic analysis across diverse EEG-based tasks. However, label issues have constrained the development of EEG-based deep models. Obtaining EEG annotations is difficult because it requires domain experts to guide collection and labeling, and the variability of EEG signals among different subjects causes significant label shifts. To address these challenges, self-supervised learning (SSL) has been proposed to extract representations from unlabeled samples through well-designed pretext tasks. This paper concentrates on integrating SSL frameworks with temporal EEG signals to achieve efficient representation and presents a systematic review of SSL for EEG signals. In this paper, 1) we introduce the concept and theory of self-supervised learning and typical SSL frameworks. 2) We provide a comprehensive review of SSL for EEG analysis, including the taxonomy, methodology, and technical details of the existing EEG-based SSL frameworks, and discuss the differences between these methods. 3) We investigate the adaptation of the SSL approach to various downstream tasks, including the task description and related benchmark datasets. 4) Finally, we discuss the potential directions for future SSL-EEG research.

Self-supervised learning, electroencephalogram, contrastive learning, representation learning
CCS Concepts: Theory of computation → Models of computation; Applied computing → Computational biology; Computing methodologies → Artificial intelligence

1. Introduction

Electroencephalography (EEG) is a neurophysiological technique that records and measures the brain’s electrical activity. EEG signals are collected non-invasively by placing electrodes on the scalp to measure and record the electrical impulses generated by the brain (Teplan et al., 2002). Because EEG signals are an external representation of inner neural activity and contain abundant information related to various brain stimuli, they have been widely studied for real-world tasks such as epilepsy recognition (Gotman, 1982), emotion recognition (Bos et al., 2006), sleep research (Rodenbeck et al., 2006), and brain-computer interface applications (Al-Quraishi et al., 2018). EEG is therefore a valuable tool in neuroscience with high clinical utility, and it has become a research focus among physiological signals.

Recently, with the rapid development of deep learning and artificial intelligence, machine learning and deep learning models have been trained on labeled samples to complete different classification (Siuly et al., 2016), regression (Sabbagh et al., 2020), and generation (Fahimi et al., 2019) tasks. The combination of intelligent algorithms and labeled EEG datasets under supervised learning has emerged as a powerful tool to enhance the analysis and interpretation of EEG data. Traditional machine learning methods such as Support Vector Machine (SVM), Random Forest, and Multi-Layer Perceptron (MLP) have demonstrated their efficiency in detecting significant patterns from different hand-crafted EEG features (Hosseini et al., 2020). Some simple EEG-based tasks, such as EEG-based event classification, emotion recognition, epilepsy detection, and motor imagery classification, can be automatically performed by machine learning models (Du et al., 2015; Palo et al., 2015; Zhang and Chen, 2016). End-to-end deep learning frameworks composed of the Convolutional Neural Network (CNN), Long Short-Term Memory network (LSTM), Transformer (Vaswani et al., 2017), and other networks are implemented to model the spatial correlation between electrodes and the temporal variation of the EEG signals. Deep learning methods contain more parameters and more complex network structures, giving them stronger learning and expressive abilities to extract physiological information and recognize complex patterns. Adequate labeled EEG data and powerful deep learning models are therefore critical elements for intelligent EEG analysis: given a large amount of high-quality labeled EEG data, deep models trained in a supervised manner can accomplish complex EEG tasks.

The most critical challenge that intelligent EEG analysis faces is the scarcity of labeled samples. Training deep models demands extensive labeled data, yet obtaining large-scale labeled EEG signals for model training is impractical. Annotating EEG signals requires manual intervention from experts well-versed in neurophysiology who are deeply familiar with the distinctive features of interest embedded within the EEG data. The high costs and the need for expert knowledge in the annotation process make constructing EEG datasets extremely challenging. In addition, the scarcity of specific brain states significantly affects the acquisition of EEG signals (Rafiei et al., 2022). For example, abnormal emotion states and seizure states are relatively rare among subjects, which makes sample collection more difficult. Building annotated EEG datasets for training deep models is therefore constrained by several factors: it requires the involvement of domain experts (Ge et al., 2021) and demands substantial time and cost (Chen et al., 2022a), which poses a significant challenge for the application of supervised learning in EEG analysis.

Figure 1. The general process of SSL-integrated EEG analysis. The black arrows represent forward propagation, the green and blue arrows denote backpropagation based on pretext task loss and downstream task loss, respectively.

Besides, supervised EEG analysis faces an inconsistency problem that severely impacts the effectiveness of supervised learning. Interpreting EEG signals often involves subjectivity and variability among subjects and evaluators (Cimtay and Ekmekcioglu, 2020). First, some tasks collect signals annotated by the participants themselves. Such labels have a strong subjective component and do not necessarily represent the actual states generated by the participants’ brains, leading to inconsistencies in the labels. Second, because each individual’s brain is distinct, brain signals vary substantially among subjects. This diversity may result in evaluative discrepancies in labels annotated by domain experts, as different experts might assign different labels to the same EEG segment (Li et al., 2019). Such variations introduce inconsistency into the labeled samples. Mitigating the influence of inconsistency during training and improving generalization ability have therefore become critical problems in EEG analysis.

Self-supervised learning (SSL), which leverages the intrinsic structure and information within data to train models without labels, has shown superior performance in addressing the challenges mentioned above. Self-supervised learning designs a series of pretext tasks, different from the final modeling target, that generate pseudo-labels directly from the unlabeled samples to train the model (Ericsson et al., 2022). In Computer Vision (CV) and Natural Language Processing (NLP), self-supervised learning has achieved tremendous success. In CV, SSL helps the model learn effective image representations through pretext tasks such as image rotation (Gidaris et al., 2018), jigsaw (Noroozi and Favaro, 2016), and reconstruction (He et al., 2022), which significantly improve downstream task performance and sample efficiency and mitigate the overfitting problem (Jaiswal et al., 2020). In NLP, the mask-reconstruction (MAE) and prompt-answering pretext tasks help the language model comprehensively understand textual context, enabling functions such as machine translation and conversation systems (Devlin et al., 2018). The strong representation ability and low labeled-sample requirements of the self-supervised learning paradigm therefore demonstrate its potential as an effective training method, offering new insights and tools for addressing complex problems in different domains.

Implementing SSL frameworks in the EEG field is attracting growing attention among researchers (Rafiei et al., 2022). Figure 1 illustrates a typical SSL-integrated EEG analysis method. Several studies have investigated the combination of SSL with EEG analysis, conducting a preliminary exploration of SSL for tasks based on temporal physiological signals. Accordingly, this paper comprehensively reviews the utilization of self-supervised learning for EEG analysis and provides an in-depth exploration of the taxonomy, the pros and cons, and the development potential of EEG-based SSL frameworks. The main contributions of this paper are listed as follows:

(1) Comprehensive review. This paper provides a comprehensive, up-to-date review of self-supervised learning integrated EEG analysis methods. We analyze the technical details of different SSL approaches for EEG signals, including the type of pretext tasks, the mathematical description, the performance of the SSL, and brief summaries. By comparing different methods, we outline the general process and characteristics of the EEG-based SSL methods.

(2) Systematic and reasonable taxonomy. Following the classical taxonomy of the traditional self-supervised learning methods, we rigorously categorize existing studies on self-supervised learning in EEG into four major classes: the prediction-based method, the generation-based method, the contrastive-based method, and the graph-based method.

(3) Future potential directions. We also analyze the pros and cons of various methods, identify the limitations of current works, and take into account the inherent characteristics of EEG data to indicate the potential directions for developing SSL-based EEG analysis.

2. Preliminary

This section provides a concise overview of traditional supervised EEG analysis methods. In addition, we outline the formal definition and mathematical description of the self-supervised learning frameworks proposed in other fields (CV, NLP), which serve as preliminaries for the EEG-based self-supervised methods.

2.1. Supervised EEG Analysis

EEG signals have been widely studied to decode brain activity for addressing various real-world tasks. For instance, EEG has been used to recognize specific emotions (Zheng and Lu, 2015), detect seizures (Alotaiby et al., 2014), classify sleep stages (Boostani et al., 2017), recognize motor imagery (Altaheri et al., 2023), decode visual or auditory information (Kalafatovich et al., 2020), etc. Supervised machine learning and deep learning methods have been widely adopted to analyze EEG signals, extract features, and complete specific tasks. Existing studies can be classified into two categories (Weng et al., 2022): feature-driven and model-driven methods. The feature-driven methods combine handcrafted features with traditional machine learning classifiers to interpret EEG signals, and the model-driven methods construct end-to-end deep learning models to automatically extract task-related EEG features.

Feature-driven methods. The feature-driven methods use specific features extracted from EEG signals to guide the analysis process. In general, they select handcrafted features that previous neuroscience research has proven effective for the task (Shoeibi et al., 2021). By feeding the selected features to traditional machine learning classifiers, the models can uncover patterns, relationships, and insights for understanding EEG signals and brain activity. Various handcrafted features fed into different models are extensively applied to multiple tasks. Examples include time-domain features such as the Hjorth parameters (Oh et al., 2014), higher-order crossings (Petrantonakis and Hadjileontiadis, 2009), and statistical features (Übeyli, 2009); frequency-domain features such as independent frequency bands (Cimtay and Ekmekcioglu, 2020) generated through the Fast Fourier Transform and differential entropy (Duan et al., 2013); and time-frequency features, which combine frequency features with a time window to capture how frequency features vary over time (Übeyli, 2009). Using these manually engineered features as input, machine learning models have demonstrated dependable performance in tasks such as emotion recognition, sleep stage classification, and motor imagery classification (Hosseini et al., 2020).
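As an illustration of a hand-crafted time-domain feature, the following minimal sketch computes the three Hjorth parameters (activity, mobility, complexity) of a single channel from their standard definitions. The synthetic 10 Hz sinusoid, the 250 Hz sampling rate, and all function names are assumptions made for this example and are not taken from the cited works.

```python
import math


def variance(values):
    """Population variance of a sequence of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def first_difference(values):
    """Discrete approximation of the time derivative."""
    return [b - a for a, b in zip(values, values[1:])]


def hjorth_parameters(signal):
    """Return (activity, mobility, complexity) for one EEG channel."""
    d1 = first_difference(signal)
    d2 = first_difference(d1)
    activity = variance(signal)                      # signal power
    mobility = math.sqrt(variance(d1) / activity)    # proxy for mean frequency
    complexity = math.sqrt(variance(d2) / variance(d1)) / mobility
    return activity, mobility, complexity


# Synthetic single-channel "EEG": a 10 Hz sinusoid sampled at 250 Hz for one second
# (illustrative values only).
fs = 250
signal = [math.sin(2 * math.pi * 10 * n / fs) for n in range(fs)]
activity, mobility, complexity = hjorth_parameters(signal)
print(activity, mobility, complexity)
```

For a pure sinusoid the complexity is close to 1, and it grows as the waveform departs from a sinusoidal shape. A feature-driven pipeline would compute such values per channel and pass the resulting vector to a classifier such as an SVM.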

Model-driven methods. Model-driven methods refer to approaches that incorporate deep end-to-end models to interpret and analyze temporal EEG raw data or the high dimensional EEG features. Deep models can capture specific spatial-temporal information to infer underlying brain dynamics, quantify brain activity, and complete complex EEG-based classification or regression tasks (Craik et al., 2019). The existing model-driven methods are typical examples of supervised deep learning approaches that rely extensively on a substantial volume of training samples. Owing to the powerful learning capabilities of deep learning and the assistance of extensively labeled samples, the efficacy of models has been further heightened across diverse complex EEG tasks.

Figure 2. The taxonomy of the typical self-supervised learning methods and self-supervised EEG analysis methods

2.2. Overview of Self-supervised Learning

Self-supervised learning can extract effective representation from unlabeled samples instead of directly training end-to-end models through labeled samples, which has shown its superior performance in learning spatial images and sequential context representation in the fields of CV and NLP. In this part, we outline the mathematical definition of the SSL, explain the terms of essential concepts, and briefly divide the existing SSL frameworks into four distinct categories based on the variation in pretext tasks.

Term explanation. We provide important definitions of terms to help further understand self-supervised learning.

  • Pretext task. The pretext tasks $T=\{t_{1},t_{2},...,t_{n}\}$ refers to the learning objective or task designed to leverage the content or structure within unlabeled data to help the model learn knowledge and effective representations. The learned representations can then be transferred to downstream tasks with limited labeled data.

  • Pseudo-label. The pseudo-label $Y_{p}$ is the artificial label created based on the pretext tasks to train the model. The pseudo-labels serve as a form of supervision that guides the self-supervised learning process to extract specific features from unlabeled samples.

  • Downstream task. The definition of downstream task $d_{t}$ is the target or final task to be performed using the features or representations learned from the previous phase of training (by pretext tasks). The downstream task typically requires labeled samples to fine-tune the previous model to transfer the representation model to become more specific and task-focused toward the downstream task.

  • Human-label. The human-label refers to the labels for the downstream samples annotated by human experts.

Mathematical definition. The objective of self-supervised learning is to learn a function $\mathcal{F}(x)\rightarrow\mathbb{R}^{d}$ that maps the input instances $X$ to a $d$-dimensional representation space $\mathbb{R}^{d}$ capturing essential features from unlabeled samples. SSL frameworks are generally regarded as encoder-decoder structures comprising an encoder $f_{\theta}$ that generates the representation and several decoders $g$ that decode the representation to complete different tasks: the pretext task decoder $g_{\delta}^{p}$ is cascaded with the encoder to accomplish the pretext task and pre-train the model without external labels, and the downstream task decoder $g_{\xi}^{d}$ recognizes specific patterns in the representation so that the model can be fine-tuned and adapted to downstream tasks. In general, the training paradigms of SSL can be summarized into three categories: 1. pre-train mode; 2. joint-train mode; 3. unsupervised-train mode. The first mode is the pre-train mode, which uses pretext tasks to pre-train the representation model and downstream tasks to fine-tune the encoder and downstream task decoder, transferring the model to specific tasks. The process can be expressed as follows:

(1) $$\begin{aligned}\theta,\delta&=\mathop{\arg\min}\limits_{\theta,\delta}\mathcal{L}_{pt}(g_{\delta}^{p}(f_{\theta}^{p}(X)),Y_{p})\\ \theta,\xi&=\mathop{\arg\min}\limits_{\theta,\xi}\mathcal{L}_{ft}(g_{\xi}^{d}(f_{\theta}^{p}(X)),Y)\end{aligned}$$

where $\mathcal{L}_{pt}$ and $\mathcal{L}_{ft}$ represent the loss of pretext task and downstream task, respectively. The encoder is trained by the pretext task and fine-tuned by the downstream task to first generate effective representation and then transfer the learned knowledge into the specific task. The downstream task decoder is trained by the downstream task loss to fully leverage the representation for target task completion.

The second mode is the joint-train (co-train) mode, where a joint loss function is constructed so that the pretext and downstream tasks train the model together. The pretext task collaboratively explores knowledge relevant to the downstream task and also serves as a regularization term that constrains the gradient during training, thereby mitigating the overfitting problem. This mode can be expressed as follows:

(2) $$\theta,\xi=\mathop{\arg\min}\limits_{\theta,\delta,\xi}\alpha\mathcal{L}(g_{\delta}^{p}(f_{\theta}^{p}(X)),Y_{p})+\beta\mathcal{L}(g_{\xi}^{d}(f_{\theta}^{p}(X)),Y)$$

where $\alpha$ and $\beta$ are the hyper-parameters that balance the two losses.

The third mode is the unsupervised-train mode, which is similar to the pre-train mode, but the parameters of the encoder are frozen during the fine-tuning stage. This mode only fine-tunes the downstream task decoder to verify the generated representations’ effectiveness. The process can be formulated as follows:

(3) $$\begin{aligned}\theta,\delta&=\mathop{\arg\min}\limits_{\theta,\delta}\mathcal{L}_{pt}(g_{\delta}^{p}(f_{\theta}^{p}(X)),Y_{p})\\ \theta,\xi&=\mathop{\arg\min}\limits_{\xi}\mathcal{L}_{ft}(g_{\xi}^{d}(f_{\theta}^{p}(X)),Y)\end{aligned}$$
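The practical difference between the three training modes lies in which parameters each loss is allowed to update. The toy sketch below makes this concrete under strong simplifying assumptions: the encoder and both decoders are single scalar weights (`theta`, `delta`, `xi`), both losses are mean squared errors, optimization is plain gradient descent, and the data, pseudo-labels, and mode names are invented for illustration. It is not the training procedure of any surveyed method.

```python
def mse_and_grads(theta, head, xs, ys):
    """MSE of the scalar model head * theta * x, with gradients for theta and head."""
    n = len(xs)
    loss = grad_theta = grad_head = 0.0
    for x, y in zip(xs, ys):
        err = head * theta * x - y
        loss += err * err / n
        grad_theta += 2.0 * err * head * x / n
        grad_head += 2.0 * err * theta * x / n
    return loss, grad_theta, grad_head


def train(mode, xs, y_pseudo, y_true, steps=200, lr=0.05, alpha=0.5, beta=0.5):
    """Return (theta, delta, xi, theta_after_pretext) for one of the three modes."""
    theta, delta, xi = 0.5, 0.5, 0.5  # encoder, pretext decoder, downstream decoder
    if mode == "joint":
        # Eq. (2): one stage on alpha * L_pretext + beta * L_downstream.
        for _ in range(steps):
            _, gt_p, gd = mse_and_grads(theta, delta, xs, y_pseudo)
            _, gt_d, gx = mse_and_grads(theta, xi, xs, y_true)
            theta -= lr * (alpha * gt_p + beta * gt_d)
            delta -= lr * alpha * gd
            xi -= lr * beta * gx
        return theta, delta, xi, theta
    # Stage 1 of Eq. (1) and Eq. (3): the pretext loss updates theta and delta.
    for _ in range(steps):
        _, gt, gd = mse_and_grads(theta, delta, xs, y_pseudo)
        theta -= lr * gt
        delta -= lr * gd
    theta_after_pretext = theta
    # Stage 2: the downstream loss always updates xi, and theta only in pre-train mode.
    for _ in range(steps):
        _, gt, gx = mse_and_grads(theta, xi, xs, y_true)
        if mode == "pre-train":
            theta -= lr * gt
        xi -= lr * gx
    return theta, delta, xi, theta_after_pretext


xs = [-1.0, -0.5, 0.5, 1.0]
y_pseudo = [2.0 * x for x in xs]  # toy pseudo-labels Y_p
y_true = [3.0 * x for x in xs]    # toy human labels Y
results = {m: train(m, xs, y_pseudo, y_true)
           for m in ("pre-train", "joint", "unsupervised")}
```

In the unsupervised-train run the encoder weight after the second stage is identical to its value after the pretext stage, whereas in the pre-train run it moves during fine-tuning. That frozen encoder is why this mode is typically used to probe the quality of the learned representation itself.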

Within the pretext task based taxonomy of SSL, we can categorize the SSL method into four types: predictive-based, generative-based, contrastive-based, and hybrid SSL method. Figure 2 demonstrates the general taxonomy of SSL, and the detailed explanations are as follows:

Predictive-based SSL method. The predictive-based SSL method creates classification pretext tasks that predict discrete pseudo-labels generated from unlabeled data to learn effective features. For instance, pretext tasks like predicting image rotation angles (Gidaris et al., 2018) and pixel colors (Zhang et al., 2016) can force the model to extract spatial features and object boundaries in the images, which benefit downstream tasks such as object detection; pretext tasks like predicting the next sentence (Devlin et al., 2018) can help a language model understand contextual correlation. Because they are simple to execute, predictive-based SSL methods are usually easy to combine with traditional deep models, and proficient performance in the prediction tasks signifies that the model has mastered specific knowledge for downstream tasks.

Figure 3. Categories of Self-supervised learning for EEG analysis

Generative-based SSL method. The generative-based SSL method designs generation or reconstruction pretext tasks to capture contextual features and correlations to generate effective representations. The most widely used generative-based pretext task is the reconstruction task. This task begins by encoding the input sample into a distinctive representation, followed by a decoding process to reconstruct the original input. By making the input and output as similar as possible, the encoder can learn significant features to reconstruct the input, which are highly effective for target downstream tasks. For example, typical methods like autoencoder (Zhai et al., 2018) have been investigated to extract representations from image and textual data. Recently, the mask-reconstruction pretext task has supplanted traditional reconstruction tasks to extract the contextual information from unlabeled samples. This task masks part of the input samples and reconstructs the masked data through the contextual data, where the encoder is responsible for extracting features and generating representations, and the decoder is responsible for reconstructing the masked data. In vision tasks, the masked autoencoder (He et al., 2022) can extract spatial contextual features from unlabeled samples for downstream classification and segmentation. In language tasks, the BERT model captures token-level context correlation information, which greatly improves the performance of subsequent tasks such as machine translation and text generation.
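The mask-reconstruction pretext task can be sketched in a few lines. The example below hides a random subset of time points in a one-dimensional signal and scores a reconstruction only on the hidden positions, which is a common choice in masked autoencoders. The 30% mask ratio, zero-filling of masked points, the synthetic signal, and the function names are assumptions for illustration; no learned encoder or decoder is included, so two stand-in "reconstructions" are scored instead.

```python
import math
import random


def mask_signal(signal, mask_ratio, rng):
    """Return a copy with a random subset of time points set to 0, plus the masked indices."""
    n_masked = int(len(signal) * mask_ratio)
    masked_idx = sorted(rng.sample(range(len(signal)), n_masked))
    masked = list(signal)
    for i in masked_idx:
        masked[i] = 0.0
    return masked, masked_idx


def masked_mse(reconstruction, target, masked_idx):
    """Mean squared error restricted to the masked positions."""
    return sum((reconstruction[i] - target[i]) ** 2 for i in masked_idx) / len(masked_idx)


rng = random.Random(0)
signal = [math.sin(0.2 * n) + 2.0 for n in range(100)]  # offset keeps every value non-zero
masked, masked_idx = mask_signal(signal, mask_ratio=0.3, rng=rng)

# Stand-ins for a decoder output: the untouched masked input, and an ideal reconstruction.
loss_no_recovery = masked_mse(masked, signal, masked_idx)
loss_perfect = masked_mse(signal, signal, masked_idx)
print(loss_no_recovery, loss_perfect)
```

During pre-training, the encoder and decoder would be optimized to push the masked-position error from the first value toward the second, using only the visible context.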

Contrastive-based SSL method. The contrastive-based SSL method adopts the ‘comparison’ technique, which encourages similar data points to be closer in the representation space while pushing dissimilar data points apart. Augmentation methods are important in contrastive-based SSL: input samples are augmented to create negative and positive sample pairs, where the positive pairs represent similar samples, and negative pairs refer to vastly dissimilar samples (Wang and Qi, 2022). By optimizing the designed contrastive loss, the model minimizes the distance between positive pairs and maximizes the distance between negative pairs to extract invariant features and transferable representations. Based on the theory of information bottleneck (Tishby et al., 2000) and mutual information (Kraskov et al., 2004), the InfoNCE loss (Hjelm et al., 2018) is proposed to efficiently learn representations where positive pairs are closer together in the feature space compared to negative pairs. Besides, SimCLR (Chen et al., 2020), MoCo (He et al., 2020), and other contrastive learning methods have become important frameworks driving the development of computer vision.
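To make the contrastive objective concrete, the following sketch evaluates an InfoNCE-style loss for one anchor: the negative log of the softmax score assigned to the positive pair among one positive and several negatives. Cosine similarity, the temperature of 0.1, and the two-dimensional toy embeddings are assumptions for the example; real frameworks compute this over batches of encoder outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax score of the positive pair."""
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - logits[0]


anchor = [1.0, 0.0]
# Well-aligned case: the positive is close to the anchor, the negatives are far away.
loss_aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Misaligned case: the "positive" is orthogonal and a negative sits next to the anchor.
loss_misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(loss_aligned, loss_misaligned)
```

The loss is small when the positive pair is the most similar pair and large otherwise, so minimizing it pulls positive pairs together and pushes negatives apart in the representation space.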

Hybrid SSL method. The hybrid SSL method combines multiple SSL techniques or tasks to create a powerful framework for learning representations. The main idea is to leverage the strengths of different pretext tasks to capture diverse and informative features from unlabeled samples. The weighted fusion of losses from multiple pretext tasks enables the model to grasp multi-dimensional knowledge. It is particularly valuable when data is heterogeneous or a single pretext task may not capture all the relevant information in the unlabeled samples (Liu et al., 2022).

Following the taxonomy of typical SSL frameworks in vision and language fields, this paper categorizes self-supervised EEG analysis methods into predictive, generative, contrastive, and hybrid frameworks. Comprehensive summaries of the different methods are provided in Sections 3 to 6. The structure of this survey is visualized in Figure 3.

(a) The framework of the spatial predictive method to predict channel augmentation techniques applied to EEG.
(b) The framework of the temporal predictive method to predict different temporal augmentation techniques applied to EEG.
Figure 4. The comparison of spatial predictive and temporal predictive SSL EEG analysis methods

3. Predictive-based SSL EEG Analysis Method

The predictive-based SSL EEG analysis method aims to design and execute classification pretext tasks to acquire domain-specific knowledge beneficial for various downstream tasks. Multi-channel EEG signals present distinctive characteristics, including high temporal density, pronounced temporal dependencies, and intricate inter-channel correlations, indicating the presence of critical features within the temporal, frequency, and spatial domains of EEG data. Accordingly, pretext tasks are implemented to distinguish EEG samples that are augmented through temporal, frequency, and spatial processing to acquire features from different domains. We therefore categorize the existing studies into three sub-categories: (1) spatial predictive methods, (2) temporal predictive methods, and (3) transformation predictive methods. The typical frameworks of the three kinds of methods are shown in Figure 4 and Figure 5, and the summary of existing works is listed in Table 1.

3.1. Spatial Predictive Method

The spatial predictive method draws inspiration from SSL in the image domain, establishing local or global spatial-structure-related pretext tasks to help the model comprehend spatial contextual information. Figure 4a shows the typical spatial predictive framework for EEG analysis, and different methods have been investigated to extract channel correlation and brain structure, which are listed as follows:

EEG jigsaw task (Li et al., 2022a) is analogous to the image jigsaw pretext task in CV. It randomly shuffles the EEG channels and expects the model to reconstruct the original sequence of the scrambled channels or to predict the order in which the channels were shuffled. For example, assume the raw EEG data $X_{sp}\in\mathbb{R}^{c\times t}$, where $c$ is the number of channels and $t$ is the number of sampling points. The random shuffling operation then produces a permuted EEG matrix $X^{*}_{sp}\in\mathbb{R}^{c\times t}$, where the temporal information remains unchanged, but the channel order has been shuffled. The loss function of the jigsaw task can be described as follows:

(4) $$\mathcal{L}(X^{*}_{sp},Y^{*})=-\sum_{i=1}^{N}Y^{*}_{i}\log(g_{\delta}^{p}(f_{\theta}(X^{*}_{sp})))$$

where $Y^{*}$ represents the one-hot pseudo-label (channel order), and $N$ represents the batch size. The loss function calculates the cross-entropy between the predicted order of the shuffled EEG sample and its corresponding label. By minimizing this pretext loss, the model is expected to capture spatial features related to the distribution of multi-channel EEG signals across the cortical regions of the brain, which are closely related to downstream tasks such as emotion recognition and seizure detection.
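The construction of a jigsaw sample and the cross-entropy of Eq. (4) can be sketched as follows. This is a minimal illustration, not the implementation of Li et al. (2022a): the function names, the toy sizes, and the choice of enumerating all channel permutations as classes are assumptions made here, and a uniform probability vector stands in for the output of $g_{\delta}^{p}(f_{\theta}(\cdot))$.

```python
import itertools
import numpy as np

def make_jigsaw_sample(x_sp, permutations, rng):
    """Shuffle the channels of x_sp (c x t) with one permutation from a fixed set.

    Returns the permuted matrix and a one-hot pseudo-label over the set.
    """
    k = int(rng.integers(len(permutations)))     # index of the chosen permutation
    x_perm = x_sp[list(permutations[k]), :]      # reorder rows; time axis untouched
    y = np.zeros(len(permutations))
    y[k] = 1.0
    return x_perm, y

def cross_entropy(y_onehot, probs, eps=1e-12):
    """Cross-entropy summed over the batch, following Eq. (4)."""
    return -np.sum(y_onehot * np.log(probs + eps))

rng = np.random.default_rng(0)
c, t = 4, 256                                    # toy sizes (assumed)
x_sp = rng.standard_normal((c, t))
perms = list(itertools.permutations(range(c)))   # 4! = 24 candidate channel orders
x_perm, y = make_jigsaw_sample(x_sp, perms, rng)

# Stand-in for g(f(x_perm)): uniform class probabilities from an untrained model.
probs = np.full(len(perms), 1.0 / len(perms))
loss = cross_entropy(y[None, :], probs[None, :])
```

Enumerating every permutation is only feasible for a handful of channels; with realistic channel counts a fixed subset of permutations would have to be chosen as the label space.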

Channel correlation prediction (Cai et al., 2023) is designed to capture the spatial correlation between different channels. It builds on the observation that EEG signals propagate between channels in distinct brain regions with a time delay: a signal is delayed when it travels from one region to a distant one, and exploring these features enables the model to understand the activation modes and information exchange patterns of brain activity. In this task, pseudo labels are generated from the signal correlation between channels, which can be calculated as follows:

$$Y(i,j,t_{1},t_{2})=\begin{cases}1, & \mathrm{cossim}(X_{i}(t_{1}),X_{j}(t_{2}))\geq\kappa\\ 0, & \mathrm{cossim}(X_{i}(t_{1}),X_{j}(t_{2}))<\kappa\end{cases} \tag{5}$$

This function calculates the cosine similarity between the $i$-th channel at time slice $t_{1}$ and the $j$-th channel at time slice $t_{2}$. Here $\mathrm{cossim}$ denotes the cosine similarity, and $\kappa$ is the predefined threshold that determines whether the two slices are correlated and thus which pseudo label is assigned. The binary cross-entropy loss can be used as the prediction loss to pre-train the model, where the predictions are generated by the encoder and pretext task decoder structure: $g_{\delta}^{p}(f_{\theta}(X_{1},X_{2}))$.
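The pseudo-label rule of Eq. (5) can be illustrated with a small sketch. The window length `win`, the threshold value, and the toy two-channel signal (the second channel is a delayed copy of the first) are assumptions chosen for illustration and are not taken from Cai et al. (2023).

```python
import numpy as np

def cossim(a, b, eps=1e-12):
    """Cosine similarity between two 1-D slices."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def correlation_label(x, i, j, t1, t2, win, kappa):
    """Eq. (5): 1 if slices X_i(t1) and X_j(t2) have cosine similarity >= kappa."""
    a = x[i, t1:t1 + win]
    b = x[j, t2:t2 + win]
    return 1 if cossim(a, b) >= kappa else 0

# Toy recording: channel 1 is channel 0 delayed by `delay` samples.
n, delay = 200, 5
ch0 = np.sin(2 * np.pi * np.arange(n) / 16)
ch1 = np.roll(ch0, delay)                 # ch1[t] = ch0[t - delay] away from the edge
x = np.stack([ch0, ch1])

aligned = correlation_label(x, 0, 1, t1=10, t2=10 + delay, win=64, kappa=0.9)
misaligned = correlation_label(x, 0, 1, t1=10, t2=10, win=64, kappa=0.9)
```

In this toy case the pair is labeled as correlated only when the second slice is offset by the propagation delay, which is the kind of delayed dependency the task is meant to expose.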

Replace discriminative task (Cai et al., 2023) is a binary classification task that extracts channel-specific differential features by identifying distinct components from different channels. In this task, a certain percentage $p_{r}\%$ of the original EEG signal $X_{i}$ is randomly replaced with a signal $\overline{X_{i}}$ sampled from any channel and time slice. The pseudo labels are constructed to indicate whether the current sample has been replaced, which can be described as follows:

$$Y(X_{i})=\begin{cases}1, & f_{I}(X_{i},i)=0\\ 0, & f_{I}(X_{i},i)\neq 0\end{cases} \tag{6}$$

where $f_{I}(X_{i},i)$ is the function that judges whether the signal $X_{i}$ has been replaced. Subsequently, by minimizing the binary cross-entropy pretext task loss, the model learns the distinctive spatial features of different channels and retains essential information beneficial for various downstream tasks.

3.2. Temporal Predictive Method

The temporal predictive methods aim to capture the temporal correlation and sequential dependencies in EEG signals. Because EEG is a temporal physiological signal, temporal characteristics play an important role in various EEG-based tasks. Figure 4b shows a typical framework of temporal predictive SSL for EEG analysis. Different temporal predictive pretext tasks have been proposed to investigate potential temporal information, and they are summarized as follows:

Relative positioning task is a temporal predictive method that distinguishes whether two EEG segments are close or distant in the time dimension (Banville et al., 2021). This task first constructs an EEG pair $X_{t_{i}},X_{t^{\prime}_{i}}\in\mathbb{R}^{c\times t}$, which represents two sampled EEG segments. Representations of EEG signals should change slowly over time, which means that EEG segments proximate in the time dimension convey similar information, while those further apart exhibit significant dissimilarities (Banville et al., 2021). The parameter $\tau_{pos}$ controls the duration of the positive context. For two EEG segments $X_{t_{i}}$ and $X_{t^{\prime}_{i}}$, a temporal interval $|t_{i}-t^{\prime}_{i}|\leq\tau_{pos}$ indicates that the segments lie within the positive duration and share common underlying labels. Therefore, the pseudo labels of the relative positioning task can be constructed as follows:

$$Y(X_{t_{i}},X_{t^{\prime}_{i}})=\begin{cases}1, & |t_{i}-t^{\prime}_{i}|\leq\tau_{pos}\\ -1, & |t_{i}-t^{\prime}_{i}|>\tau_{pos}\end{cases} \tag{7}$$

where $-1$ and $1$ represent samples of negative and positive duration. The training samples $S_{N}=\{(X_{t_{i}},X_{t^{\prime}_{i}}),Y(X_{t_{i}},X_{t^{\prime}_{i}})\}$ can be used to train the model with a binary classification loss to capture temporal information. This method has been widely used for continuous EEG classification tasks such as sleep stage classification (Banville et al., 2021).
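A minimal sketch of how labeled pairs could be drawn under Eq. (7) is given below. Treating $t_{i}$ as the start index of a window, sampling start indices uniformly, and the names `sample_pairs` and `win` are assumptions made for illustration rather than the sampling scheme of Banville et al. (2021).

```python
import numpy as np

def relative_positioning_label(t_i, t_i_prime, tau_pos):
    """Eq. (7): 1 inside the positive context, -1 otherwise."""
    return 1 if abs(t_i - t_i_prime) <= tau_pos else -1

def sample_pairs(recording, win, tau_pos, n_pairs, rng):
    """Draw window pairs from a (c x T) recording and label them with Eq. (7)."""
    _, total = recording.shape
    starts = rng.integers(0, total - win + 1, size=(n_pairs, 2))
    pairs = []
    for t_a, t_b in starts:
        y = relative_positioning_label(int(t_a), int(t_b), tau_pos)
        pairs.append((recording[:, t_a:t_a + win], recording[:, t_b:t_b + win], y))
    return pairs

rng = np.random.default_rng(0)
recording = rng.standard_normal((2, 5000))        # toy 2-channel recording
pairs = sample_pairs(recording, win=200, tau_pos=500, n_pairs=16, rng=rng)
```

Uniform sampling tends to produce more negative than positive pairs when $\tau_{pos}$ is small relative to the recording length, so a practical sampler would likely balance the two classes.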

Temporal shuffling is considered a variation of the relative positioning task (Banville et al., 2021). The temporal shuffling task first samples two different EEG segments $X_{t_{i}},X_{t^{\prime}_{i}}$ from the positive duration, and then samples another EEG segment $X_{t^{\prime\prime}_{i}}$ either between the first two segments or in the negative duration. The three segments form the triplet $(X_{t_{i}},X_{t^{\prime}_{i}},X_{t^{\prime\prime}_{i}})$. The shuffling operation then randomly permutes the order of the segments in the triplet. The pseudo labels indicate whether the triplet has been shuffled: 0 for a shuffled triplet and 1 for a normal triplet. The model then learns to distinguish whether the triplet has been shuffled from the concatenated differential features between segments, which can be calculated as follows:

$$D(X_{t_{i}},X_{t^{\prime}_{i}},X_{t^{\prime\prime}_{i}})=\mathrm{concat}(|f_{\theta}(X_{t_{i}})-f_{\theta}(X_{t^{\prime}_{i}})|,\ |f_{\theta}(X_{t^{\prime}_{i}})-f_{\theta}(X_{t^{\prime\prime}_{i}})|) \tag{8}$$

where $\mathrm{concat}$ is the vector concatenation operation. The model performs the shuffling classification using the differential encoded information from different segments as features, which helps it comprehend temporal dependencies within EEG signals. In addition, another temporal shuffling method (Ou et al., 2022) divides the EEG slice into three equidistant sequences and randomly shuffles their order to form the shuffled sample. The model is then asked to predict the order of the shuffled input to capture temporal correlation. Therefore, for shuffled EEG signals, both binary classification (predicting whether they have been shuffled) and multi-class classification (predicting the order of the shuffled signals) can serve as the pretext task to extract temporal features of physiological signals at different granularities.
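The differential feature of Eq. (8) can be sketched as follows. The `toy_encoder` is a hypothetical stand-in for $f_{\theta}$ (per-channel mean and standard deviation) introduced only so that the example runs; any encoder that maps a segment to a vector could take its place.

```python
import numpy as np

def toy_encoder(x):
    """Hypothetical stand-in for f_theta: per-channel mean and standard deviation."""
    return np.concatenate([x.mean(axis=1), x.std(axis=1)])

def shuffle_features(x_a, x_b, x_c, encoder=toy_encoder):
    """Eq. (8): concat(|f(x_a) - f(x_b)|, |f(x_b) - f(x_c)|)."""
    h_a, h_b, h_c = encoder(x_a), encoder(x_b), encoder(x_c)
    return np.concatenate([np.abs(h_a - h_b), np.abs(h_b - h_c)])

rng = np.random.default_rng(0)
segments = [rng.standard_normal((4, 100)) for _ in range(3)]   # toy triplet, 4 channels
d = shuffle_features(*segments)          # input to the binary "shuffled?" classifier

# Tiny hand-checkable case with an identity encoder on plain vectors.
d_small = shuffle_features(np.array([1.0, 2.0]), np.array([3.0, 1.0]),
                           np.array([0.0, 0.0]), encoder=lambda v: v)
```

The feature vector has twice the dimension of the encoder output, and a binary classifier on top of it predicts whether the triplet was shuffled.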

Temporal trend prediction (Ko and Suk, 2022) is a task that identifies the potential trends of EEG to capture short-term and long-term dynamic patterns. This task divides EEG signals into three categories according to their temporal characteristics: stationary, trend-stationary, and cyclostationary. By learning to identify temporal trends, the model can comprehend the temporal relationships within signals and capture both global and local essential waveform information to generate temporally enriched representations, which can benefit a variety of downstream tasks, such as sleep stage classification.

Time shift prediction (Accou et al., 2023) is a task that predicts the time shift applied to the EEG signals by contrasting the features of the raw EEG signal with those of shifted signals. In this task, the raw EEG signal $X_{t_{i}}$ is paired with an augmented EEG signal $X_{t_{i}+\rho}$ obtained by applying a $\rho$-step time shift to the raw EEG signal. The raw signal and the shifted signal are encoded into representations, and the pretext task uses a classifier to analyze the difference between the two representations and determine how much the raw signal was shifted. By minimizing the classification loss, the encoder can learn temporally aligned features and dependencies within EEG signals, generating representations with rich time information that are beneficial for long-term EEG tasks such as clinical monitoring.
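One possible way to build a training sample for this task is sketched below. The discrete candidate set `shifts`, the window length, and the function name are assumptions made for illustration; the source does not specify how Accou et al. (2023) choose or discretize $\rho$.

```python
import numpy as np

def make_time_shift_sample(recording, t_i, win, shifts, rng):
    """Return (raw window, shifted window, class index of the applied shift)."""
    k = int(rng.integers(len(shifts)))        # pseudo-label: which shift was applied
    rho = shifts[k]
    x_raw = recording[:, t_i:t_i + win]
    x_shift = recording[:, t_i + rho:t_i + rho + win]
    return x_raw, x_shift, k

rng = np.random.default_rng(0)
recording = rng.standard_normal((8, 2000))    # toy recording with 8 channels
shifts = [0, 16, 32, 64]                      # candidate rho values (assumed)
x_raw, x_shift, k = make_time_shift_sample(recording, 500, 256, shifts, rng)
```

Both windows would then be passed through the shared encoder, and a classifier over the `len(shifts)` classes would be trained on the pair of representations.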

3.3. Transformation Predictive Method

Figure 5 shows the general process of the transformation predictive method. This task aims to predict the specific transformations applied to the EEG signals in order to learn signal-related features in the time-frequency domain. The EEG transformation techniques that are used to augment EEG samples, and that the model must recognize, are listed as follows:


Figure 5. The general process of transformation predictive method for EEG analysis. Different signal transformation techniques are applied to EEG signals to generate augmented samples and pseudo labels. The model can capture critical signal-level features for downstream tasks by correctly predicting the transformation method.

Stopped band prediction randomly removes a specific frequency band from the EEG signals and forces the model to predict the index of the removed band to learn frequency-related features (Jo et al., 2023). EEG signals comprise information from multiple frequency bands, with essential information concentrated within the frequency range of 1 to 50 Hz, encompassing five frequency bands: $\delta$ (0.5-4 Hz), $\theta$ (4-8 Hz), $\alpha$ (8-12 Hz), $\beta$ (12-30 Hz), and $\gamma$ (30-50 Hz). This task transforms the EEG signal from the time domain to the frequency domain, removes a randomly chosen frequency band, and maps the signal back to the time domain. The pseudo labels $Y(x_{i})\in\{0,1,2,3,4\}$ represent the index of the removed band. By forcing the model to predict the stopped band from the encoded representation $f_{\theta}(X_{i})$, the encoder $f_{\theta}$ can learn efficient frequency correlation and features to form the temporal-frequency representation.
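The band removal step can be sketched with a plain FFT mask. This is an illustrative assumption: the source states only that the signal is moved to the frequency domain and back, so the use of `numpy.fft`, the half-open band edges, and the toy sampling rate are choices made here and a practical pipeline might use a band-stop filter instead.

```python
import numpy as np

# Band edges in Hz, taken from the five bands listed above (delta .. gamma).
BANDS = [(0.5, 4.0), (4.0, 8.0), (8.0, 12.0), (12.0, 30.0), (30.0, 50.0)]

def remove_band(x, fs, band_idx):
    """Zero one band in the frequency domain and map back to the time domain."""
    lo, hi = BANDS[band_idx]
    spec = np.fft.rfft(x, axis=-1)
    freqs = np.fft.rfftfreq(x.shape[-1], d=1.0 / fs)
    spec[..., (freqs >= lo) & (freqs < hi)] = 0.0     # half-open band [lo, hi)
    return np.fft.irfft(spec, n=x.shape[-1], axis=-1)

fs, n = 128, 256                                      # toy sampling rate and length
t = np.arange(n) / fs
alpha_wave = np.sin(2 * np.pi * 10 * t)               # 10 Hz, inside the alpha band
beta_wave = np.sin(2 * np.pi * 20 * t)                # 20 Hz, inside the beta band
x = alpha_wave + beta_wave

rng = np.random.default_rng(0)
band_idx = int(rng.integers(len(BANDS)))              # pseudo-label in {0, ..., 4}
x_stopped = remove_band(x, fs, band_idx)
```

In the toy signal, removing the alpha band leaves only the 20 Hz component, and `band_idx` is the target the classifier has to recover from the encoded representation.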

Multi-transformation recognition aims to predict the transformation technique used to augment EEG signals in order to extract fine-grained signal features and form effective representations (Wang et al., 2023). In this task, each EEG signal is augmented by one transformation technique, and the model is asked to recognize which technique was applied. The common encoder $f_{\theta}$ extracts features from the augmented EEG signals and encodes them into a representation, which is followed by multiple binary classifiers that recognize the different transformation methods. Each classifier corresponds to a specific transformation method and determines whether it occurred. Six transformation methods are proposed to be recognized:

(1) Noise adding adds random noise drawn from a Gaussian distribution $\mathcal{N}(\mu,\sigma^{2})$. The noise $NS_{i}$ is added directly to the original signal $X_{i}$, resulting in a noise-augmented signal $X^{ns}_{i}$.

(2) Scale transformation alters the waveform of the EEG signals. The amplitude of the EEG signal is stretched or compressed by a scale factor $\alpha$, so the scale-augmented signal can be expressed as $X^{st}_{i}=\alpha*X_{i}$.

(3) Horizontal flipping directly flips the EEG signal horizontally. The horizontal-augmented signal can be expressed as $X^{h}_{i}=-X_{i}$.

(4) Vertical flipping flips the EEG signal (each sample) vertically. The vertical-augmented signal can be described as $X^{v}_{i}=flip(X_{i})$, where $flip$ represents the segment's vertical symmetric flip.

(5) Temporal dislocation is consistent with the temporal shuffling in predictive methods. This method divides EEG segments into sub-segments and randomly shuffles these sub-segments to form the dislocation-augmented signal $X^{td}_{i}$.

(6) Time warping randomly stretches and compresses sub-segments to form the augmented samples. This method randomly selects sub-segments to stretch or compress and then reassembles all the sub-segments to construct the warping-augmented signal $X^{tw}_{i}$ with the same dimension as the original EEG signal.

By recognizing different transformation techniques, the model can generate the representation that captures temporal dependencies, frequency correlation, and time-frequency correspondences within EEG signals from unlabeled samples.
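The six transformations and the per-transformation binary targets can be sketched for a single-channel signal as follows. All parameter values are assumptions made for illustration. In particular, the vertical flip is interpreted here as a reversal along the time axis, which is only one possible reading of $flip(X_{i})$, and the time warping uses linear interpolation as a simple stand-in for the recombination step.

```python
import numpy as np

def add_noise(x, rng, mu=0.0, sigma=0.1):
    return x + rng.normal(mu, sigma, size=x.shape)

def scale(x, alpha=1.5):
    return alpha * x

def horizontal_flip(x):
    return -x                                   # X^h = -X, as in item (3)

def vertical_flip(x):
    return x[::-1].copy()                       # assumed reading: reverse in time

def temporal_dislocation(x, rng, n_sub=4):
    parts = np.array_split(x, n_sub)
    order = rng.permutation(n_sub)
    return np.concatenate([parts[k] for k in order])

def time_warp(x, rng, n_sub=4, low=0.5, high=2.0):
    """Stretch or compress each sub-segment, then resample to the original length."""
    warped = []
    for p in np.array_split(x, n_sub):
        new_len = max(2, int(round(len(p) * rng.uniform(low, high))))
        grid = np.linspace(0, len(p) - 1, new_len)
        warped.append(np.interp(grid, np.arange(len(p)), p))
    y = np.concatenate(warped)
    grid = np.linspace(0, len(y) - 1, len(x))
    return np.interp(grid, np.arange(len(y)), y)

TRANSFORMS = [
    lambda x, rng: add_noise(x, rng),
    lambda x, rng: scale(x),
    lambda x, rng: horizontal_flip(x),
    lambda x, rng: vertical_flip(x),
    lambda x, rng: temporal_dislocation(x, rng),
    lambda x, rng: time_warp(x, rng),
]

def make_sample(x, rng):
    """Apply one random transformation; return the signal and six binary targets."""
    k = int(rng.integers(len(TRANSFORMS)))
    y = np.zeros(len(TRANSFORMS), dtype=int)
    y[k] = 1                                    # target for the k-th binary classifier
    return TRANSFORMS[k](x, rng), y

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 8 * np.pi, 256))      # toy single-channel segment
x_aug, y = make_sample(x, rng)
```

Each entry of `y` is the target of one binary classifier, matching the description of one classifier per transformation method.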

Table 1. The summarization of predictive-based EEG analysis self-supervised learning methods. In the "Training Mode" column, "PT" represents pre-training and fine-tuning mode, "UT" represents unsupervised training mode, and "CT" represents joint-training mode.

| Approach | Type of pretext method | Detailed method | Backbone | Downstream Tasks | Training Mode |
|---|---|---|---|---|---|
| EEG scaling SSL (Xu et al., 2020) | transformation predictive | scaling prediction | SVM | epileptic classification | PT |
| Transformation SSL (Wang et al., 2023) | transformation predictive | multi-transformation | CNN | emotion recognition | UT |
| Task-agnostic SSL (Partovi et al., 2023) | transformation predictive | multi-transformation | CNN | seizure & motor imagery | UT |
| Temporal EEG SSL (Gramfort et al., 2021) | temporal predictive | relative position | - | sleep & pathology prediction | PT |
| SSL-EED AD (Zheng et al., 2022) | transformation predictive | noise classification | CNN, SVM | pathology prediction | UT |
| Speech-EEG SSL (Accou et al., 2023) | temporal predictive | temporal-shift | CNN | speech decoding | UT |
| SSL MI-EEG (Ou et al., 2022) | temporal predictive | temporal shuffling | CNN | motor imagery | PT |
| MtCLSS (Li et al., 2022b) | transformation predictive | multi-transformation | CNN | pediatric sleep classification | CT |
| MM Emotion (Montero Quispe et al., 2022) | transformation predictive | multi-transformation | CNN | emotion recognition | PT |
| SSTSC (Xi et al., 2022) | transformation predictive | relative position | CNN | seizure detection | PT |
| Clinical EEG SSL (Banville et al., 2021) | temporal predictive | relative position, temporal shuffling | CNN | sleep classification, pathology classification | PT |
| SSL for sleep EEG (Banville et al., 2019) | temporal predictive | relative position, temporal shuffling | CNN | sleep classification | PT |
| EEG-oriented SSL (Ko and Suk, 2022) | transformation predictive, temporal predictive | band-stop prediction, temporal-trend | CNN | sleep & pathology, motor imagery | PT |
| Robust EEG SSL (Jo et al., 2023) | transformation predictive, temporal predictive | band-stop prediction, temporal-trend | CNN | sleep & pathology | PT |
| MBrain (Cai et al., 2023) | spatial predictive | channel correlation, replace discriminative | CNN, LSTM | seizure detection | PT |

3.4. Section Discussion

This section reviews the predictive-based EEG analysis methods, which we categorize into three sub-categories: spatial predictive, temporal predictive, and transformation predictive methods. The spatial predictive tasks focus on exploring channel-correlation features. In contrast, the temporal predictive tasks incorporate rich temporal dependency features, time-correlation features, and consistent temporal information into the representation. The transformation predictive task helps the model extract temporal-frequency aligned features by recognizing typical signal transformation techniques. These pretext tasks are simple to accomplish: the encoder $f_{\theta}$ can be a CNN or LSTM, and the pretext task decoder $g_{\delta}^{p}$ can be a simple feed-forward neural network or a traditional machine learning classifier. The predictive tasks require only a few parameters and no complex network architectures, but they may struggle to learn general representations for downstream tasks.

4. Generative-based SSL EEG analysis method

Different from predictive methods, generative-based SSL EEG analysis methods are more complex and challenging. The critical terms of this method are "Reconstruction" and "Generation," where the fine-grained correlation and features can be captured through the pretext task. "Reconstruction" means reconstructing the masked or transformed samples to learn effective representations, and "Generation" means generating specific context to train the model to learn specific knowledge. In EEG analysis, the generative-based SSL method adopts signal reconstruction and generative-adversarial tasks as the pretext tasks, which can be categorized into three independent sub-categories according to the task target: (1) Temporal reconstruction task, (2) Multi-domain reconstruction task, and (3) Generative adversarial task. The typical frameworks of the three kinds of methods are demonstrated in Figure 6 and Figure 7, and the summary of existing works is listed in Table 2.

Figure 6. The frameworks of temporal reconstruction and multi-domain reconstruction SSL EEG analysis methods. (a) The typical framework of the temporal reconstruction method, where EEG signals are randomly masked and the model is required to reconstruct raw EEG signals to extract signal contextual features. (b) The typical framework of the multi-domain reconstruction method, where the frequency-temporal features of EEG signals are randomly masked and the model is required to reconstruct the features to capture multi-domain correlation in EEG signals.

4.1. Temporal Reconstruction Task

The framework of the temporal reconstruction task is shown in Figure 6a. It is inspired by the autoencoder method (Zhai et al., 2018), which reconstructs the input data to capture contextual features without the need for human-labeled samples. The reconstruction task enables the encoder to learn fine-grained input correlation and thus to generate representations containing rich contextual information. Because the EEG signal is serialized temporal physiological data, it is well suited to the temporal reconstruction pretext task (Devlin et al., 2018), which learns signal contextual correlation, enhances the understanding of temporal dependencies, and provides effective representations for various EEG-based downstream tasks. Different temporal reconstruction tasks are listed as follows:

EEG-based autoencoder (Huang et al., 2023) is the adaptation of the autoencoder for EEG analysis (Hinton and Salakhutdinov, 2006). In this method, EEG signals are encoded into a low-dimensional representation by the encoder $f_{\theta}$, and the low-dimensional representation is then used to reconstruct the original signal through the pretext task decoder $g_{\delta}^{p}$, which is symmetrical to the encoder. The encoder is responsible for preserving critical EEG signal information, while the decoder is responsible for reconstructing the EEG signal from the generated representation. The reconstruction loss can be calculated as follows:

$$\mathcal{L}(X,X^{*})=\frac{1}{len(X)}\|X-X^{*}\|_{1} \tag{9}$$

where $len(X)$ represents the length of the input signal and $\|\cdot\|_{1}$ represents the L1-norm. $X$ is the EEG signal and $X^{*}$ is the reconstructed signal. By minimizing the difference between the original and reconstructed signals, the encoder preserves the critical information necessary for signal recovery, which can be considered a form of signal compression.
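Eq. (9) amounts to a mean absolute error between the signal and its reconstruction, as the following sketch shows. Taking $len(X)$ to be the total number of elements of the array is an assumption made here for multi-channel input; for a single-channel signal it coincides with the signal length.

```python
import numpy as np

def l1_reconstruction_loss(x, x_rec):
    """Eq. (9): L1 norm of the reconstruction error divided by the signal length."""
    return float(np.sum(np.abs(x - x_rec)) / x.size)

x = np.array([1.0, 2.0, 3.0, 4.0])          # toy single-channel signal
x_rec = np.array([1.0, 2.0, 3.0, 2.0])      # reconstruction with one wrong sample
loss = l1_reconstruction_loss(x, x_rec)     # |4 - 2| / 4
```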

Signal-level mask-reconstruction (signal MAE) is the typical mask-reconstruction method, which captures temporal signal correlation in order to reconstruct the masked segments (Chien et al., 2022). In this framework, multi-channel EEG signals are first encoded into temporal embeddings through a 1D convolution block similar to the well-known wav2vec algorithm (Schneider et al., 2019; Baevski et al., 2020). The high-dimensional EEG signals $X$ are downsampled and compressed into low-dimensional feature embeddings $Z=\{z_{1},z_{2},z_{3},...,z_{k}\}$ arranged in temporal order, and the stride of the convolution block determines the number of time steps fed to the encoder. The mask $M_{i}$ is generated to randomly replace parts of the information in the embedding $Z$, creating the masked embedding $Z^{m}$ with local information dropout. The transformer encoders are then applied to extract the bidirectional temporal correlation between input slices and output the signal representation $Re=\{re_{1},re_{2},...,re_{k}\}$. The convolution block and the transformer encoder are cascaded as the encoder $f_{\theta}$ to generate the signal representation, followed by linear and convolution layers as the pretext task decoder $g_{\delta}^{p}$, which reconstructs the raw signal $X$ from the masked signal representation $Re$. The training process minimizes the cosine similarity loss shown as follows:

(10) $\mathcal{L}(X_{m},X)=1-\frac{X_{m}\cdot X}{|X_{m}||X|}$

where $X_{m}$ is the reconstructed EEG signal. This method reconstructs the raw signal from the masked signal representation, with the original EEG signals serving as the pseudo-labels. In this architecture, the encoder $f_{\theta}$ is responsible for mining temporal correlation and preserving critical signal information, while the pretext decoder $g_{\delta}^{p}$ is responsible for reconstructing the original EEG signals. Therefore, this framework forces the encoder to generate representations containing fine-grained correlation and critical signal information, which exhibit strong expressiveness, generalization, and applicability across various EEG-based tasks.
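The masking step and the loss in Equation (10) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of the cited work: the helper names `mask_embeddings` and `cosine_loss`, the zero mask vector, and the toy sine signal are all hypothetical, and the encoder and decoder networks are omitted entirely.

```python
import math
import random

def mask_embeddings(z, mask_ratio, rng, mask_value=0.0):
    """Replace a random subset of the k time-step embeddings with a constant mask vector."""
    k = len(z)
    n_mask = max(1, round(mask_ratio * k))
    masked_idx = sorted(rng.sample(range(k), n_mask))
    z_m = [list(step) for step in z]  # copy so the input stays untouched
    for i in masked_idx:
        z_m[i] = [mask_value] * len(z[i])
    return z_m, masked_idx

def cosine_loss(x_rec, x):
    """Equation (10): 1 - (x_rec . x) / (|x_rec| |x|), on flattened signals."""
    dot = sum(a * b for a, b in zip(x_rec, x))
    norm_rec = math.sqrt(sum(a * a for a in x_rec))
    norm_x = math.sqrt(sum(b * b for b in x))
    return 1.0 - dot / (norm_rec * norm_x)

rng = random.Random(0)
# Assumed toy sizes: k = 10 embeddings of dimension 4.
z = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(10)]
z_m, masked_idx = mask_embeddings(z, mask_ratio=0.3, rng=rng)

x = [math.sin(0.1 * t) for t in range(100)]      # toy single-channel signal
perfect = cosine_loss(x, x)                      # identical reconstruction
scaled = cosine_loss([2.0 * v for v in x], x)    # same shape, doubled amplitude
flipped = cosine_loss([-v for v in x], x)        # inverted polarity
```

One property worth noting is that this loss depends only on the angle between the two signals: a reconstruction with the correct waveform but the wrong amplitude still reaches a loss of zero, while an inverted waveform reaches the maximum of two.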

Embedding-level mask-reconstruction (embedding MAE) is another mask-reconstruction method, inspired by the BERT model (Devlin et al., 2018) in the language domain, that fuses the contextual relationship into the representation (Kostas et al., 2021). Similar to the signal MAE framework described above, EEG signals are first transformed into temporal embeddings through the convolution block and then encoded by the transformer encoder to generate signal representations, followed by the pretext task decoder to accomplish the reconstruction task. However, unlike the signal MAE, this task is based on embedding-level reconstruction: the transformed EEG embedding $Z$ is randomly masked by the generated mask vector, where $z^{*}$ represents the randomly selected embeddings to be masked. The pretext task decoder is required to reconstruct the masked embedding rather than the raw EEG signals. The contrastive loss function is designed to make the reconstructed embedding $z^{*}_{pre}$ as similar as possible to the original unmasked embedding $z^{*}$ while keeping it as dissimilar as possible from the remaining embeddings, which can be calculated as follows:

(11) $\mathcal{L}(z^{*},z^{*}_{pre})=-\log\frac{\exp(sim(z^{*}_{pre},z^{*})/\eta)}{\sum_{z_{r_{i}}\in Z}\exp(sim(z^{*}_{pre},z_{r_{i}})/\eta)}$

where $z_{r_{i}}$ is the negative sample obtained by random sampling from the contextual embeddings, $sim$ represents the cosine similarity between the reconstructed and original embeddings, and $\eta$ is the temperature parameter that controls the contrastive loss. Compared with the signal MAE, which reconstructs the original signals, the embedding-level reconstruction is simpler and needs fewer parameters to capture critical contextual information and the temporal relationships between EEG embeddings. However, it may also lose some information from the original signal. Combined with the transformer encoder, this task can generate representations for various downstream tasks.
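A small numerical sketch may help make Equation (11) concrete. The function names and the two-dimensional toy embeddings below are hypothetical, and we assume that the candidate set $Z$ in the denominator contains the true embedding $z^{*}$ together with the sampled negatives.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def embedding_contrastive_loss(z_pre, z_target, candidates, eta=0.1):
    """Equation (11). `candidates` plays the role of Z and is assumed to
    include z_target in addition to the negative samples."""
    numerator = math.exp(cosine_sim(z_pre, z_target) / eta)
    denominator = sum(math.exp(cosine_sim(z_pre, z_r) / eta) for z_r in candidates)
    return -math.log(numerator / denominator)

z_target = [1.0, 0.0]                       # original unmasked embedding z*
negatives = [[0.0, 1.0], [-1.0, 0.0]]       # other contextual embeddings
candidates = [z_target] + negatives

loss_good = embedding_contrastive_loss([0.9, 0.1], z_target, candidates)
loss_bad = embedding_contrastive_loss([0.0, 1.0], z_target, candidates)
```

A prediction close to $z^{*}$ yields a loss near zero, whereas a prediction that coincides with a negative sample is penalized heavily. A smaller temperature $\eta$ sharpens this contrast.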

4.2. Multi-domain Reconstruction Method

The multi-domain reconstruction method extends the EEG-based MAE to multiple domains (signal, spatial, and frequency), as shown in Figure 6b. Unlike the temporal reconstruction methods, this method performs collaborative and mutual reconstruction across different domains to extract aligned and complementary spatial-temporal-frequency features from the EEG signal, generating more powerful and general representations that adapt to different tasks. Detailed explanations of the multi-domain reconstruction methods are as follows:

Spatial-temporal-frequency reconstruction (STF MAE) conducts the synergistic reconstruction task in the temporal-frequency-spatial domain to extract integrated EEG features (Chen et al., 2023). The idea of the synergistic reconstruction task is inspired by the time-frequency analysis method (Roach and Mathalon, 2008): the temporal analysis method (Balconi and Lucchiari, 2006; Singh and Malhotra, 2022) investigates the patterns of EEG amplitude changes over time, while the frequency analysis method (Harpale and Bairagi, 2016) studies the frequency energy distribution within EEG signals. The time-frequency analysis method utilizes a sliding time window to investigate temporal changes in frequency spectral features (Zhang et al., 2008). Based on the time-frequency analysis method, this task constructs a 3D matrix as the feature of the EEG signal. Through the continuous wavelet transform (CWT) (Aguiar-Conraria and Soares, 2014), EEG signals are transformed into a 3D frequency-spatial-temporal matrix $X\in\mathbb{R}^{c\times t_{n}\times f_{r}}$, where $c$ is the channel number, $t_{n}$ is the number of temporal slices, and $f_{r}$ represents the frequency feature resolution. This 3D matrix can be regarded as a time-frequency feature map (a 2D image) with multiple channels.
Inspired by the image MAE, the EEG feature matrix is divided into patches and randomly masked with the mask patch $m_{p}$ to generate the masked matrix $X^{*}$. An encoder-decoder structure with the vision transformer (ViT) (Dosovitskiy et al., 2020) as the backbone is designed to reconstruct the EEG feature matrix. The mean squared error (MSE) can be used to train the model, which is defined as follows:

(12) $\mathcal{L}(X^{m},X^{pre})=E(X^{pre}-X^{m})^{2}=\frac{1}{n_{m}}\sum_{i=1}^{n_{m}}(X_{i}^{pre}-X_{i}^{m})^{2}$

where $n_{m}$ is the dimension of the masked features $X^{m}$, and $X^{pre}$ represents the reconstructed features generated by the encoder and pretext task decoder: $X^{pre}=g_{\delta}^{p}(f_{\theta}(X^{*}))$. By minimizing the MSE loss, the encoder $f_{\theta}$ fuses spatial-temporal-frequency contextual correlation into the representation, and the decoder $g_{\delta}^{p}$ learns to reconstruct the original EEG feature matrix from the representation. The generated representations contain multi-domain correlations and features, which exhibit greater expressive ability and a wider range of applications for downstream tasks.
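To illustrate how Equation (12) restricts the error to the masked entries, the following sketch masks patches of a toy feature matrix and evaluates the loss only on them. All names are hypothetical, the patches are flattened lists rather than slices of a real CWT matrix, and the "reconstruction" is a fixed offset standing in for $g_{\delta}^{p}(f_{\theta}(X^{*}))$.

```python
import random

def random_patch_mask(n_patches, mask_ratio, rng):
    """Pick the indices of the patches that will be replaced by the mask patch."""
    n_mask = round(mask_ratio * n_patches)
    return sorted(rng.sample(range(n_patches), n_mask))

def apply_mask(patches, masked_idx, mask_value=0.0):
    """Build X* by overwriting the selected patches with a constant mask patch m_p."""
    masked = set(masked_idx)
    return [[mask_value] * len(p) if i in masked else list(p)
            for i, p in enumerate(patches)]

def masked_mse(patches_true, patches_pred, masked_idx):
    """Equation (12): mean squared error over the n_m masked feature values only."""
    sq_err, n_m = 0.0, 0
    for i in masked_idx:
        for t, p in zip(patches_true[i], patches_pred[i]):
            sq_err += (p - t) ** 2
            n_m += 1
    return sq_err / n_m

rng = random.Random(0)
# Assumed toy input: 8 flattened patches with 4 feature values each.
patches = [[float(4 * i + j) for j in range(4)] for i in range(8)]
masked_idx = random_patch_mask(len(patches), mask_ratio=0.75, rng=rng)
x_star = apply_mask(patches, masked_idx)

# Stand-in for the model output: every value is off by 0.5.
x_pre = [[v + 0.5 for v in p] for p in patches]
loss = masked_mse(patches, x_pre, masked_idx)
```

Because visible patches do not contribute to the loss, the model is not rewarded for copying its input and has to infer the masked content from context.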

Frequency mask-reconstruction (frequency MAE) conducts the mask-reconstruction task in different frequency bands to capture frequency features, long-term dependencies, and critical time-frequency correlated information (Peng et al., 2023). Initially, the EEG signal undergoes two distinct transformations: 1. The EEG signal is directly embedded into a patch sequence through division, linear projection, and flattening operations, representing the EEG temporal patch sequence. 2. The EEG signal is transformed into six independent frequency bands (0-4Hz, 4-8Hz, 8-18Hz, 16-32Hz, 32-64Hz, and other frequencies), representing the EEG frequency patch sequences. Then, 10% of the frequency patch sequences are randomly masked, followed by six independent ViT-based encoders that generate representations for all frequency bands (one encoder per frequency band). Six independent ViT-based decoders are subsequently used to reconstruct the frequency patch sequences. Notably, the target of the pretext task is to minimize the difference between the sum of all reconstructed frequency sequences and the temporal patch sequence, which can be calculated similarly to Equation (12). This task reconstructs the temporal information from the masked frequency patch sequences, which helps the model align temporal and frequency information in the EEG signal and understand their correlation, generating high-dimensional representations with rich time-frequency coherent features and providing valuable EEG features for various EEG-based tasks.
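The reason the sum of the per-band sequences can serve as a target for the temporal sequence is that a band decomposition which partitions the spectrum is additive. The sketch below demonstrates this with ideal (brick-wall) DFT filters, which is an assumption made for illustration and not necessarily the decomposition used in the cited work. The band edges, the 128 Hz sampling rate, and the function names are likewise hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * t * k / n) for t in range(n))
            for k in range(n)]

def idft_real(spec):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * t * k / n) for k in range(n)).real / n
            for t in range(n)]

def split_bands(x, fs, edges):
    """Split x into band-limited signals; band b keeps frequencies in [edges[b], edges[b+1])."""
    n = len(x)
    spec = dft(x)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        kept = []
        for k in range(n):
            freq = min(k, n - k) * fs / n  # bin k and its mirror bin share one frequency
            kept.append(spec[k] if lo <= freq < hi else 0j)
        bands.append(idft_real(kept))
    return bands

fs, n = 128, 64                            # assumed sampling rate (Hz) and segment length
x = [math.sin(2 * math.pi * 6 * t / fs) + 0.5 * math.sin(2 * math.pi * 20 * t / fs)
     for t in range(n)]
edges = [0, 4, 8, 16, 32, fs / 2 + 1]      # illustrative, non-overlapping band edges
bands = split_bands(x, fs, edges)

# Summing the band-limited signals recovers the temporal signal.
recon = [sum(band[t] for band in bands) for t in range(n)]
max_err = max(abs(r - v) for r, v in zip(recon, x))
```

In this toy case the 6 Hz component lands entirely in the second band and the 20 Hz component in the fourth, and their sum reproduces the input up to floating-point error.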

Frequency-temporal reconstruction (FT MAE) is a framework that reconstructs the masked EEG representations in the frequency and time domains (Wu et al., 2022). This framework transforms EEG signals into discrete patches through a non-overlapping 1D-CNN, with some patches randomly masked with ratio $r$. A ViT-based encoder is subsequently employed to generate representations, followed by a symmetric decoder to reconstruct the masked patches. Two reconstruction methods are proposed in the framework. The first is the spatiotemporal domain reconstruction, where the decoder reconstructs the masked patches directly, with the MSE loss function used to train the model and capture the temporal correlations in the EEG signal. The second is the Fourier domain reconstruction, which reconstructs masked patches in the frequency domain. Through the Discrete Fourier Transform (DFT) (Bagchi and Mitra, 2012), EEG signals can be transformed from the time domain to the frequency domain:

(13) $x^{f}_{k}=\sum_{j=1}^{n}x_{j}\left[\cos\left(\frac{2\pi}{n}jk\right)-\mathbf{i}\sin\left(\frac{2\pi}{n}jk\right)\right]$

where $k\in(0,n)$, $n$ is the number of sampling points in the EEG segment, $x_{j}$ is the temporal amplitude at sampling point $j$, and $\mathbf{i}$ represents the imaginary unit. $x^{f}_{k}$ represents the generated spectrum feature at frequency index $k$. The first term in this equation represents the real part of the result, and the second term represents the imaginary part. Then, the magnitude and phase of the frequency signal can be calculated as follows:

(14) $\left\{\begin{aligned} magnitude_{k}&=\frac{1}{n}\sqrt{Re(x^{f}_{k})^{2}+Im(x^{f}_{k})^{2}}\\ phase_{k}&=atan2(Im(x^{f}_{k}),Re(x^{f}_{k}))\end{aligned}\right.$

where $Re$ and $Im$ represent the real and imaginary parts of the spectrum feature, and $atan2$ represents the two-argument arctangent function. Researchers believe that studying both magnitude and phase is important. For EMG signals, the magnitude and phase are highly correlated with muscle movement: muscles move both longitudinally and transversely according to the direction of the fibers, so the biological impedance of the motion units changes, leading to variations in amplitude and phase responses. Therefore, analyzing magnitude and phase can help the model capture muscle contraction patterns as part of the representation learning process (Wu et al., 2022). In EEG signals, the magnitude and phase are closely related to the phase synchronization between neurons, which can help reveal the synchrony and information transmission between different brain regions. The Fourier domain reconstruction task predicts the magnitude and phase sequences of the masked EEG patches, which are then reconstructed through the inverse Fourier transform. The mean squared error measures the difference between the original patches and those reconstructed from the magnitude and phase. Through this task, the encoder $f_{\theta}$ can learn the correlation between spectrum features and the temporal signal and capture critical knowledge of neuron activity.
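The round trip from a patch to its magnitude and phase and back can be checked numerically. The sketch below follows the cosine and sine terms of the DFT directly, with two stated assumptions: indices run from 0 to n-1 (the usual programming convention) and the phase uses the standard argument order atan2(imaginary, real). The function names and the eight-sample patch are hypothetical.

```python
import math

def dft_real_imag(x):
    """Real and imaginary parts of the DFT, written with explicit cos and sin terms."""
    n = len(x)
    re = [sum(x[j] * math.cos(2 * math.pi * j * k / n) for j in range(n)) for k in range(n)]
    im = [-sum(x[j] * math.sin(2 * math.pi * j * k / n) for j in range(n)) for k in range(n)]
    return re, im

def magnitude_phase(re, im):
    """Magnitude scaled by 1/n, and phase as atan2(imaginary part, real part)."""
    n = len(re)
    mag = [math.sqrt(r * r + i * i) / n for r, i in zip(re, im)]
    pha = [math.atan2(i, r) for r, i in zip(re, im)]
    return mag, pha

def reconstruct(mag, pha):
    """Rebuild the spectrum from magnitude and phase, then apply the inverse DFT."""
    n = len(mag)
    re = [n * m * math.cos(p) for m, p in zip(mag, pha)]
    im = [n * m * math.sin(p) for m, p in zip(mag, pha)]
    return [sum(re[k] * math.cos(2 * math.pi * j * k / n)
                - im[k] * math.sin(2 * math.pi * j * k / n) for k in range(n)) / n
            for j in range(n)]

patch = [0.0, 1.0, 0.5, -0.3, -1.2, 0.4, 0.9, -0.6]   # toy EEG patch
re, im = dft_real_imag(patch)
mag, pha = magnitude_phase(re, im)
patch_rec = reconstruct(mag, pha)
mse = sum((a - b) ** 2 for a, b in zip(patch, patch_rec)) / len(patch)
```

Since magnitude and phase together determine the spectrum, a model that predicts both accurately also determines the time-domain patch, which is why the MSE can be evaluated after the inverse transform.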

Spatial reconstruction (Spatial MAE) aims to learn the spatial correlation between different channels in the EEG signal (Ho and Armanfard, 2023). In this framework, the correlation between EEG channels is defined using a graph structure $\mathcal{G}=(\mathcal{A},\mathcal{X})$, where $\mathcal{X}\in\mathbb{R}^{c\times n}$ represents the node feature matrix of the graph (each channel corresponds to a node) and $\mathcal{A}\in\mathbb{R}^{c\times c}$ is the adjacency matrix representing the connectivity between nodes. The graph structure can be calculated from the spatial distance and correlation between channels. In the framework, a sub-graph $\mathcal{G}^{s}$ containing $n_{s}$ nodes and their connectivity structure is sampled. For the sampled sub-graph, the feature of a random node is masked and then reconstructed by the model from the adjacent node features and the graph structure, which trains the model to capture spatial correlations. The graph neural network (GNN) (Wu et al., 2020) is used as the backbone of the encoder and decoder to handle topological graph data, and the MSE is the loss that measures node reconstruction performance. By reconstructing the graph node, the generated representation enables a deeper exploration of spatial features and channel correlation, which is valuable for tasks that require high spatial resolution in EEG (such as visual decoding).
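The node-masking task can be illustrated without a neural network. In the sketch below, the adjacency matrix is built by thresholding the absolute Pearson correlation between channels, and a masked node is reconstructed as the mean of its neighbors. Both choices are simplifying assumptions that stand in for the graph construction and the GNN encoder-decoder of the cited framework, and every name and signal here is hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two channels."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)

def build_adjacency(X, threshold):
    """c x c binary adjacency: connect channels whose |correlation| reaches the threshold."""
    c = len(X)
    return [[1 if i != j and abs(pearson(X[i], X[j])) >= threshold else 0
             for j in range(c)] for i in range(c)]

def reconstruct_node(X, A, node):
    """Stand-in for the GNN: predict the masked node as the mean of its neighbors."""
    neighbors = [j for j, edge in enumerate(A[node]) if edge]
    return [sum(X[j][t] for j in neighbors) / len(neighbors) for t in range(len(X[node]))]

n = 200
base = [math.sin(2 * math.pi * t / 50) for t in range(n)]
ripple = [0.05 * math.cos(2 * math.pi * t / 10) for t in range(n)]
X = [
    base,                                                # channel 0 (will be masked)
    [b + r for b, r in zip(base, ripple)],               # channel 1: correlated with 0
    [b - r for b, r in zip(base, ripple)],               # channel 2: correlated with 0
    [math.cos(2 * math.pi * t / 20) for t in range(n)],  # channel 3: unrelated rhythm
]
A = build_adjacency(X, threshold=0.5)

masked_node = 0
prediction = reconstruct_node(X, A, masked_node)         # uses neighbor features only
mse = sum((p - v) ** 2 for p, v in zip(prediction, X[masked_node])) / n
```

The unrelated channel receives no edge to the masked node, so only the correlated channels contribute to the reconstruction.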

Transformation reconstruction aims to reconstruct EEG signals after different signal transformations to preserve critical signal-related information (Das et al., 2022). The model reconstructs the EEG signal after the following signal transformations: 1. Signal jitter, where random noise is added to the EEG samples: $x_{s}=x+s$. 2. Random sample, where some points in the temporal EEG signal are replaced by the average value of their neighboring points. This transformation can be regarded as a smoothing operation. 3. Channel removal, where a specific channel in the EEG signal is removed and then reconstructed. 4. Window replace, where the EEG signals in randomly selected time windows are replaced by the dummy value zero. 5. Jitter in time windows, where the signal in a randomly selected time window is corrupted by noise. By reconstructing raw signals from these transformations during pre-training, the model can fuse temporal correlation, spatial correlation, and transformation features into the representation for downstream tasks.
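The five transformations listed above can be written as small functions on a multi-channel signal stored as a list of channels. This is a sketch under assumptions: the function names, the Gaussian noise model, and the choice to represent a removed channel by zeros (which keeps the array shape fixed) are ours and are not taken from the cited work.

```python
import random

def signal_jitter(x, sigma, rng):
    """1. Add random noise to every sample: x_s = x + s."""
    return [[v + rng.gauss(0.0, sigma) for v in ch] for ch in x]

def neighbor_average(x, ratio, rng):
    """2. Replace a fraction of interior points by the mean of their two neighbors."""
    out = [list(ch) for ch in x]
    for c, ch in enumerate(x):
        interior = range(1, len(ch) - 1)
        for t in rng.sample(interior, int(ratio * len(interior))):
            out[c][t] = 0.5 * (ch[t - 1] + ch[t + 1])
    return out

def channel_removal(x, channel):
    """3. Remove one channel (represented here by zeros)."""
    return [[0.0] * len(ch) if c == channel else list(ch) for c, ch in enumerate(x)]

def window_replace(x, start, length):
    """4. Replace a time window by the dummy value zero."""
    return [[0.0 if start <= t < start + length else v for t, v in enumerate(ch)]
            for ch in x]

def window_jitter(x, start, length, sigma, rng):
    """5. Corrupt only one time window with noise."""
    return [[v + rng.gauss(0.0, sigma) if start <= t < start + length else v
             for t, v in enumerate(ch)] for ch in x]

rng = random.Random(0)
# Toy input: two channels with linear ramps, 20 samples each.
x = [[float(t) for t in range(20)], [2.0 * t for t in range(20)]]

jittered = signal_jitter(x, sigma=0.1, rng=rng)
smoothed = neighbor_average(x, ratio=0.3, rng=rng)
removed = channel_removal(x, channel=1)
zeroed = window_replace(x, start=5, length=4)
win_jit = window_jitter(x, start=5, length=4, sigma=0.1, rng=rng)
```

In each case the untransformed signal `x` is the reconstruction target. The ramp input makes the smoothing transformation easy to verify, because the average of the two neighbors of a point on a straight line equals the point itself.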

4.3. Generative Adversarial Method

The generative adversarial method encompasses two pretext tasks: the generation task, which continually generates fake EEG samples, and the adversarial task, which strives to distinguish real samples from fake ones (shown in Figure 7) (Creswell et al., 2018; Goodfellow et al., 2014). Through the self-supervised adversarial training of the generator and discriminator in the framework, the model can generate enhanced EEG samples. In the field of EEG, two kinds of generative adversarial networks (GANs) have been investigated, which are listed as follows:


Figure 7. The general framework of generative adversarial network (GAN) for EEG analysis

Sample generation method aims to produce new EEG samples through the generation and adversarial pretext tasks (Bhat and Hortal, 2021). This framework uses the generator $G$ and discriminator $D$ to accomplish the generative adversarial task. The inputs of $G$ are augmented EEG signals (e.g., masked signals) or random noise, and the outputs of $G$ are the generated fake EEG samples. The inputs of $D$ are the sample pairs $(x^{n},x)$, where $x^{n}$ is the generated fake sample and $x$ is the true EEG signal. The generator aims to produce pseudo-samples highly similar to real EEG samples, while the discriminator attempts to distinguish between real and fake samples accurately. Through adversarial training, the generator can produce highly realistic EEG samples for training, which helps alleviate EEG collection and labeling issues.
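The opposing objectives of $G$ and $D$ can be made concrete with the loss values alone. The sketch below uses the original GAN formulation (Goodfellow et al., 2014) with the commonly used non-saturating generator loss. This is an assumption for illustration: Wasserstein-style variants replace these losses with a critic score, and the function names and probability values are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: D should output 1 for real samples and 0 for generated ones."""
    real_term = -sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: G is rewarded when D rates its samples as real."""
    return -sum(math.log(p + EPS) for p in d_fake) / len(d_fake)

# D(x) on real EEG segments and D(G(z)) on generated ones (assumed values).
confident_d = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
chance_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

weak_g = generator_loss(d_fake=[0.1, 0.05])    # D easily spots the fakes
strong_g = generator_loss(d_fake=[0.9, 0.95])  # D is fooled
```

When the discriminator cannot do better than chance, its loss equals $2\ln 2$, which corresponds to the equilibrium at which generated samples are indistinguishable from real ones.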

Discriminator-based GAN is another generative adversarial method that extracts discriminative representations from EEG signals (Fu et al., 2022). It focuses on the discriminator to extract efficient features. By distinguishing real samples from fake ones, the discriminator can learn critical invariant and discriminative features of EEG signals. After adversarial training, the discriminator can serve as the encoder that extracts pre-trained EEG features and generates representations for downstream tasks.

4.4. Section Discussion

This section reviews the generative-based SSL EEG analysis methods, which conduct complex generative pretext tasks to train the encoder to capture effective signal features for downstream tasks. The existing methods are categorized into three sub-categories: 1) The temporal reconstruction task, which masks part of the temporal signal and requires the model to reconstruct it. 2) The multi-domain reconstruction task, which masks temporal-frequency features and requires the model to reconstruct them. 3) The adversarial generative task, which generates pseudo samples with the generator and requires the discriminator to distinguish real samples from fake ones. Compared with the predictive tasks, generative tasks are more challenging and need more trainable parameters and more complex structures, but they can learn more efficient features in the representation. Emulating MAE, BERT, and other generative SSL methods in the vision and language fields, generative SSL methods for EEG signals have achieved significant success in various downstream tasks.

Table 2. Summary of generative-based SSL methods for EEG analysis. "PT" represents the pre-training and fine-tuning mode, "UT" represents the unsupervised training mode, "JT" represents the joint-training mode, and "SA" represents sample augmentation.

| Approach | Sub-category | Detailed method | Backbone | Downstream Tasks | Training Mode |
| --- | --- | --- | --- | --- | --- |
| BENDR (Kostas et al., 2021) | Temporal reconstruction | Embedding MAE | CNN&Transformer | Multiple tasks | PT&UT |
| GANSER (Zhang et al., 2022d) | Generative adversarial | Sample generation | CNN(U-NET) | Emotion recognition | SA |
| EEG-CGS (Ho and Armanfard, 2023) | Multi-domain reconstruction | Spatial MAE | GNN | Seizure analysis | PT |
| Eeg2vec (Zhu et al., 2023) | Temporal reconstruction | Embedding MAE | CNN&Transformer | Speech decoding | UT |
| MAEEG (Chien et al., 2022) | Temporal reconstruction | Signal MAE | Transformer | Sleep classification | PT&UT |
| Cognitive MAE (Pulver et al., 2023) | Temporal reconstruction | Embedding reconstruction | CNN&Transformer | Cognitive-load classification | PT&UT |
| SSLAPP (Lee et al., 2022) | Generative adversarial | Sample augmentation | Transformer | Sleep classification | SA |
| MV-SSTMA (Li et al., 2022c) | Multi-domain reconstruction | STF MAE | CNN&Transformer | Emotion recognition | PT |
| EpilepsyNet (Huang et al., 2023) | Temporal reconstruction | EEG-based autoencoder | CNN | Epileptic classification | JT |
| EEGMAE (Chen et al., 2023) | Multi-domain reconstruction | STF MAE | ViT | ASD classification | UT&PT |
| brain2vec (Lesaja et al., 2022) | Temporal reconstruction | Embedding MAE | CNN&Transformer | Speech decoding | PT&UT |
| Wavelet2vec (Peng et al., 2023) | Multi-domain reconstruction | FT MAE | ViT | Seizure detection | PT |
| CRT (Zhang et al., 2022b) | Multi-domain reconstruction | STF MAE | Transformer | Sleep classification | UT |
| Neuro2vec (Wu et al., 2022) | Multi-domain reconstruction | FT MAE | Transformer | Seizure & sleep | PT&UT |
| SDCAN (Fu et al., 2022) | Generative adversarial | Discriminator-based | CNN | Stress classification | JT |
| WGAN-GP (Bhat and Hortal, 2021) | Generative adversarial | Sample generation | CNN | Emotion recognition | SA |
| CWGAN (Jiao et al., 2020) | Generative adversarial | Sample generation | LSTM | Sleep classification | SA |
| SAE-EEG (Liu et al., 2020) | Temporal reconstruction | EEG-based autoencoder | CNN | Emotion recognition | PT |
| AE-CDNN (Wen and Zhang, 2018) | Temporal reconstruction | EEG-based autoencoder | CNN | Seizure detection | UT |
| MI-AE (Mirzaei and Ghasemi, 2021) | Temporal reconstruction | EEG-based autoencoder | CNN | Motor imagery | UT |

5. Contrastive-based SSL EEG Analysis Method

Contrastive learning is the most widely used SSL technique in EEG analysis. Contrastive learning frameworks combined with EEG augmentation methods have been investigated to generate representations that integrate invariant features between positive pairs while eliminating irrelevant features between negative pairs. The goal of contrastive learning is to encourage the model to pull positive pairs (similar samples) closer together and push negative samples apart in the representation space, which is defined as follows:

(15) $\mathcal{L}_{con}\overset{\mathrm{def}}{=}\max(d(x^{+},x)-d(x^{-},x)+\alpha,0)$

This loss function is the triplet loss (Schroff et al., 2015), which trains the model to achieve $d(x^{-},x)>d(x^{+},x)+\alpha$, where $\alpha$ is a small positive number to avoid clustering overfitting. Different augmentation methods are applied to EEG signals to form positive and negative pairs. According to the type of augmentation method used to generate positive and negative sample pairs, we categorize the contrastive-based SSL EEG analysis methods into five sub-categories: (1) contrastive predictive coding, (2) transformation contrastive learning, (3) spatial contrastive learning, (4) composite contrastive learning, and (5) task-oriented contrastive learning. The typical frameworks of the different kinds of methods are demonstrated in Figure 8 to Figure 12, and the summary of existing works is listed in Table 3.
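A short numerical sketch shows how the triplet loss in Equation (15) behaves. The Euclidean distance, the margin value of 0.2, and the two-dimensional toy representations are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Distance d(., .) between two representations (Euclidean by assumption)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Equation (15): max(d(x+, x) - d(x-, x) + alpha, 0)."""
    return max(euclidean(positive, anchor) - euclidean(negative, anchor) + alpha, 0.0)

anchor = [0.0, 0.0]
near = [0.1, 0.0]
far = [1.0, 0.0]

# The positive is close and the negative is far: the margin is satisfied.
satisfied = triplet_loss(anchor, positive=near, negative=far)
# Roles swapped: the positive is farther away than the negative.
violated = triplet_loss(anchor, positive=far, negative=near)
```

The loss vanishes once the negative sample is farther from the anchor than the positive sample by at least the margin $\alpha$, so triplets that are already well separated stop contributing gradients.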

5.1. Contrastive Predictive Coding

Contrastive Predictive Coding (CPC) is a self-supervised learning technique used in NLP and CV for learning high-level representations (Oord et al., 2018; Henaff, 2020). In CPC, data are divided into overlapping context windows, which are used to generate positive and negative pairs. The main idea of CPC is to generate a representation of the context window that can accurately predict the representations of future windows, thereby extracting shared invariant features. In the EEG field, two different CPC methods have been investigated:

Figure 8. The framework of contrastive predictive coding (CPC) for EEG analysis.

EEG-based CPC extends CPC to EEG analysis (Banville et al., 2021), as shown in Figure 8. This method divides EEG signals into time slices through sliding windows. The context window $X_{c}$, which contains $N_{c}$ samples, is defined as $X_{c}=\{x_{t_{i}-N_{c}+1},...,x_{t_{i}}\}$, where $t_{i}$ is the temporal index. Similarly, the following prediction window $X_{p}$ is defined as $X_{p}=\{x_{t_{i}+1},...,x_{t_{i}+N_{p}}\}$, where $N_{p}$ is the length of the prediction window.
The encoder $g_{en}$ calculates the representation $z_{t}=g_{en}(x_{t})$ for the context window and the prediction window, generating the context and prediction representation sequences $Z_{c}=\{z_{t_{i}-N_{c}+1},...,z_{t_{i}}\}$ and $Z_{p}=\{z_{t_{i}+1},...,z_{t_{i}+N_{p}}\}$, respectively. The integrated feature $c_{t_{i}}$ is calculated by a GRU-based autoregressive encoder $g_{ar}$ that summarizes the information of the representations within the context window.
$c_{t_{i}}$ is used to predict future representations in the prediction window through the weights $W_{k},k\in[1,N_{p}]$, where $W_{k}c_{t_{i}}$ is the prediction for $z_{t_{i}+k}$ from the contextual feature $c_{t_{i}}$. Positive and negative pairs are then constructed: the predicted representation $W_{k}c_{t_{i}}$ forms a positive pair with the corresponding original representation $z_{t_{i}+k}$, while forming negative pairs with the remaining representations. The loss function is described as follows:

(16) $$\mathcal{L}_{CPC} = -\frac{1}{|\mathcal{B}|} \sum_{t_i \in \mathcal{B}} \sum_{k=1}^{N_p} \log \frac{\exp(s(x_{t_i + k}, W_k c_{t_i}))}{\exp(s(x_{t_i + k}, W_k c_{t_i})) + \sum_{j \in N_e} \exp(s(x_j, W_k c_{t_i}))}$$

where $\mathcal{B}$ is the sample batch and $|\mathcal{B}|$ is the batch size. $N_e$ indexes the negative samples of $W_k c_{t_i}$, where $j \neq k + t_i$. By minimizing the contrastive loss, the model extracts invariant temporal features and integrates long-term temporal dependencies within EEG signals into its representations. This maximizes the correlation between the representation and the raw EEG signal, preserving critical signal information for EEG-based downstream tasks.
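As a rough illustration of Eq. (16), the following NumPy sketch computes the CPC loss for a single context window. It is a sketch under stated assumptions, not the implementation of Banville et al.: the similarity $s$ is assumed to be a dot product, the context feature is passed in as a given vector rather than produced by a GRU, and the names `cpc_loss`, `W`, `z_future`, and `z_neg` are hypothetical.

```python
import numpy as np

def cpc_loss(c, W, z_future, z_neg):
    """InfoNCE-style CPC loss for one context window (illustrative only).

    c        : (d_c,)          context feature c_{t_i}, assumed precomputed
    W        : (N_p, d_z, d_c) one prediction matrix W_k per future step
    z_future : (N_p, d_z)      true future representations (positives)
    z_neg    : (N_e, d_z)      negative representations
    """
    preds = W @ c                               # (N_p, d_z): W_k c_{t_i} for each k
    pos = np.sum(preds * z_future, axis=1)      # (N_p,) dot-product score of positives
    neg = preds @ z_neg.T                       # (N_p, N_e) scores of negatives
    logits = np.concatenate([pos[:, None], neg], axis=1)
    # Log-softmax over [positive, negatives]; subtract the max for stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob_pos.sum()                  # sum over k = 1..N_p
```

Averaging this value over all context windows in a batch would give the outer $\frac{1}{|\mathcal{B}|}$ term of the loss.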

EEG-based bidirectional contrastive predictive coding (BCPC) extends CPC to extract bidirectional temporal correlations in EEG signals (Chen et al., 2022b). Unlike CPC, BCPC adds a backward prediction window to the framework, representing the EEG signal that precedes the context window in time. The contextual feature $c_{t_i}$ is used to predict the representations in both the prediction window and the backward prediction window to construct positive and negative pairs. By introducing these reverse positive and negative sample pairs into contrastive learning, the bidirectional model captures contextual features with temporal semantic information from both directions of the EEG signal.

5.2. Transformation Contrastive Learning

The transformation contrastive learning method is inspired by typical contrastive learning frameworks in CV, such as SimCLR (Chen et al., 2020) and MoCo (He et al., 2020). EEG signals are augmented into positive and negative sample pairs through signal transformation methods designed according to the characteristics of temporal physiological signals. The framework is shown in Figure 9. Multiple transformation contrastive learning methods have been studied to solve different downstream tasks; the typical frameworks are listed as follows:

Figure 9. The framework of transformation contrastive method to capture invariant signal temporal features from unlabeled signals.

Signal-transformation contrastive emulates the typical SimCLR framework to conduct EEG contrastive learning (Mohsenvand et al., 2020). For a randomly selected EEG sample $x_t$, different transformation methods are employed to generate the augmentations $T_1(x_t)$ and $T_2(x_t)$. This method leverages the idea that augmentations of the same sample carry similar information and form positive pairs, while augmentations of distinct samples are significantly dissimilar and form negative pairs. The encoder $f_\theta$ generates representations, and the pretext task decoder $g_\delta$ maps the representations into the loss space to calculate the contrastive loss. For a batch of size $|\mathcal{B}|$, the loss function is defined as follows:

(17) $$\mathcal{L} = -\frac{1}{|\mathcal{B}|} \sum_{t=0}^{|\mathcal{B}|} \log \frac{\exp(\mathrm{sim}(z_{t_1}, z_{t_2})/\tau)}{\sum_{i=1}^{2k} \mathbbm{1}_{[i \neq t]} \exp(\mathrm{sim}(z_{t_1}, z_i)/\tau)}$$

where $z_{t_1}$ and $z_{t_2}$ are generated by $g_\delta^p(f_\theta(T_1(x_t)))$ and $g_\delta^p(f_\theta(T_2(x_t)))$, respectively, and denote the representations of augmentations of the same sample. $\mathbbm{1}_{[i \neq t]} \in \{0, 1\}$ is the indicator function, and $\tau$ is the temperature parameter. By minimizing this loss function, the model optimizes the representation space to capture discriminative representations. In the EEG analysis domain, augmentation can use the signal transformation methods mentioned in Section 3.3; other EEG signal augmentation methods are listed as follows:

(1) Cutout & resize divides the EEG signal into segments and randomly discards one segment (the "cut out" operation). The remaining segments are then concatenated and resized to the length of the original sample.
(2) Crop & resize divides the EEG signal into segments, randomly chooses one segment, and resizes it to the length of the original sample.
(3) Average filter, regarded as a smoothing operation, replaces some points in the signal with the average value of several neighboring points.
(4) Amplitude scaling scales the temporal amplitude of the original EEG signal. Prior research suggests a scale value between 0.5 and 2 (Mohsenvand et al., 2020).
(5) Time shift shifts the EEG segment along the time dimension, representing a horizontal offset in temporal sampling.
(6) Direct-current shift shifts the EEG segment along the voltage dimension, representing a magnitude offset in temporal sampling.

By conducting contrastive learning on these transformation-augmented EEG signals, the model can learn invariant EEG features and the latent knowledge they carry.
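Several of the augmentations above can be sketched in a few lines of NumPy. The snippet below is only illustrative: the function names, the default offset and shift ranges, the circular handling of the time shift, and the use of linear interpolation for resizing are all assumptions rather than settings reported in the cited work. Only the 0.5 to 2 scaling range comes from the text. Samples are assumed to be arrays of shape (channels, time).

```python
import numpy as np

def amplitude_scale(x, rng, low=0.5, high=2.0):
    """Multiply the whole segment by one random factor in [low, high]."""
    return x * rng.uniform(low, high)

def dc_shift(x, rng, max_offset=10.0):
    """Add one random constant offset (the offset range is an assumption)."""
    return x + rng.uniform(-max_offset, max_offset)

def time_shift(x, rng, max_shift=50):
    """Circularly shift the segment along the time axis (an assumed boundary rule)."""
    return np.roll(x, rng.integers(-max_shift, max_shift + 1), axis=-1)

def crop_and_resize(x, rng, n_segments=4):
    """Keep one random segment and linearly interpolate it back to full length."""
    n_t = x.shape[-1]
    seg_len = n_t // n_segments
    start = rng.integers(0, n_segments) * seg_len
    seg = x[:, start:start + seg_len]
    old_grid = np.linspace(0.0, 1.0, seg_len)
    new_grid = np.linspace(0.0, 1.0, n_t)
    # Interpolate each channel independently back to the original length.
    return np.stack([np.interp(new_grid, old_grid, ch) for ch in seg])
```

Two different transformations applied to the same sample, for example `amplitude_scale(x, rng)` and `time_shift(x, rng)`, would then play the roles of $T_1(x_t)$ and $T_2(x_t)$.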

Non-negative EEG contrastive is a contrastive framework without negative samples (Yang et al., 2023). In traditional contrastive learning, the quantity and quality of negative samples play a crucial role in determining the effectiveness of contrastive learning. In this framework, $z_i$ and $z_j$ represent the anchor and its positive sample obtained through augmentation. To reduce the impact of negative pairs, this method proposes the world representation $z_w$, which represents the average information of EEG signals, where $z_w = E_{k \sim p(\cdot)}[z_k]$ is the expectation of the representation $z_k$ of a sample drawn at random from the distribution $p(\cdot)$. Based on the idea that the similarity between positive pairs should be greater than the similarity between the anchor sample and the world representation, the loss function is designed as follows:

(18) $$l(i, j) = s(z_i, z_w) + \epsilon - s(z_i, z_j)$$

where $\epsilon$ is the empirical margin, and $s(\cdot, \cdot)$ is the Gaussian kernel that measures the similarity between input representations. By minimizing the loss, the model makes the similarity between the anchor and its positive sample greater than the similarity between the anchor and the world representation, learning consistent EEG information across samples without human labels. In addition, some EEG analysis methods integrate non-negative contrastive frameworks from CV, such as Barlow Twins (Zbontar et al., 2021) and BYOL (Grill et al., 2020), to conduct EEG-based non-negative contrastive learning, where all augmented samples form positive pairs with the anchor sample and a well-designed loss function extracts invariant features from positive pairs only. These methods are also non-negative contrastive frameworks, but they do not use a world representation.
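A minimal sketch of Eq. (18) is given below. It assumes that the world representation $z_w$ is approximated by the mean of the anchor representations in the current batch, which is a stand-in for the expectation over $p(\cdot)$ and not necessarily what Yang et al. do. The kernel bandwidth `sigma`, the margin value, and the function names are likewise hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) similarity between matching rows of a and b."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2.0 * sigma ** 2))

def world_margin_loss(z_anchor, z_pos, eps=0.1, sigma=1.0):
    """Batch mean of l(i, j) = s(z_i, z_w) + eps - s(z_i, z_j).

    z_anchor, z_pos : (B, d) anchor and positive representations.
    z_w is approximated by the batch mean of the anchors (an assumption).
    """
    z_world = z_anchor.mean(axis=0, keepdims=True)
    s_world = gaussian_kernel(z_anchor, z_world, sigma)   # anchor vs. world
    s_pos = gaussian_kernel(z_anchor, z_pos, sigma)       # anchor vs. positive
    return float(np.mean(s_world + eps - s_pos))
```

The loss decreases as each positive moves closer to its anchor than the anchor is to the average representation, which is the ordering the text describes.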

5.3. Spatial Contrastive Learning

The spatial contrastive learning method shown in Figure 10 focuses on spatial information and applies channel-level spatial augmentation techniques (e.g., jigsaw, meiosis) to EEG signals to construct positive and negative sample pairs, from which the model can integrate spatial features and channel correlations into the representation. Typical methods are listed as follows:

Spatial shuffle contrastive method applies channel shuffling to construct positive and negative pairs (Li et al., 2022a). In this method, the EEG signal $x \in \mathbb{R}^{c \times t}$ is augmented through spatial shuffle: EEG channels are grouped into brain regions based on their spatial positions, generating $X^B = \{X^1, X^2, \ldots, X^M\}$, where $M$ is the number of brain regions and each region in $X^B$ contains features from multiple channels. $X^B$ is randomly shuffled and reassembled into the augmented EEG sample $X^* \in \mathbb{R}^{c \times t}$. Each sample generates two shuffled augmentations that form a positive pair, and shuffled augmentations from different samples form negative pairs. The InfoNCE loss described in Equation (16) serves as the loss function for model training. By contrasting the shuffle-augmented EEG samples, the model learns the relationships between spatial channel location and signal features.
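The region-level shuffle can be sketched as follows. The grouping of channels into brain regions is passed in as a hypothetical `regions` argument (a list of channel-index lists), since the actual montage-to-region mapping is not specified here, and `region_shuffle` is an illustrative name.

```python
import numpy as np

def region_shuffle(x, regions, rng):
    """Shuffle the order of brain regions and reassemble a (c, t) sample.

    x       : (c, t) EEG sample
    regions : list of lists of channel indices, one list per brain region
              (a hypothetical grouping supplied by the caller)
    """
    order = rng.permutation(len(regions))
    # Channels keep their order inside a region; only region order changes.
    new_channel_order = [ch for r in order for ch in regions[r]]
    return x[new_channel_order, :]
```

Calling `region_shuffle` twice on the same sample gives the two views of a positive pair, while views of different samples serve as negatives.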

Figure 10. The framework of spatial contrastive method. In this framework, various channel-level spatial augmentation methods (e.g., channel shuffle, channel meiosis) are used to construct positive and negative sample pairs for contrastive learning, where the model can capture invariant spatial features and channel correlations from unlabeled EEG samples.

Graph contrastive method mines the relationships between channels using a graph structure (Ho and Armanfard, 2023; Ye et al., 2023). In this framework, EEG signals are embedded as node features in the graph, and the edges between nodes are calculated from channel correlation or spatial distance. Let $\mathcal{G}$ be the generated graph, $\mathcal{V}$ the node set, and $\mathcal{E}$ the edge set. Two augmentation methods are employed for contrastive learning: (1) Node dropping. For a sample $\mathcal{G}_t$, two augmented samples $\mathcal{G}^1_t$ and $\mathcal{G}^2_t$ are generated by randomly dropping nodes and their edges according to the dropping rate $r\%$. Augmentations from the same sample form positive pairs, and augmentations from different samples form negative pairs. (2) Sub-graph augmentation. For each node $v_i$ in the sample $\mathcal{G}_t$, two positive samples and one negative sample are constructed: the random walk with restart algorithm generates the positive sub-graphs $\mathcal{G}^+_{i,1}$ and $\mathcal{G}^+_{i,2}$ centered at the selected node $v_i$, with the radius parameter $ra$ controlling their size, and the negative sub-graph $\mathcal{G}^-_i$ centered at the node farthest from the selected node. In the positive sub-graphs, the features of the target node are masked with zeros to avoid interference from target node information. The sub-graph representations are encoded by a GNN with the trainable weight $W_e$, where $re^+_{i,1}$ and $re^+_{i,2}$ are the representations of the positive samples and $re^-_{i,1}$ is the representation of the negative sample. The embedding of the selected node is also calculated with $W_e$, as $e_i = \mathrm{ReLU}(v_i W_e)$. A trainable score matrix $W_s$ is then designed to quantify the similarity between the selected node and its sub-graphs, which is described as follows:

(19) $$S^+_{i,j} = \sigma(e_i W_s \, re^+_{i,j})$$

where $\sigma$ represents the logistic function. The contrastive loss is designed to maximize the correlation between the node embeddings and their positive samples, which pulls the representation of each channel in latent space closer to similar channels. The loss function is defined as:

(20) $$\mathcal{L} = -\frac{1}{2c|\mathcal{B}|} \sum_{j=1}^{2} \sum_{i=1}^{c} \left( \log(S^+_{i,j}) + \log(1 - S^-_{i,1}) \right)$$

where $c$ is the number of channels and $|\mathcal{B}|$ is the batch size. By minimizing this loss, the model focuses on channel-level spatial features, which strengthens its ability to exploit the spatial resolution of EEG and benefits downstream tasks involving multiple channels and complex channel configurations.
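The scoring step of Eq. (19) and the loss of Eq. (20) can be sketched for a single graph as below. This is an illustration under assumptions: the GNN that encodes the sub-graphs is not shown, so the node embeddings and sub-graph representations are taken as given arrays, the negative score is assumed to use the same bilinear form as the positive score, and all function and argument names are hypothetical.

```python
import numpy as np

def sigmoid(u):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-u))

def subgraph_contrastive_loss(e, re_pos, re_neg, W_s):
    """Sub-graph contrastive loss for a single graph (batch size 1).

    e      : (c, d)    embeddings e_i of the c target nodes
    re_pos : (2, c, d) representations of the two positive sub-graphs per node
    re_neg : (c, d)    representation of the negative sub-graph per node
    W_s    : (d, d)    score matrix (a plain array here, trainable in practice)
    """
    proj = e @ W_s                                           # (c, d): e_i W_s
    s_pos = sigmoid(np.einsum("id,jid->ji", proj, re_pos))   # (2, c): S^+_{i,j}
    s_neg = sigmoid(np.sum(proj * re_neg, axis=1))           # (c,):   S^-_{i,1}
    c = e.shape[0]
    # Binary cross-entropy form: positives pushed to 1, negatives to 0.
    return -np.sum(np.log(s_pos) + np.log(1.0 - s_neg)[None, :]) / (2 * c)
```

For a batch of graphs, averaging this value over the batch would supply the $\frac{1}{|\mathcal{B}|}$ factor.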

EEG meiosis contrastive method combines the meiosis augmentation technique with a contrastive learning framework to integrate invariant channel features into the representation (Guo et al., 2023). The meiosis data augmentation technique is used to generate contrastive pairs: two EEG samples are randomly sampled into the group $X^g_i = \{A_i, B_i\}$, where $A_i, B_i \in \mathbb{R}^{c \times t}$, $c$ is the number of channels, and $t$ is the number of sampling points. Within the group, $A_i = \{a_1, a_2, a_3, \ldots, a_t\}$ and $B_i = \{b_1, b_2, b_3, \ldots, b_t\}$ undergo meiosis with each other, which signifies data exchange between $A^p$ and $B^p$, generating the augmented samples $V^1_i = \{a_1, a_2, a_3, \ldots, a_i, b_{i+1}, b_{i+2}, \ldots, b_c\}$ and $V^2_i = \{b_1, b_2, b_3, \ldots, b_i, a_{i+1}, a_{i+2}, \ldots, a_c\}$. The EEG samples $A_i$ and $B_i$ are recorded under the same stimulus/event to increase the contrasting complexity. All training samples are augmented and transformed into $V^1_i$ and $V^2_i$ for contrast, and the sample feature representation is generated through the encoder $f_\theta$ and the projector (pretext task decoder) $g_\delta^p$ as $z^1_i = g^p_\delta(f_\theta(V^1_i))$. In the framework, $z^1_i$ and $z^2_i$ form the positive pair, indicating samples that exchanged EEG signals with each other, while $z^1_i$ and $z^2_j$ form a negative pair ($i \neq j$). The loss function is then defined as follows:

(21) $$\begin{aligned} L = -\frac{1}{2} \Bigg( &\frac{1}{|\mathcal{B}|} \sum_{i=0}^{|\mathcal{B}|} \log \frac{\exp(s(z^1_i, z^2_i)/\tau)}{\sum_{j=0}^{|\mathcal{B}|} \mathbbm{1}_{[j \neq i]} \exp(s(z^1_i, z^1_j)/\tau) + \sum_{j=0}^{|\mathcal{B}|} \exp(s(z^1_i, z^2_j)/\tau)} \\ + &\frac{1}{|\mathcal{B}|} \sum_{i=0}^{|\mathcal{B}|} \log \frac{\exp(s(z^1_i, z^2_i)/\tau)}{\sum_{j=0}^{|\mathcal{B}|} \mathbbm{1}_{[j \neq i]} \exp(s(z^2_i, z^2_j)/\tau) + \sum_{j=0}^{|\mathcal{B}|} \exp(s(z^1_j, z^2_i)/\tau)} \Bigg) \end{aligned}$$

where $s(\cdot, \cdot)$ is the function that measures the similarity between representations, and $\mathbbm{1}_{[j \neq i]} \in \{0, 1\}$ is the indicator function that equals 0 when $i = j$. The proposed contrastive loss aims to minimize the distance between mutually coupled sample pairs $(V^1_i, V^2_i)$ and maximize the distance between sample pairs without mutual coupling: $(V^1_i, V^1_j)$, $(V^2_i, V^2_j)$, and $(V^1_i, V^2_j)$, where $i \neq j$. By minimizing the loss function, the model is trained to comprehend specific and coherent channel features and to discriminate homologous EEG channel data. This can be regarded as the model capturing knowledge of the EEG channel distribution, which is beneficial for EEG-based tasks.
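The following sketch illustrates one reading of the meiosis augmentation and of the symmetric loss in Eq. (21). It assumes that the exchange happens along the channel axis at a caller-supplied split index, that $s(\cdot, \cdot)$ is cosine similarity, and that every term of the denominator is exponentiated as in a standard softmax. None of these choices is confirmed by the description above, and the function names are hypothetical.

```python
import numpy as np

def meiosis_pair(a, b, split):
    """Exchange the channels after index `split` between two (c, t) samples."""
    v1 = np.concatenate([a[:split], b[split:]], axis=0)
    v2 = np.concatenate([b[:split], a[split:]], axis=0)
    return v1, v2

def meiosis_contrastive_loss(z1, z2, tau=0.5):
    """Symmetric contrastive loss over coupled views (illustrative only).

    z1, z2 : (B, d) projected representations; (z1[i], z2[i]) is the positive pair.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    n = z1.shape[0]
    off_diag = 1.0 - np.eye(n)                  # indicator 1[j != i]
    s11 = np.exp(z1 @ z1.T / tau) * off_diag    # same-view terms, self excluded
    s22 = np.exp(z2 @ z2.T / tau) * off_diag
    s12 = np.exp(z1 @ z2.T / tau)               # cross-view terms, positives on the diagonal
    pos = np.diag(s12)
    loss_1 = -np.log(pos / (s11.sum(axis=1) + s12.sum(axis=1)))
    loss_2 = -np.log(pos / (s22.sum(axis=1) + s12.sum(axis=0)))
    return 0.5 * (loss_1.mean() + loss_2.mean())
```

In a training loop, each group $\{A_i, B_i\}$ would be passed through `meiosis_pair`, encoded and projected, and the resulting batches `z1` and `z2` fed to the loss.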

5.4. Composite Contrastive Learning

Composite contrastive learning is a more complex framework that augments EEG signals into multiple views or domains and conducts cross-view and cross-domain contrastive learning, extracting more expressive representations that integrate signal-specific knowledge. Figure 11 shows an example of a typical framework. Existing composite EEG contrastive learning frameworks are listed as follows:

The frequency-temporal contrastive method conducts contrastive learning in both the temporal and frequency domains. Two different frequency-temporal contrastive strategies have been investigated:

Figure 11. An example of the composite contrastive learning framework (temporal-frequency contrastive method).

(1) The complementary strategy conducts cross-view contrastive learning so that complementary information in different views is not ignored (Kumar et al., 2022). The EEG signal $x_{i}$ is augmented into $x_{i,1}$ and $x_{i,2}$ through signal transformations, which are then mapped into the temporal and spectral domains independently, generating temporal components $x^{t}_{i,1}$, $x^{t}_{i,2}$ and spectral components $x^{s}_{i,1}$, $x^{s}_{i,2}$. The augmentations are subsequently processed by the temporal encoder $f_{\theta}^{t}$ and the spectral encoder $f_{\theta}^{s}$ to construct the representations $z^{s}_{i,1}$ and $z^{t}_{i,1}$ from augmentation $x_{i,1}$, and $z^{s}_{i,2}$ and $z^{t}_{i,2}$ from augmentation $x_{i,2}$. Four losses are combined to train the model:

1. Temporal contrastive loss, denoted as $\mathcal{L}_{tt}$. The temporal representations generated from the same sample form positive pairs $\{z^{t}_{i,1},z^{t}_{i,2}\}$, and those generated from different samples form negative pairs $\{z^{t}_{i,1},z^{t}_{j,2}\}, i\neq j$, with InfoNCE as the loss function.
2. Spectrum contrastive loss, denoted as $\mathcal{L}_{ss}$, calculated from the spectral augmented representations in the same way as the temporal contrastive loss.
3. Mixing contrastive loss, denoted as $\mathcal{L}_{gg}$. The spectral and temporal augmented representations are concatenated to form the mixing augmented representation $z^{g}_{i,1}=cat(z^{t}_{i,1},z^{s}_{i,1})$, where $cat$ represents the concatenation operation. This loss is calculated in the same way as the first two losses.
4. Complementary loss, denoted as $\mathcal{L}_{D}$. The above losses may narrow the distance between representations, losing the complementary features of each view. Therefore, the complementary loss is designed to pull corresponding augmented samples in the same view closer while pushing away the corresponding augmented samples in different views. Assuming $z_{i}=\{z^{t}_{i,1},z^{t}_{i,2},z^{s}_{i,1},z^{s}_{i,2}\}$, the complementary loss is defined as:

(22) $\left\{\begin{aligned} &l_{d}(z_{i},j,k)=-\log\frac{\exp(s(z_{i}[j],z_{i}[k])/\tau)}{\sum_{q=1}^{4}\mathbbm{1}_{[q\neq j]}\exp(s(z_{i}[j],z_{i}[q])/\tau)}\\ &\mathcal{L}_{D}=\frac{1}{4|\mathcal{B}|}\sum_{i=0}^{|\mathcal{B}|}\big(l_{d}(z_{i},1,2)+l_{d}(z_{i},2,1)+l_{d}(z_{i},3,4)+l_{d}(z_{i},4,3)\big)\end{aligned}\right.$

where $s(\cdot,\cdot)$ is the similarity function. The loss functions are combined to train the model to extract multi-domain features while preserving domain-specific and complementary features, which can be described as follows:

(23) $\mathcal{L}_{con}=\lambda_{1}(\mathcal{L}_{tt}+\mathcal{L}_{ss}+\mathcal{L}_{gg})+\lambda_{2}\mathcal{L}_{D}$

where $\lambda_{1}$ and $\lambda_{2}$ are the hyperparameters that balance the different contrastive losses.
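As a rough illustration of Eqs. (22) and (23), the following NumPy sketch computes the complementary loss for a batch and combines it with the other three losses. It is a minimal sketch under stated assumptions, not the implementation of Kumar et al.: cosine similarity is assumed for $s$, the four representations of each sample are assumed to be ordered as $[z^{t}_{i,1}, z^{t}_{i,2}, z^{s}_{i,1}, z^{s}_{i,2}]$ with 0-based indices, and the function names and the default $\tau$ are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (assumed choice for s)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def l_d(z, j, k, tau=0.5):
    """Per-sample term of Eq. (22): anchor z[j], positive z[k] (0-based)."""
    sims = np.array([cosine_sim(z[j], z[q]) / tau for q in range(4)])
    mask = np.arange(4) != j  # indicator 1[q != j]
    return float(-(sims[k] - np.log(np.exp(sims[mask]).sum())))

def complementary_loss(batch, tau=0.5):
    """batch has shape (B, 4, dim), ordered [z_t1, z_t2, z_s1, z_s2] (assumed)."""
    # Same-view pairs: (t1, t2), (t2, t1), (s1, s2), (s2, s1)
    pairs = [(0, 1), (1, 0), (2, 3), (3, 2)]
    total = sum(l_d(z, j, k, tau) for z in batch for j, k in pairs)
    return total / (4 * len(batch))

def composite_loss(l_tt, l_ss, l_gg, l_D, lam1=1.0, lam2=1.0):
    """Weighted combination of Eq. (23)."""
    return lam1 * (l_tt + l_ss + l_gg) + lam2 * l_D
```

In this sketch the loss is small when the two temporal representations agree with each other, the two spectral representations agree with each other, and the temporal and spectral views stay apart, which matches the stated purpose of preserving view-specific information.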

(2) The consistent strategy focuses on extracting consistent information between temporal and frequency representations through contrastive learning (Zhang et al., 2022c). Unlike the complementary strategy, it aims to maximize the mutual information between the temporal and frequency representations, aligning them in a latent feature space to extract multi-domain coherent features. The consistent loss function is described as follows:

(24) $\mathcal{L}_{c}=\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\sum_{Sim^{*}}(Sim^{t_{1},f_{1}}-Sim^{*}+\delta),\quad Sim^{*}\in\{Sim^{t_{1},f_{2}},Sim^{t_{2},f_{1}},Sim^{t_{2},f_{2}}\}$

where $\delta$ is a hyperparameter. In this loss function, $Sim^{t_{1},f_{1}}_{i}=d(z^{t}_{i,1},z^{f}_{i,1})$ measures the similarity between representations, where $z^{t}_{i,1}$ and $z^{f}_{i,1}$ are the temporal and frequency representations generated from sample $x_{i}$, and $z^{t}_{i,2}$ and $z^{f}_{i,2}$ are generated from the augmented sample $x^{*}_{i}$. By minimizing this loss, the frequency and temporal representations of a sample are pulled closer in the latent space, which mines multi-domain consistent features.
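Eq. (24) can be sketched in a few lines of NumPy. This is only an illustration under assumptions that the text does not fix: $d$ is taken to be the Euclidean distance (so a smaller value means closer representations), the sum is used exactly as written without any clamping at zero, and the function name and default $\delta$ are hypothetical.

```python
import numpy as np

def consistency_loss(z_t1, z_f1, z_t2, z_f2, delta=1.0):
    """Eq. (24) as written. Each input has shape (B, dim).

    z_t1, z_f1: temporal / frequency representations of the original samples.
    z_t2, z_f2: temporal / frequency representations of the augmented samples.
    """
    def d(a, b):
        # Assumed distance: Euclidean norm per sample, shape (B,)
        return np.linalg.norm(a - b, axis=1)

    sim_t1f1 = d(z_t1, z_f1)
    others = [d(z_t1, z_f2), d(z_t2, z_f1), d(z_t2, z_f2)]
    # Sum over the three reference terms Sim*, then average over the batch
    per_sample = sum(sim_t1f1 - s + delta for s in others)
    return float(per_sample.mean())
```

Under this reading, the loss decreases as the temporal and frequency representations of the original sample move closer to each other relative to the three other temporal-frequency pairings.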

Multi-view CPC extends CPC from a single view to multiple views to explore complex EEG features (Eldele et al., 2021). In this method, weak and strong augmentation methods are designed to construct two views: a jitter-and-scale strategy constructs the weak augmentation $x^{1}_{i}$, while permutation-and-jitter generates the more complex strong augmentation $x^{2}_{i}$. Following the definition of CPC, the integrated features $c^{1}_{i}$ and $c^{2}_{i}$ of the context windows from the two views are generated. A cross-view prediction strategy is implemented, where $c^{1}_{i}$ is used to predict the future windows $z^{2}_{i+1}$ and $c^{2}_{i}$ is used to predict the future windows $z^{1}_{i+1}$, generating the CPC losses $\mathcal{L}_{1,2}$ and $\mathcal{L}_{2,1}$. In addition, a cross-view contextual contrastive strategy is designed to extract discriminative features: $c_{i}^{1}$ and $c_{i}^{2}$, generated from the same sample but in different views, form the positive pair, while other representations form negative pairs. For the samples in batch $\mathcal{B}$, a given sample constructs 1 positive pair and $2|\mathcal{B}|-2$ negative pairs. The loss function is defined as follows:

(25) $\mathcal{L}_{CC}=-\sum_{i=1}^{|\mathcal{B}|}\log\frac{\exp(s(c_{i}^{1},c_{i}^{2})/\tau)}{\sum_{j=1}^{|\mathcal{B}|}\sum_{q=1}^{2}\mathbbm{1}_{[j\neq i]}\exp(s(c_{i}^{1},c_{j}^{q})/\tau)}$

The loss functions are combined as $\mathcal{L}=\lambda_{1}(\mathcal{L}_{1,2}+\mathcal{L}_{2,1})+\lambda_{2}\mathcal{L}_{CC}$, where the weights $\lambda_{1}$ and $\lambda_{2}$ balance the different losses. The multi-view CPC method can mine complex temporal features and learn aligned representations.
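The cross-view contextual contrastive term of Eq. (25) can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes cosine similarity for $s$, assumes the usual logarithm of the InfoNCE form inside the sum, and follows the equation literally in using only the $2|\mathcal{B}|-2$ negatives in the denominator. The function name and default $\tau$ are hypothetical.

```python
import numpy as np

def contextual_contrast_loss(c1, c2, tau=0.2):
    """Cross-view contextual contrast in the form of Eq. (25).

    c1, c2: context vectors of the two views, shape (B, dim), B >= 2.
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

    c1, c2 = normalize(c1), normalize(c2)
    batch_size = c1.shape[0]

    pos = np.sum(c1 * c2, axis=1) / tau   # s(c_i^1, c_i^2) / tau
    sim_11 = c1 @ c1.T / tau              # s(c_i^1, c_j^1) / tau
    sim_12 = c1 @ c2.T / tau              # s(c_i^1, c_j^2) / tau

    # Indicator 1[j != i]: keep only the 2B - 2 negatives of each anchor
    off_diag = ~np.eye(batch_size, dtype=bool)
    neg = (np.exp(sim_11) * off_diag).sum(axis=1) \
        + (np.exp(sim_12) * off_diag).sum(axis=1)

    return float(-np.sum(pos - np.log(neg)))
```

Because the positive pair is excluded from the denominator, the value can be negative; only its relative size matters when comparing well-aligned views with poorly aligned ones.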

The multi-level contrastive method conducts contrastive learning at multiple levels to capture complex signal features (Zhang et al., 2022a). In this framework, an EEG sample is divided into $n_{l}$ segments, $X_{i}=\{x_{1},x_{2},x_{3},\dots,x_{n_{l}}\}$. A CNN encoder $f_{\theta}^{1}$ generates the local representations $Z_{i}=\{z_{1},z_{2},z_{3},\dots,z_{n_{l}}\}$, and a transformer encoder $f_{\theta}^{2}$ generates the contextual representations $R_{i}=\{r_{1},r_{1+k},r_{1+2k},\dots\}$, where each contextual representation $r_{j}$ is generated by integrating $k$ neighboring local representations. Positive and negative sample pairs are constructed using a filter based on EEG rhythm rules. The InfoNCE loss is applied at both the local and contextual levels to extract multi-granular features. Fusing multi-level contrastive learning integrates different aspects of the temporal features of EEG signals into the representation, making it more efficient and expressive.
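To make the two levels concrete, the sketch below shows only the indexing: how a signal is cut into $n_{l}$ segments and how every $k$ neighboring local representations are merged into one contextual representation. Both encoders are deliberately replaced by placeholders (per-segment mean and standard deviation instead of the CNN, mean pooling instead of the transformer), so this illustrates the data flow under those assumptions and not the model of Zhang et al.; all function names are hypothetical.

```python
import numpy as np

def local_representations(x, n_l):
    """Split a 1-D signal into n_l segments and encode each one.

    Placeholder encoder: [mean, std] of the segment stands in for the CNN.
    """
    segments = np.array_split(x, n_l)
    return np.array([[seg.mean(), seg.std()] for seg in segments])  # (n_l, 2)

def contextual_representations(z, k):
    """Merge every k neighboring local representations into one vector.

    Placeholder integration: mean pooling stands in for the transformer.
    Row 0 corresponds to r_1, row 1 to r_{1+k}, row 2 to r_{1+2k}, and so on.
    """
    n_groups = len(z) // k
    return z[: n_groups * k].reshape(n_groups, k, -1).mean(axis=1)

signal = np.arange(12.0)                       # toy single-channel signal
z = local_representations(signal, n_l=6)       # 6 local representations
r = contextual_representations(z, k=2)         # 3 contextual representations
```

A contrastive loss such as InfoNCE would then be applied separately to the rows of `z` and to the rows of `r`.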

Scalp-dipole neural contrastive learning is a knowledge-based cross-view contrastive method for generating general neural representations (Weng et al., 2023). In this framework, two views are constructed according to the neural source (Jackson and Bolger, 2014) of the EEG signal $x_{i}$: the scalp view $sc_{i}$ is constructed as a spatial matrix, indicating the distribution of EEG voltage across the scalp; the dipole view $dp_{i}$ is constructed as an undirected graph, indicating the inner correlation of the dipoles (activated pyramidal cells) that produce the EEG signals. EEG signals augmented through masking and jigsaw are also transformed into the two views. A CNN encoder $f_{\theta}^{c}$ and a graph convolutional encoder $f_{\theta}^{g}$ are designed to generate the scalp and dipole representations from the different views, and two contrastive strategies are proposed: (1) Inner-view contrastive learning aims to extract invariant features in each view. Augmented samples in a specific view are considered positive pairs, and the Barlow Twins loss is used to minimize their distance in the representation space and capture the information that is invariant between augmented samples; (2) Cross-view contrastive learning is based on the theory that EEG representations from different views are homogeneous and contain similar neural information. Therefore, augmentations in the two views generated from the same sample construct positive pairs, and augmentations in the two views generated from different samples construct negative pairs, with the InfoNCE loss used to train the model. Combining the inner-view and cross-view losses extracts both view-specific features and latent neural information, generating general representations that are effective for different EEG-based tasks.

Figure 12. An example of task-oriented EEG contrastive learning, where the contrastive sample pairs are constructed from EEG signals and image data to align the features of the EEG signal with the image features.

5.5. Task-oriented EEG Contrastive Learning

Task-oriented contrastive learning is a specialized framework set up to solve specific tasks. Several contrastive frameworks have been designed for particular tasks: (1) Image-EEG contrastive learning (Song et al., 2023), where an image and the corresponding EEG signals elicited by viewing this image form positive pairs, and the image with other EEG signals forms negative pairs. The process of image-EEG contrastive learning is shown in Figure 12. Image-EEG contrastive learning maximizes the mutual information between the image and its corresponding EEG signal, which can solve the task of EEG image decoding. (2) Speech-EEG contrastive learning is inspired by the contrastive language-image pre-training (CLIP) method (Radford et al., 2021) and forms EEG-speech sample pairs to extract their correlation and solve the EEG speech decoding task (Défossez et al., 2023). (3) Cross-subject contrastive learning aims to address the individual variability of EEG signals. For example, age contrastive learning selects anchor samples from different age groups; samples with a slight difference in age from the anchor sample are used as positive samples to construct positive pairs, while those with a large difference in age are used to form negative pairs. By contrasting negative and positive pairs, the model can capture age-related brain features and improve the generalizability of the generated representations (Wagh et al., 2021).
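The image-EEG pairing described in (1) can be sketched with a CLIP-style batch loss. This is a minimal sketch under assumptions, not the loss of Song et al.: it assumes the embeddings have already been produced by an EEG encoder and an image encoder, uses cosine similarity, and averages the EEG-to-image and image-to-EEG directions symmetrically as CLIP does. The function names and the default temperature are hypothetical.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def image_eeg_contrastive_loss(eeg_emb, img_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    eeg_emb[i] and img_emb[i] come from the same trial (positive pair);
    every other combination in the batch is treated as a negative pair.
    Both inputs have shape (B, dim).
    """
    eeg = eeg_emb / (np.linalg.norm(eeg_emb, axis=1, keepdims=True) + 1e-12)
    img = img_emb / (np.linalg.norm(img_emb, axis=1, keepdims=True) + 1e-12)

    logits = eeg @ img.T / tau          # (B, B) similarity matrix
    idx = np.arange(len(logits))        # positives sit on the diagonal

    loss_eeg_to_img = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_img_to_eeg = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_eeg_to_img + loss_img_to_eeg))
```

The same structure would apply to the speech-EEG case in (2) by replacing the image embeddings with speech embeddings.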

Table 3. Summary of contrastive-based SSL methods for EEG analysis. "CPC" represents contrastive predictive coding, "CL" represents contrastive learning, "Tfm" represents the transformer model, "PT" represents the pre-training and fine-tuning mode, "UT" represents the unsupervised training mode, and "JT" represents the joint-training mode.

| Approach | Sub-category | Detailed method | Backbone | Downstream Tasks | Training Mode |
| --- | --- | --- | --- | --- | --- |
| Clinical EEG SSL (Banville et al., 2021) | Contrastive predictive coding | EEG-based CPC | CNN | Sleep & pathology classification | PT |
| SSL for EEG (Banville et al., 2019) | Contrastive predictive coding | EEG-based CPC | CNN | Sleep classification | PT |
| ContrastWR (Yang et al., 2023) | Transformation-based contrastive | Non-negative CL | CNN | Sleep & pathology classification | UT |
| SSCL for EEG (Jiang et al., 2021) | Transformation-based contrastive | Signal transformation CL | CNN | Sleep classification | PT&UT |
| EEG-CGS (Ho and Armanfard, 2023) | Spatial contrastive | Spatial shuffle CL | GNN | Seizure analysis | PT |
| GMSS (Li et al., 2022a) | Spatial contrastive | Graph-based CL | GNN | Emotion recognition | PT&UT&JT |
| SleepDPC (Xiao et al., 2021) | Contrastive predictive coding | EEG-based CPC | CNN&LSTM | Sleep classification | UT |
| Seq-SimCLR (Mohsenvand et al., 2020) | Transformation-based contrastive | Signal transformation CL | CNN&GRU | Multiple tasks | PT&UT |
| Domain-guide CL (Wagh et al., 2021) | Task-oriented contrastive | Cross-subject CL | CNN | Multiple tasks | PT |
| Multivariate CL (Brüsch et al., 2023) | Spatial contrastive | Graph-based CL | GNN | Sleep classification | PT |
| DS-AGC (Ye et al., 2023) | Spatial contrastive | Graph-based CL | GNN | Emotion recognition | PT |
| ME-MHAC (Guo et al., 2023) | Spatial contrastive | Meiosis-based CL | CNN | Emotion recognition | PT |
| MBrain (Cai et al., 2023) | Contrastive predictive coding | EEG-based CPC | CNN&LSTM | Seizure detection | PT |
| BrainNet (Chen et al., 2022b) | Contrastive predictive coding | Bidirectional CPC | GNN | Seizure detection | JT |
| SleepECL (Zhang et al., 2022a) | Composite contrastive | Multi-level CL | Transformer | Sleep classification | UT |
| TS-TCC (Eldele et al., 2021) | Composite contrastive | Multi-view CPC | Transformer | Sleep & seizure detection | PT&UT |
| CoSleep (Ye et al., 2021) | Contrastive predictive coding | EEG-based CPC | CNN | Sleep classification | UT |
| SLAM-EEG (Xiao et al., 2024) | Transformation-based contrastive | Transformation-based CL | ViT | Seizure detection | PT |
| SPP-EEGNET (Li and Metsis, 2022) | Transformation-based contrastive | Transformation-based CL | CNN | Multiple tasks | PT |
| DSSNet (Chang et al., 2022) | Contrastive predictive coding | EEG-based CPC | CNN&RNN | Sleep classification | UT |
| TF-C (Zhang et al., 2022c) | Composite contrastive | Frequency-temporal CL | CNN&Tfm | Sleep classification | UT |
| TS-MoCo (Hallgarten et al., 2023) | Transformation-based contrastive | Transformation-based CL | Transformer | Emotion recognition | PT&UT |
| MV-EEG (Hojjati, 2023) | Composite contrastive | Frequency-temporal CL | Transformer | Pathology detection | UT |
| PSN-Sleep (You et al., 2023) | Transformation-based contrastive | Non-negative CL | CNN | Sleep classification | UT |
| MulEEG (Kumar et al., 2022) | Composite contrastive | Frequency-temporal CL | CNN | Sleep classification | UT |
| MI-SSLEEG (Han et al., 2021) | Transformation-based contrastive | Transformation-based CL | CNN | Motor imagery | JT |
| SA-EEG (Cheng et al., 2020) | Transformation-based contrastive | Transformation-based CL | CNN | Motor imagery | UT |
| MtCLSS (Wang et al., 2023) | Transformation-based contrastive | Transformation-based CL | CNN | Sleep classification | UT |
| Multi-channel CL (Gao et al., [n. d.]) | Transformation-based contrastive | Transformation-based CL | CNN | Sleep & pathology classification | UT |
| SGMC (Kan et al., 2023) | Spatial contrastive | Meiosis-based CL | CNN | Emotion recognition | PT |
| CLISA (Shen et al., 2022) | Task-oriented contrastive | Cross-subject CL | CNN | Emotion recognition | UT |
| KDC (Weng et al., 2023) | Composite contrastive | Scalp-dipole neural CL | CNN&GNN | Multiple tasks | PT&UT |
| NICE-EEG (Song et al., 2023) | Task-oriented contrastive | Image-signal CL | ViT&GNN | Image-decoding | UT |
| AAD (Défossez et al., 2023) | Task-oriented contrastive | Speech-signal CL | CNN&LSTM | Speech-decoding | UT |

5.6. Section Discussion

In this section, various contrastive-based EEG analysis methods are comprehensively reviewed. The contrastive-based frameworks are categorized into five sub-categories: 1. the contrastive predictive coding method, which integrates prediction and contrastive tasks to capture temporal information; 2. transformation contrastive learning, which extracts signal-related invariant features; 3. the spatial contrastive method, which captures spatial channel correlation; 4. the composite contrastive method, which conducts multi-view contrastive learning to extract spatial-temporal-spectral features; 5. the task-oriented contrastive method, which constructs a specialized framework for specific tasks. Compared to other SSL methods for EEG analysis, contrastive-based methods are the most effective: they use fewer parameters and simpler pretext tasks, yet generate representations with stronger generalization and higher information density. Contrastive methods rely on augmentation techniques: well-designed sample pairs help the model integrate critical neural knowledge, whereas arbitrarily chosen sample pairs may yield counterproductive results.

6. Hybrid SSL EEG analysis method

The hybrid SSL EEG analysis method combines various pretext tasks to jointly train the model to learn complex knowledge or information. The idea of multi-task learning (Zhang and Yang, 2021; Kendall et al., 2018) has been applied in hybrid SSL methods: a common encoder $f_{\theta}$ extracts features and generates the representation from the EEG signal, and different pretext task decoders $\{g_{\delta}^{p_{1}},g_{\delta}^{p_{2}},\dots\}$ solve multiple pretext tasks. The losses from the different tasks are fused to train the model, so the shared encoder can leverage the advantages of each task and obtain a representation that encompasses more knowledge and is more expressive. The combination of multi-task losses with weights $\lambda_{i}$ can be described as follows:

(26) $\mathcal{L}_{mt}=\sum_{i=1}^{t_{n}}\lambda_{i}\mathcal{L}_{i}$
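Eq. (26) amounts to a weighted sum over the pretext-task losses. The short sketch below shows this; the function name and the example loss values and weights are hypothetical and serve only as illustration.

```python
def multitask_loss(task_losses, weights):
    """Weighted sum of pretext-task losses as in Eq. (26)."""
    if len(task_losses) != len(weights):
        raise ValueError("one weight per pretext task is required")
    return sum(w * loss for w, loss in zip(weights, task_losses))

# Hypothetical values: a predictive-task loss and a contrastive-task loss
# computed from the same shared encoder output.
total = multitask_loss(task_losses=[0.8, 1.5], weights=[1.0, 0.5])
```

In practice the weights are hyperparameters, and their choice affects how strongly each pretext task shapes the shared encoder.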

Table 4 lists the existing hybrid EEG SSL methods. Existing studies use different combinations of pretext tasks to generate representations. Many methods combine predictive and contrastive tasks (Banville et al., 2021, 2019): the decoders predict the applied transformations and construct negative and positive pairs for contrastive learning, capturing critical discriminative information and invariant features. Another method combines generative and contrastive tasks (Ho and Armanfard, 2023) to explore the local correlations and globally coherent features of the EEG signal. Although hybrid SSL methods can capture complex features through multiple tasks, the gradient interference caused by training several tasks jointly may reduce the effectiveness of the generated representation. These methods therefore require careful selection of correlated pretext tasks to avoid interference between tasks.
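
One common way to check for gradient interference between two pretext tasks is to compare the directions of their gradients with respect to the shared encoder parameters. The sketch below is an illustrative diagnostic rather than a procedure from the reviewed papers: the function names, the zero threshold, and the toy gradient vectors are all assumptions. A negative cosine similarity indicates that the two tasks pull the shared parameters in opposing directions.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # a zero gradient neither helps nor conflicts
    return dot / (n1 * n2)


def tasks_conflict(g1, g2, threshold=0.0):
    """Flag interference when the gradients point in opposing directions."""
    return cosine_similarity(g1, g2) < threshold


# Toy gradients of two pretext-task losses w.r.t. the shared encoder
# parameters (hypothetical values, flattened into plain lists).
grad_predictive = [1.0, 0.0, 2.0]
grad_contrastive = [-1.0, 0.5, -2.0]
conflict = tasks_conflict(grad_predictive, grad_contrastive)
```

Monitoring such a score during training gives a rough signal of whether a chosen pair of pretext tasks is sufficiently correlated.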

Table 4. Summary of hybrid self-supervised learning methods for EEG analysis.

Approach | Pretext category | Backbone | Downstream tasks | Training mode
Clinical EEG SSL (Banville et al., 2021) | Predictive task / contrastive task | CNN | Sleep and pathology classification | PT
SSL for sleep EEG (Banville et al., 2019) | Predictive task / contrastive task | CNN | Sleep and pathology classification | PT
EEG-CGS (Ho and Armanfard, 2023) | Generative task / contrastive task | CNN | Seizure analysis | PT
GMSS (Li et al., 2022a) | Predictive task / contrastive task | GNN | Emotion recognition | PT&UT
MBrain (Cai et al., 2023) | Predictive task / contrastive task | CNN&LSTM | Seizure detection | PT
MtCLSS (Li et al., 2022b) | Predictive task / contrastive task | CNN | Sleep classification | UT

7. Practical downstream tasks

SSL EEG analysis methods have been applied to various EEG-based tasks. Table 5 lists the datasets used for the EEG-based downstream tasks. The practical downstream tasks are as follows:

Emotion recognition is the task of decoding emotional states from EEG signals collected by non-invasive electrodes. Traditional emotion recognition methods combine machine learning with hand-crafted features to predict discrete emotions from EEG, while recent methods use end-to-end deep models to predict continuous emotion scores (Weng et al., 2022). The labels of the training samples are derived from subjective rating scales or from the type of stimuli that elicited the signals, which may introduce significant bias into model training. Combining SSL with the emotion recognition task can mitigate label shift, and the learned representation can improve task performance in low-label scenarios.

Motor imagery is the task of decoding the mental simulation of a movement that is not physically performed (Tangermann et al., 2012). The subject mentally rehearses or imagines a specific motor action, such as moving the left limb, moving the right limb, or executing a complex physical activity. The decoded imagined patterns can serve as the control signal in a brain-computer interface (BCI), for example to control an exoskeleton for the disabled (Choi et al., 2020). EEG-based motor imagery recognition methods have been widely investigated. The challenges in motor imagery lie in difficult labeling and significant subject variability, which can be addressed by combining different pretext tasks in SSL.

Pathology detection is one of the most crucial clinical tasks for EEG-based applications. This task aims to recognize mental or neural diseases of the brain from EEG signals. Deep models are used to detect seizures, autism spectrum disorder, and other disorders from EEG signals (Chen et al., 2023). However, clinical applications of EEG signals demand high-density training data and expert knowledge to label the samples, which introduces substantial challenges in data collection. The SSL framework can reduce the number of labeled EEG samples required and can incorporate medical knowledge through pretext tasks, which is significant for the development and improvement of EEG-based clinical detection.

Table 5. Summary of the datasets that have been used in SSL EEG analysis, where the symbol '-' denotes information that is missing for the dataset.

Dataset | Subject number | Sampling rate | EEG channels | Task | Label | Auxiliary data
Physionet Challenge 2018 (Ghassemi et al., 2018; Goldberger et al., 2000) | 1983 | 200 Hz | 6 | Sleep classification | Wake, N1, N2, N3, REM | EMG, EOG etc.
TUH abnormal (López et al., 2017) | 2329 | 250, 256, 512 Hz | 27 to 36 | Abnormal detection | Normal, Abnormal | -
Sleep EDFx (PhysioBank, 2000; Kemp et al., 2000) | 83 | 100 Hz | 2 | Sleep classification | Wake, N1, N2, N3, REM | Breathe, ERP
MASS (O’reilly et al., 2014) | 62 | 256 Hz | 20 | Sleep classification | Wake, N1, N2, N3, REM | EOG, EMG, ECG
MMI (Schalk et al., 2004) | 105 | 160 Hz | 64 | Motor imagery | Rest, MI(left), MI(right) | -
BCIC (Tangermann et al., 2012) | 9 | 250 Hz | 22 | Motor imagery | MI(l), MI(r), MI(f), MI(t) | EOG
Mayo-UPenn Seizure Dataset (Temko et al., 2015) | 4 | 400 Hz | 16 | Seizure detection | Normal, Abnormal | Dog signal
SHHS dataset (Quan et al., 1997) | - | 125 Hz | 14 | Sleep classification | Wake, N1, N2, N3, REM | EOG, Heart
MGH Sleep (Biswal et al., 2018) | - | 200 Hz | 6 | Sleep classification | Wake, N1, N2, N3, REM | -
Sleep EDF (PhysioBank, 2000; Kemp et al., 2000) | 20 | 100 Hz | 2 | Sleep classification | Wake, N1, N2, N3, REM | EOG, EMG, ERP
Dreem Open Dataset (Goldberger et al., 2000) | 80 | 250 Hz | 8, 12 | Sleep classification | Wake, N1, N2, N3, REM | -
DEAP (Koelstra et al., 2011) | 32 | 512 Hz | 32 | Emotion recognition | Arousal, Valence, Dominance | Video, EOG, EMG
TUSZ (Shah et al., 2018) | over 300 | - | 19 | Seizure detection | Different seizure types | -
SEED (Zheng and Lu, 2015; Duan et al., 2013) | 15 | 200 Hz | 62 | Emotion recognition | Negative, Neutral, Positive | -
SEED-IV (Zheng et al., 2018) | 15 | 200 Hz | 62 | Emotion recognition | Happy, Neutral, Sad, Fear | -
MPED (Song et al., 2019) | 23 | 1000 Hz | 62 | Emotion recognition | Different discrete emotions | ECG, ESR, RSP
KU-MI (Lee et al., 2019) | 52 | 1000 Hz | 62 | Motor imagery | MI(left), MI(right) | EMG
ISRUC (Khalighi et al., 2016) | 100, 8, 10 | 200 Hz | 6 | Sleep classification | Wake, N1, N2, N3, REM | Multiple signals
SparrKULee (Bollens et al., 2023) | 85 | 8192 Hz | 64 | Speech decoding | Speech signal | -
CHB-MIT (Shoeb, 2009) | 24 | 256 Hz | 24-26 | Seizure detection | Seizure, Non-seizure | -
MPI-LEMON (Babayan et al., 2018) | 216 | 2500 Hz | 62 | None | Resting states | MRI, ECG etc.
Visual object (Gifford et al., 2022) | 10 | 1000 Hz | 64 | Image decoding | Object label | -
MAHNOB-HCI (Soleymani et al., 2011) | 27 | 256 Hz | 32 | Emotion recognition | Arousal, Valence, Dominance | Multiple signals
SEEG (Chen et al., 2022b) | - | 1000 or 2000 Hz | 52 to 124 | Seizure detection | Seizure, Non-seizure | ECoG
Epilepsy Dataset (Andrzejak et al., 2001) | 500 | 173 Hz | 19 | Seizure detection | Five seizure labels | -
AMIGOS (Miranda-Correa et al., 2018) | 40 | 128 Hz | 14 | Emotion recognition | Arousal, Valence, Dominance | ECG, GSR
DREAMER (Katsigiannis and Ramzan, 2017) | 23 | 128 Hz | 14 | Emotion recognition | Arousal, Valence, Dominance | ECG
ASD dataset (Chen et al., 2023) | 4899 | 250 Hz | 20 to 129 | AS-Disorder | ASD labels | -
NMT scalp dataset (Khan et al., 2022) | - | 250 Hz | 19 | Pathology detection | Normal, Abnormal | -
CUHZ (Peng et al., 2022) | 25 | 500, 698, 1000 Hz | 22 | Seizure detection | Different seizure types | -

Sleep stage classification is the task of classifying sleep EEG signals into different stages. The criteria for sleep stage classification are proposed by the American Academy of Sleep Medicine (AASM), which divides sleep EEG signals into five stages (Dement and Kleitman, 1957): the W stage is wakefulness, the N1 and N2 stages (non-REM) are light sleep, the N3 stage is deep sleep, and the REM stage is rapid eye movement sleep (Aserinsky and Kleitman, 1953). Temporal models are used to capture the temporal correlations and differences between sleep stages in order to accurately identify the stage to which an EEG sample belongs.
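
To make the data layout of this task concrete, the sketch below splits a continuous single-channel recording into fixed-length epochs and maps the five stage names to integer class labels. The 30-second epoch length follows common AASM scoring practice and is an assumption here, as are the function name, the label ordering, and the toy recording.

```python
# Integer class labels for the five stages (the ordering is an arbitrary choice).
STAGE_TO_LABEL = {"W": 0, "N1": 1, "N2": 2, "N3": 3, "REM": 4}


def split_into_epochs(signal, sampling_rate_hz, epoch_seconds=30):
    """Cut a 1-D recording into non-overlapping epochs, dropping the remainder."""
    samples_per_epoch = int(sampling_rate_hz * epoch_seconds)
    n_epochs = len(signal) // samples_per_epoch
    return [
        signal[i * samples_per_epoch:(i + 1) * samples_per_epoch]
        for i in range(n_epochs)
    ]


# Toy single-channel recording: 95 s sampled at 100 Hz (all zeros for brevity).
recording = [0.0] * (95 * 100)
epochs = split_into_epochs(recording, sampling_rate_hz=100)

# Hypothetical expert annotations, one stage name per complete epoch.
annotations = ["W", "N1", "N2"]
labels = [STAGE_TO_LABEL[stage] for stage in annotations]
```

Each `(epoch, label)` pair is then one sample for a temporal model; during self-supervised pre-training only the epochs are used and the labels are withheld.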

Speech/image decoding is the complex task of decoding image or speech information from EEG signals. It involves translating brain activity patterns recorded by EEG into meaningful visual and speech information, which can help to understand the neural mechanisms of vision and audition in the brain (Song et al., 2023). Inspired by visual question answering (Radford et al., 2021), which aligns text and image patches through SSL to extract semantic information, SSL can align EEG signals with image or speech patches to capture their inner correlation and improve task performance.
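
One plausible form of such an alignment objective is a symmetric contrastive loss over paired embeddings, in which the i-th EEG embedding should be most similar to the i-th image (or speech) embedding in the batch. The sketch below is written under that assumption; the function names, the temperature value, and the toy two-dimensional embeddings are illustrative and not drawn from a specific reviewed method.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0.0 else list(v)


def _log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def alignment_loss(eeg_embs, other_embs, temperature=0.1):
    """Symmetric cross-entropy over cosine similarities of paired embeddings."""
    eeg = [_normalize(v) for v in eeg_embs]
    oth = [_normalize(v) for v in other_embs]
    n = len(eeg)
    logits = [
        [sum(a * b for a, b in zip(eeg[i], oth[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Row i must select column i (EEG -> other), and vice versa after transposing.
    loss_e2o = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    loss_o2e = -sum(_log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_e2o + loss_o2e)


# Toy batch of two pairs: matched pairs point in the same direction.
eeg_batch = [[1.0, 0.0], [0.0, 1.0]]
images_matched = [[2.0, 0.0], [0.0, 3.0]]
images_swapped = [[0.0, 3.0], [2.0, 0.0]]
loss_matched = alignment_loss(eeg_batch, images_matched)
loss_swapped = alignment_loss(eeg_batch, images_swapped)
```

The loss is small when each EEG embedding lies closest to its own partner and large when the pairing is broken, which is the property that drives the two encoders toward a shared embedding space.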

Corresponding datasets have been proposed to train models for the practical downstream tasks mentioned above. For the emotion recognition and motor imagery tasks, existing datasets such as SEED (Zheng and Lu, 2015) and MMI (Schalk et al., 2004) contain EEG signals with more than 30 channels, from which fine-grained spatial correlations can be extracted. In contrast, the datasets for the sleep stage classification task contain fewer channels but longer time windows, and temporal information is critical for sleep classification. In addition, the pathology datasets contain more subjects to ensure that general EEG features can be extracted for clinical applications.

8. Future directions

In the EEG analysis field, combining deep models with SSL frameworks can improve model performance on various EEG-based tasks by additionally pre-training the parameters on unlabeled EEG samples with well-designed pretext tasks. Beyond the advantages of EEG-based SSL frameworks, we analyze the challenges in existing EEG-based SSL studies and propose potential future directions to address them.

Signal-oriented pretext task. Most existing pretext tasks are straightforward extensions of pretext tasks in CV and NLP. They treat EEG signals as 2D matrices or temporal vectors, like image or text patches, to capture spatial and contextual correlations, but ignore the intrinsic characteristics of EEG signals. Therefore, designing EEG-oriented pretext tasks that extract spatial-temporal-frequency EEG features is a feasible approach worth further exploration.

Knowledge-driven SSL framework. Although SSL frameworks have achieved significant success in various EEG-based tasks, the lack of a theoretical foundation and of neural knowledge about EEG signals leaves the generated representations short on generalization and interpretability. Integrating EEG-based neural knowledge with the SSL framework to construct knowledge-driven, interpretable EEG models is therefore another important direction. It requires designing specific pretext tasks and augmentation techniques that fuse explainable neural knowledge into the representation. We believe that integrating knowledge of EEG into the self-supervised framework can bring generalization and interpretability to the learned representations.

Graph-based SSL. Deep learning models such as CNNs and RNNs have been widely used to extract spatial-temporal features from EEG signals for different tasks. However, most existing methods ignore the inherent topological connections among electrodes. EEG signals are generated by the activity of neurons that are topologically connected inside the brain. Graph neural networks (GNNs) can explore the inherent connectivity patterns among neurons, and we believe that GNN-based EEG SSL methods can integrate richer latent brain information into representations, offering a new perspective for information expression.
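
As a minimal illustration of how electrode topology can enter a model, the sketch below treats electrodes as graph nodes and applies one step of feature propagation with a symmetrically normalized adjacency matrix with self-loops, a common building block of graph convolutional networks. The three-electrode chain, the scalar node features, and the function names are hypothetical; learnable weights and nonlinearities are omitted.

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a non-negative adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # at least 1 because of the self-loop
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def propagate(adj_norm, features):
    """Mix each node's features with those of its neighbours (no learned weights)."""
    n = len(adj_norm)
    dim = len(features[0])
    return [
        [sum(adj_norm[i][k] * features[k][f] for k in range(n)) for f in range(dim)]
        for i in range(n)
    ]


# Toy montage: three electrodes connected in a chain 0 - 1 - 2.
adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
# One scalar feature per electrode, e.g. a band power (hypothetical values).
node_features = [[1.0], [0.0], [0.0]]
smoothed = propagate(normalized_adjacency(adjacency), node_features)
```

After one step, the activity at electrode 0 has spread to its neighbour 1 but not to electrode 2, which shows how the graph structure constrains which channels exchange information.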

SSL for Heterogeneous EEG. The ultimate goal of SSL for EEG analysis is to generate general representations for various downstream tasks. However, EEG signals are collected in multiple scenarios that differ in channel, device, sampling rate, task, subject, and distribution. The significant differences between EEG signals from different sources make collaborative self-supervised training challenging. Therefore, constructing SSL frameworks tailored for heterogeneous EEG data is an important direction for future development. Work in this direction can use heterogeneous EEG samples from multiple sources to jointly pre-train the model, fully exploiting existing, diverse EEG datasets to mine universal representations for different downstream tasks.
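
A simple baseline for handling two of these differences, sampling rate and channel set, is to resample every recording to a common rate and keep only the channels shared by all sources. The sketch below illustrates that baseline under stated assumptions: the record layout (`rate_hz`, `channels`), the function names, and the toy values are hypothetical, and linear interpolation stands in for a proper resampler. It is not a substitute for the tailored frameworks the text calls for, since discarding channels loses information.

```python
def resample_linear(signal, src_rate_hz, dst_rate_hz):
    """Resample a 1-D signal by linear interpolation.

    When downsampling, a real pipeline should low-pass filter first to
    avoid aliasing; that step is omitted in this sketch.
    """
    n_out = int(len(signal) * dst_rate_hz / src_rate_hz)
    out = []
    for k in range(n_out):
        pos = k * src_rate_hz / dst_rate_hz          # fractional source index
        i = min(int(pos), len(signal) - 1)
        frac = pos - i
        nxt = signal[min(i + 1, len(signal) - 1)]    # hold the last sample at the edge
        out.append(signal[i] * (1.0 - frac) + nxt * frac)
    return out


def common_channels(recordings):
    """Sorted names of the channels present in every recording."""
    names = set(recordings[0]["channels"])
    for rec in recordings[1:]:
        names &= set(rec["channels"])
    return sorted(names)


def harmonize(recordings, target_rate_hz):
    """Keep shared channels only and bring them to one sampling rate."""
    shared = common_channels(recordings)
    return [
        {ch: resample_linear(rec["channels"][ch], rec["rate_hz"], target_rate_hz)
         for ch in shared}
        for rec in recordings
    ]


# Two toy recordings from hypothetical sources with different rates and montages.
rec_a = {"rate_hz": 4, "channels": {"Fz": [0.0, 1.0, 2.0, 3.0], "Cz": [1.0, 1.0, 1.0, 1.0]}}
rec_b = {"rate_hz": 2, "channels": {"Cz": [0.0, 2.0], "Pz": [5.0, 5.0]}}
harmonized = harmonize([rec_a, rec_b], target_rate_hz=2)
```

After harmonization both recordings expose the same channel names at the same rate, so they can be batched together for joint pre-training.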

Multimodal SSL. SSL for EEG signals is a unimodal approach that aims to extract neural information from unlabeled EEG samples. However, the features mined from EEG signals alone are difficult to adapt to some complex downstream tasks, which require other brain or physiological signals to provide richer information. Therefore, EEG-based multimodal self-supervised learning methods need further study to extract integrated and aligned features from unlabeled multimodal signals (ECG, EMG, EOG, etc.) for challenging downstream tasks.

9. Conclusion

This paper presents a comprehensive review of self-supervised learning for EEG analysis, covering a taxonomy, the different kinds of existing EEG-based SSL methods, downstream EEG tasks, and the available training datasets, and offers detailed guidelines for researchers interested in combining deep learning with EEG analysis. We first review typical SSL frameworks and pretext tasks in CV and NLP and introduce traditional supervised EEG analysis methods as preliminaries, to illustrate the drawbacks of supervised EEG analysis and underscore the necessity of introducing SSL for EEG analysis. We then provide a detailed exposition of four categories of SSL frameworks for EEG analysis, elucidating the technical details of representative methods for extracting spatial-temporal-frequency features from EEG signals. Subsequently, we enumerate the EEG-based downstream tasks for which SSL frameworks are effective and present relevant EEG datasets suitable for pre-training or downstream fine-tuning. Finally, we discuss the challenges in existing studies and propose new insights and potential future directions that warrant exploration, which can help generate more general and explainable representations for solving various complex downstream tasks.

References

  • Accou et al. (2023) Bernd Accou, Tom Francart, et al. 2023. Self-supervised enhancement of stimulus-evoked brain response data. arXiv preprint arXiv:2302.01924 (2023).
  • Aguiar-Conraria and Soares (2014) Luís Aguiar-Conraria and Maria Joana Soares. 2014. The continuous wavelet transform: Moving beyond uni-and bivariate analysis. Journal of Economic Surveys 28, 2 (2014), 344–375.
  • Al-Quraishi et al. (2018) Maged S Al-Quraishi, Irraivan Elamvazuthi, Siti Asmah Daud, S Parasuraman, and Alberto Borboni. 2018. EEG-based control for upper and lower limb exoskeletons and prostheses: A systematic review. Sensors 18, 10 (2018), 3342.
  • Alotaiby et al. (2014) Turkey N Alotaiby, Saleh A Alshebeili, Tariq Alshawi, Ishtiaq Ahmad, and Fathi E Abd El-Samie. 2014. EEG seizure detection and prediction algorithms: a survey. EURASIP Journal on Advances in Signal Processing 2014 (2014), 1–21.
  • Altaheri et al. (2023) Hamdi Altaheri, Ghulam Muhammad, Mansour Alsulaiman, Syed Umar Amin, Ghadir Ali Altuwaijri, Wadood Abdul, Mohamed A Bencherif, and Mohammed Faisal. 2023. Deep learning techniques for classification of electroencephalogram (EEG) motor imagery (MI) signals: A review. Neural Computing and Applications 35, 20 (2023), 14681–14722.
  • Andrzejak et al. (2001) Ralph G Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian E Elger. 2001. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E 64, 6 (2001), 061907.
  • Aserinsky and Kleitman (1953) Eugene Aserinsky and Nathaniel Kleitman. 1953. Regularly occurring periods of eye motility, and concomitant phenomena, during sleep. Science 118, 3062 (1953), 273–274.
  • Babayan et al. (2018) A Babayan, M Erbey, D Kumral, JD Reinelt, AMF Reiter, J Röbbig, H Lina Schaare, M Uhlig, A Anwander, PL Bazin, et al. 2018. Data descriptor: a mind-brain-body dataset of MRI, EEG, cognition, emotion, and peripheral physiology in young and old adults. Sci. Data 6, 180308.
  • Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33 (2020), 12449–12460.
  • Bagchi and Mitra (2012) Sonali Bagchi and Sanjit K Mitra. 2012. The nonuniform discrete Fourier transform and its applications in signal processing. Vol. 463. Springer Science & Business Media.
  • Balconi and Lucchiari (2006) Michela Balconi and Claudio Lucchiari. 2006. EEG correlates (event-related desynchronization) of emotional face elaboration: a temporal analysis. Neuroscience letters 392, 1-2 (2006), 118–123.
  • Banville et al. (2019) Hubert Banville, Isabela Albuquerque, Aapo Hyvärinen, Graeme Moffat, Denis-Alexander Engemann, and Alexandre Gramfort. 2019. Self-supervised representation learning from electroencephalography signals. In 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 1–6.
  • Banville et al. (2021) Hubert Banville, Omar Chehab, Aapo Hyvärinen, Denis-Alexander Engemann, and Alexandre Gramfort. 2021. Uncovering the structure of clinical EEG signals with self-supervised learning. Journal of Neural Engineering 18, 4 (2021), 046020.
  • Bhat and Hortal (2021) Sudhanva Bhat and Enrique Hortal. 2021. Gan-based data augmentation for improving the classification of eeg signals. In The 14th pervasive technologies related to assistive environments conference. 453–458.
  • Biswal et al. (2018) Siddharth Biswal, Haoqi Sun, Balaji Goparaju, M Brandon Westover, Jimeng Sun, and Matt T Bianchi. 2018. Expert-level sleep scoring with deep neural networks. Journal of the American Medical Informatics Association 25, 12 (2018), 1643–1650.
  • Bollens et al. (2023) Lies Bollens, Bernd Accou, Marlies Gillis, Wendy Verheijen, Tom Francart, et al. 2023. SparrKULee: A Speech-evoked Auditory Response Repository of the KU Leuven, containing EEG of 85 participants. (2023).
  • Boostani et al. (2017) Reza Boostani, Foroozan Karimzadeh, and Mohammad Nami. 2017. A comparative review on sleep stage classification methods in patients and healthy individuals. Computer methods and programs in biomedicine 140 (2017), 77–91.
  • Bos et al. (2006) Danny Oude Bos et al. 2006. EEG-based emotion recognition. The influence of visual and auditory stimuli 56, 3 (2006), 1–17.
  • Brüsch et al. (2023) Thea Brüsch, Mikkel N Schmidt, and Tommy S Alstrøm. 2023. Multi-view self-supervised learning for multivariate variable-channel time series. In 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 1–6.
  • Cai et al. (2023) Donghong Cai, Junru Chen, Yang Yang, Teng Liu, and Yafeng Li. 2023. MBrain: A Multi-channel Self-Supervised Learning Framework for Brain Signals. arXiv preprint arXiv:2306.13102 (2023).
  • Chang et al. (2022) Shuohua Chang, Zhihong Yang, Yuyang You, and Xiaoyu Guo. 2022. Dssnet: A deep sequential sleep network for self-supervised representation learning based on single-channel eeg. IEEE Signal Processing Letters 29 (2022), 2143–2147.
  • Chen et al. (2023) He Chen, Ouyang Gaoxiang, and Xiaoli Li. 2023. Extracting Temporal-Spectral-Spatial Representation of EEG Using Self-Supervised Learning for the Identification of Children with ASD. In 2023 IEEE 13th International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER). IEEE, 1263–1266.
  • Chen et al. (2022b) Junru Chen, Yang Yang, Tao Yu, Yingying Fan, Xiaolong Mo, and Carl Yang. 2022b. Brainnet: Epileptic wave detection from seeg with hierarchical graph diffusion learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2741–2751.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607.
  • Chen et al. (2022a) Xun Chen, Chang Li, Ai** Liu, Martin J McKeown, Ruobing Qian, and Z Jane Wang. 2022a. Toward open-world electroencephalogram decoding via deep learning: A comprehensive survey. IEEE Signal Processing Magazine 39, 2 (2022), 117–134.
  • Cheng et al. (2020) Joseph Y Cheng, Hanlin Goh, Kaan Dogrusoz, Oncel Tuzel, and Erdrin Azemi. 2020. Subject-aware contrastive learning for biosignals. arXiv preprint arXiv:2007.04871 (2020).
  • Chien et al. (2022) Hsiang-Yun Sherry Chien, Hanlin Goh, Christopher M Sandino, and Joseph Y Cheng. 2022. MAEEG: Masked Auto-encoder for EEG Representation Learning. arXiv preprint arXiv:2211.02625 (2022).
  • Choi et al. (2020) Junhyuk Choi, Keun Tae Kim, Ji Hyeok Jeong, Laehyun Kim, Song Joo Lee, and Hyungmin Kim. 2020. Developing a motor imagery-based real-time asynchronous hybrid BCI controller for a lower-limb exoskeleton. Sensors 20, 24 (2020), 7309.
  • Cimtay and Ekmekcioglu (2020) Yucel Cimtay and Erhan Ekmekcioglu. 2020. Investigating the use of pretrained convolutional neural network on cross-subject and cross-dataset EEG emotion recognition. Sensors 20, 7 (2020), 2034.
  • Craik et al. (2019) Alexander Craik, Yongtian He, and Jose L Contreras-Vidal. 2019. Deep learning for electroencephalogram (EEG) classification tasks: a review. Journal of neural engineering 16, 3 (2019), 031001.
  • Creswell et al. (2018) Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. 2018. Generative adversarial networks: An overview. IEEE signal processing magazine 35, 1 (2018), 53–65.
  • Das et al. (2022) Sudip Das, Pankaj Pandey, and Krishna Prasad Miyapuram. 2022. Improving self-supervised pretraining models for epileptic seizure detection from EEG data. arXiv preprint arXiv:2207.06911 (2022).
  • Défossez et al. (2023) Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, and Jean-Rémi King. 2023. Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence (2023), 1–11.
  • Dement and Kleitman (1957) William Dement and Nathaniel Kleitman. 1957. Cyclic variations in EEG during sleep and their relation to eye movements, body motility, and dreaming. Electroencephalography and clinical neurophysiology 9, 4 (1957), 673–690.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • Du et al. (2015) Nguyen Duy Du, Nguyen Hoang Huy, and Nguyen Xuan Hoai. 2015. The impact of high dimensionality on SVM when classifying ERP data-a solution from LDA. In Proceedings of the 6th International Symposium on Information and Communication Technology. 32–37.
  • Duan et al. (2013) Ruo-Nan Duan, Jia-Yi Zhu, and Bao-Liang Lu. 2013. Differential entropy feature for EEG-based emotion classification. In 6th International IEEE/EMBS Conference on Neural Engineering (NER). IEEE, 81–84.
  • Eldele et al. (2021) Emadeldeen Eldele, Mohamed Ragab, Zhenghua Chen, Min Wu, Chee Keong Kwoh, Xiaoli Li, and Cuntai Guan. 2021. Time-series representation learning via temporal and contextual contrasting. arXiv preprint arXiv:2106.14112 (2021).
  • Ericsson et al. (2022) Linus Ericsson, Henry Gouk, Chen Change Loy, and Timothy M Hospedales. 2022. Self-supervised representation learning: Introduction, advances, and challenges. IEEE Signal Processing Magazine 39, 3 (2022), 42–62.
  • Fahimi et al. (2019) Fatemeh Fahimi, Zhuo Zhang, Wooi Boon Goh, Kai Keng Ang, and Cuntai Guan. 2019. Towards EEG generation using GANs for BCI applications. In 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI). IEEE, 1–4.
  • Fu et al. (2022) Ruiqi Fu, Yi-Feng Chen, Yongqi Huang, Shu** Chen, Feiyan Duan, Jiewei Li, Jianhui Wu, Dongmei Jiang, Junling Gao, Jason Gu, et al. 2022. Symmetric convolutional and adversarial neural network enables improved mental stress classification from EEG. IEEE Transactions on Neural Systems and Rehabilitation Engineering 30 (2022), 1384–1400.
  • Gao et al. ([n. d.]) Wei Gao, Zhengqing Hu, Yu Lei, Changming Wang, Fangbing Qiu, Yanqing Liu, and Lin Han. [n. d.]. A Multi-Channel Sleep Staging Method Based on Self-Supervised Learning. Available at SSRN 4580453 ([n. d.]).
  • Ge et al. (2021) Wendong Ge, ** **g, Sungtae An, Aline Herlopian, Marcus Ng, Aaron F Struck, Brian Appavu, Emily L Johnson, Gamaleldin Osman, Hiba A Haider, et al. 2021. Deep active learning for interictal ictal injury continuum EEG patterns. Journal of neuroscience methods 351 (2021), 108966.
  • Ghassemi et al. (2018) Mohammad M Ghassemi, Benjamin E Moody, Li-Wei H Lehman, Christopher Song, Qiao Li, Haoqi Sun, Roger G Mark, M Brandon Westover, and Gari D Clifford. 2018. You snooze, you win: the physionet/computing in cardiology challenge 2018. In 2018 Computing in Cardiology Conference (CinC), Vol. 45. IEEE, 1–4.
  • Gidaris et al. (2018) Spyros Gidaris, Praveer Singh, and Nikos Komodakis. 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018).
  • Gifford et al. (2022) Alessandro T Gifford, Kshitij Dwivedi, Gemma Roig, and Radoslaw M Cichy. 2022. A large and rich EEG dataset for modeling human visual object recognition. NeuroImage 264 (2022), 119754.
  • Goldberger et al. (2000) Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. 2000. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. circulation 101, 23 (2000), e215–e220.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in neural information processing systems 27 (2014).
  • Gotman (1982) Jean Gotman. 1982. Automatic recognition of epileptic seizures in the EEG. Electroencephalography and clinical Neurophysiology 54, 5 (1982), 530–540.
  • Gramfort et al. (2021) Alexandre Gramfort, Hubert Banville, Omar Chehab, Aapo Hyvärinen, and Denis Engemann. 2021. Learning with self-supervision on EEG data. In 2021 9th International Winter Conference on Brain-Computer Interface (BCI). IEEE, 1–2.
  • Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33 (2020), 21271–21284.
  • Guo et al. (2023) Yunfei Guo, Tao Zhang, and Wu Huang. 2023. Emotion recognition based on multi-modal electrophysiology multi-head attention Contrastive Learning. arXiv preprint arXiv:2308.01919 (2023).
  • Hallgarten et al. (2023) Philipp Hallgarten, David Bethge, Ozan Özdenizci, Tobias Grosse-Puppendahl, and Enkelejda Kasneci. 2023. TS-MoCo: Time-Series Momentum Contrast for Self-Supervised Physiological Representation Learning. In 2023 31st European Signal Processing Conference (EUSIPCO). IEEE, 1030–1034.
  • Han et al. (2021) **pei Han, Xiao Gu, and Benny Lo. 2021. Semi-supervised contrastive learning for generalizable motor imagery eeg classification. In 2021 IEEE 17th International Conference on Wearable and Implantable Body Sensor Networks (BSN). IEEE, 1–4.
  • Harpale and Bairagi (2016) Varsha K Harpale and Vinayak K Bairagi. 2016. Time and frequency domain analysis of EEG signals for seizure detection: A review. In 2016 International Conference on Microelectronics, Computing and Communications (MicroCom). IEEE, 1–6.
  • He et al. (2022) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16000–16009.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738.
  • Henaff (2020) Olivier Henaff. 2020. Data-efficient image recognition with contrastive predictive coding. In International conference on machine learning. PMLR, 4182–4192.
  • Hinton and Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. science 313, 5786 (2006), 504–507.
  • Hjelm et al. (2018) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018).
  • Ho and Armanfard (2023) Thi Kieu Khanh Ho and Narges Armanfard. 2023. Self-supervised learning for anomalous channel detection in EEG graphs: application to seizure analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 7866–7874.
  • Hojjati (2023) Amirabbas Hojjati. 2023. A Multi-View Self-Supervised Approach to Learn Representations of EEG Data for Downstream Prediction Tasks. Master’s thesis. NTNU.
  • Hosseini et al. (2020) Mohammad-Parsa Hosseini, Amin Hosseini, and Kiarash Ahi. 2020. A review on machine learning for EEG signal processing in bioengineering. IEEE reviews in biomedical engineering 14 (2020), 204–218.