Multimodal Physiological Signals Representation Learning via Multiscale Contrasting for Depression Recognition

Kai Shao (Huazhong University of Science and Technology, Wuhan, China), Rui Wang (Huazhong University of Science and Technology, Wuhan, China), Yixue Hao (Huazhong University of Science and Technology, Wuhan, China), Long Hu (Huazhong University of Science and Technology, Wuhan, China), Min Chen (South China University of Technology, Guangzhou, China), and Hans Arno Jacobsen (University of Toronto, Toronto, Canada)
(2024)
Abstract.

Depression recognition based on physiological signals such as functional near-infrared spectroscopy (fNIRS) and electroencephalogram (EEG) has made considerable progress. However, most existing studies ignore the complementarity and semantic consistency of multimodal physiological signals under the same stimulation task in complex spatio-temporal patterns. In this paper, we introduce a multimodal physiological signals representation learning framework using Siamese architecture via multiscale contrasting for depression recognition (MRLMC). First, fNIRS and EEG are transformed into different but correlated data based on a time-domain data augmentation strategy. Then, we design a spatio-temporal contrasting module to learn the representation of fNIRS and EEG through weight-sharing multiscale spatio-temporal convolution. Furthermore, to enhance the learning of semantic representation associated with stimulation tasks, a semantic consistency contrast module is proposed, aiming to maximize the semantic similarity of fNIRS and EEG. Extensive experiments on publicly available and self-collected multimodal physiological signals datasets indicate that MRLMC outperforms the state-of-the-art models. Moreover, our proposed framework is capable of transferring to multimodal time series downstream tasks.

Depression Recognition, Multimodal Physiological Signals, Spatio-temporal Contrasting, Semantic Consistency
Conference: Proceedings of the 32nd ACM International Conference on Multimedia; October 28-November 1, 2024; Melbourne, Australia
CCS Concepts: Computing methodologies → Artificial intelligence; Computing methodologies → Cognitive science; Human-centered computing → HCI design and evaluation methods

1. Introduction

Depression is a common mental disorder, which is different from regular mood changes and feelings about everyday life. Characterized by persistent feelings of sadness, lack of interest, social withdrawal, diminished social skills, and even physical symptoms such as dizziness and nausea, depression significantly affects various aspects of life, including relationships with family, friends, and the community, as well as work and study efficiency (Taquet et al., 2021; Malhi and Mann, 2018; Khazanov et al., 2020). It is estimated that about 3.8% of the population are experiencing depression (Organization, 2023) and more than 700 thousand people die due to suicide every year (Organization, 2022).

The first recurrence rate of depression reaches 50% and repeated attacks significantly increase the disability rate. The significant factors affecting diagnosis and treatment are lack of resources and trained healthcare personnel. In addition, the inability to make an accurate assessment is another factor affecting effective treatment. Thus, it is urgent to enhance the accuracy of depression recognition and assessment at early stages, aiming to diminish both recurrence and disability rates. Currently, the recognition and assessment of depression mainly depend on experienced doctors performing clinical diagnosis based on professional scales such as the Patient Health Questionnaire (PHQ-9) (Kroenke and Spitzer, 2002) and Beck Depression Inventory (BDI-II) (He et al., 2022b), as well as biomarker data. However, with the increasing number of patients, early detection is often limited and time-consuming, subject to individual subjective observation, and lacking in real-time measurement. Recent strides in brain science have provided critical insights for depression diagnosis (Kayalvizhi et al., 2023; Pethuraj et al., 2023; Vai et al., 2020; Wei et al., 2021; ** and Li, 2023), with techniques like electroencephalogram (EEG) (Altaheri et al., 2022; Shen et al., 2022; Muhammad et al., 2020; Gong et al., 2023) and functional near-infrared spectroscopy (fNIRS) (Ruotsalo et al., 2023; Chao et al., 2021; Zhu et al., 2020; Zheng et al., 2020) becoming increasingly prominent due to their safety, portability, affordability, temporal precision, and minimal environmental demands. Therefore, it is necessary to explore an automatic depression recognition method based on physiological signals to assist the clinical diagnosis of doctors and accelerate the treatment for patients (Muhammad et al., 2021; Hossain et al., 2022, 2023).

The wide collection and analysis of multimodal physiological signals such as fNIRS and EEG provide more potential to combine them for mental disease recognition. The distinct sampling mechanisms of fNIRS and EEG pose challenges for direct fusion at the data level, leading to a predominant focus on feature-level fusion strategies in recent research. For example, Pietro et al. employed EEG and fNIRS to classify the four symptoms of Alzheimer’s disease, achieving higher accuracy than single-modal experiments by integrating their complementary characteristics (Cicalese et al., 2020). Shin et al. utilized typical eigenvalue scores and a common spatial pattern method to fuse the fNIRS and EEG features (Shin et al., 2018). Similarly, Qiu et al. proposed a multimodal feature-level fusion method, achieving good results in the classification of brain activity induced by preferred music and neutral music (Qiu et al., 2022). Furthermore, Zhang et al. designed a feature fusion method based on a spatio-temporal alignment strategy, which significantly improved classification in the motor imagery paradigm compared to the non-aligned method (Zhang et al., 2023a). However, focusing only on feature-level fusion for EEG and fNIRS, which are time series, makes it easy to ignore the spatio-temporal representation and multimodal complementary features. Moreover, the existing studies have not considered the deep semantic information reflected by physiological signals under specific stimulation tasks, such as the activation status of brain regions.

To address the above issues, we propose a Multimodal physiological signals Representation Learning framework via Multiscale Contrasting for depression recognition (MRLMC). This framework employs the Siamese network architecture, which utilizes two encoders with the same structure and shared weights to process different modalities. Specifically, first, fNIRS and EEG are fed into a time-domain data augmentation module to generate different but correlated data. This ensures that MRLMC learns the two types of augmented feature representation of the data. Then, we design a multiscale spatio-temporal convolution (MSC) module to learn the spatio-temporal representation and dynamic characteristics of multimodal physiological signals. The spatio-temporal contrasting module aims to minimize the differences in fNIRS and EEG feature representations while enhancing their complementary nature. Furthermore, we propose a semantic consistency module to further mine the deep semantic information such as the activation status of brain regions. It aims to maximize the semantic similarity of multimodal physiological signals. In summary, the main contributions of this paper include:

  • We propose a multimodal physiological signals representation learning framework using Siamese network architecture via multiscale contrasting for depression recognition. This framework presents a novel approach to handling multimodal physiological signals and provides an objective auxiliary diagnosis.

  • We design a spatio-temporal contrasting module to learn the spatio-temporal representation and dynamic characteristics. Additionally, we propose a semantic consistency module to further learn the semantic consistency representation under stimulation tasks.

  • Extensive experiments are performed on publicly available and self-collected multimodal physiological signals datasets to validate the effectiveness of the MRLMC framework. The results show the superiority of the proposed method for the advancement of depression recognition.

2. Related Work

For EEG-based depression recognition research, Rajendra et al. proposed a convolutional network for EEG data with 15 normal controls and 15 depression patients to perform depression classification and found that the signal in the right hemisphere is more active than the signal in the left hemisphere (Acharya et al., 2018). Shah et al. proposed a NeuCube model based on a pulse network to classify depression and normal controls by neural circuit connections based on EEG signals (Shah et al., 2019). Uddin et al. captured the symptom information by combining recurrent neural networks (RNN) with long short-term memory (LSTM) (Uddin et al., 2022). Recently, Hashempour et al. proposed a hybrid convolutional and temporal-convolutional neural network to continuously estimate the BDI score to achieve depression detection (Hashempour et al., 2022). Peng et al. constructed attentive simple graph convolution network and transformer neural network for depression detection and characterized the alteration of relevant neural patterns in the depressed patients (Peng et al., 2023).

For fNIRS-based depression recognition research, Liu et al. focused on stimulation tasks to investigate the advantages of fNIRS in cognitive activation (Liu et al., 2021). Based on the extracted physiological features, a support vector machine classifier based on LSTM for fNIRS was designed to perform classification tasks. fNIRS data reliably reflect cognitive profiles of the brain in different stimulation tasks (Midha et al., 2021; Rocco et al., 2021), and present signal differences at different stimulation task time points (Yu et al., 2020). Wang et al. proposed a transformer-based fNIRS classification network to explore spatial-level and channel-level representations of fNIRS signals to improve data utilization and feature representation (Wang et al., 2022). Similarly, Zhang et al. achieved mild cognitive impairment recognition by exploiting the multidimensional features of fNIRS data including channel, temporal, and spatial features (Zhang et al., 2023b). Wang et al. transformed fNIRS signals into 2-D wavelet feature maps to diagnose depressive disorder (Wang et al., 2023). However, the works mentioned above ignore the nonlinear and segment characteristics of EEG and fNIRS. In addition, ignoring the dynamic characteristics and semantic representation of neural activity under stimulation tasks results in weak classification performance.

Figure 1. The overview of the MRLMC framework. The MRLMC adopts the Siamese network architecture, composed of multimodal signals input, a spatio-temporal contrasting module and a semantic consistency module.

There are many brain-computer studies on multimodal recognition tasks based on fNIRS and EEG, but less research on multimodal depression recognition. He et al. proposed a multimodal multitask neural network model to fuse the EEG and fNIRS signals to achieve motor imagery classification (He et al., 2022a). Gao et al. utilized an EEG-informed fNIRS general linear model to extract common spatial pattern features, with a support vector machine used as the classifier (Gao et al., 2023). In contrast, we establish a multimodal contrastive learning framework based on the Siamese network architecture. fNIRS and EEG are fed into the spatio-temporal contrasting module and semantic consistency module to extract complementary features, dynamic features and semantic consistency representations to realize multimodal depression recognition.

3. Methodology

In this section, we describe the components of the MRLMC framework in detail. As shown in Figure 1, the MRLMC framework adopts the Siamese network architecture to learn the feature representations of fNIRS and EEG signals. Specifically, we first utilize the time-domain data augmentation method to generate different but correlated data. Then, we design a spatio-temporal contrasting module to extract the feature representation and dynamic characteristics of the physiological signals. Finally, a deep semantic representation of fNIRS and EEG signals is obtained through the semantic consistency module. This multimodal semantic representation is then fused and fed into the classification layer to realize depression recognition.

Figure 2. The input modes of multimodal signals in MRLMC, including single modal mode and multimodal mode.

3.1. Multimodal Signals Input Modes

The collection of fNIRS and EEG data involves stringent conditions, which present challenges due to limited medical resources and the prevalent stigma associated with patients. Therefore, in scenarios with limited data, the data augmentation method plays an important role, and it is also a key part of realizing single-modal contrastive learning. As shown in Figure 2, when only a single modality (either fNIRS or EEG) is available, the raw and augmented data are used as pairs. When the input is both fNIRS and EEG, they are formed into a pair of data, with the data augmentation strategy randomly applied to part of the data. The data augmentation method for physiological data should take into account both the collection paradigm and the process. Since the physiological data for patients with depression are mostly collected based on specific stimulation tasks, the methods of time warping and time masking (Shao et al., 2023) are utilized to generate different but correlated data.

Given a sample $x$, the time step of the time masking method is $[t_0, t_0 + t_{tm}]$, where $t_0 \in [0, t_q)$, and the masking parameter $t_{tm} \in (0, \lambda]$, $\lambda \leq t_q$, introducing an upper bound such that the width of the time masking cannot be larger than the response time of each question of the stimulation task. Similarly, the time step of the time warping method is $[t_0, t_0 + t_{tw}]$, where $t_0 \in [0, t_q)$ and the warping parameter $t_{tw} \in (0, \lambda]$, $\lambda \leq t_q$. The augmented data is denoted as $x'$, which has the same time scale as $x$.
Formally, let $D_{Multi} = \{F_i, E_i\}^{N}$ be a dataset of fNIRS and EEG, such that each fNIRS sample $F_i$ corresponds to EEG sample $E_i$. For each input sample $F_i$ and $E_i$, we denote the augmented data as $x_i$ and $y_i$, and $D = \{x_i, y_i\}^{N}$ as the input data. In the case of single modal inputs, $x_i$ and $y_i$ denote raw data and augmented data respectively, and $N$ is the number of samples. Multimodal data is fed into the spatio-temporal contrasting module to extract latent representation.
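
The two augmentations described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the choice to zero out masked samples, and the power-law remapping used for warping are all assumptions, and $t_q$ and $\lambda$ are expressed in samples rather than seconds.

```python
import numpy as np


def time_mask(x, t_q, lam, rng):
    """Mask a window [t0, t0 + t_tm] on every channel of x (shape: channels x T).

    Assumption: masked samples are set to zero; the text does not specify the fill value.
    """
    n_ch, T = x.shape
    assert 0 < lam <= t_q <= T
    t_tm = int(rng.integers(1, lam + 1))  # t_tm in (0, lam]
    t0 = int(rng.integers(0, t_q))        # t0 in [0, t_q)
    out = x.copy()
    out[:, t0:min(t0 + t_tm, T)] = 0.0
    return out


def time_warp(x, t_q, lam, rng):
    """Warp a window [t0, t0 + t_tw] while keeping the overall length T unchanged.

    Assumption: positions inside the window are remapped by u -> u**p, which
    stretches one part of the window and compresses the other.
    """
    n_ch, T = x.shape
    assert 0 < lam <= t_q <= T
    t_tw = int(rng.integers(1, lam + 1))  # t_tw in (0, lam]
    t0 = int(rng.integers(0, t_q))        # t0 in [0, t_q)
    end = min(t0 + t_tw, T)
    width = end - t0
    out = x.copy()
    if width < 3:                         # too short to warp meaningfully
        return out
    grid = np.linspace(0.0, 1.0, width)
    power = rng.uniform(0.5, 2.0)
    src = grid ** power * (width - 1)     # monotone remap, window endpoints stay fixed
    idx = np.arange(width)
    for c in range(n_ch):
        out[c, t0:end] = np.interp(src, idx, x[c, t0:end])
    return out
```

Both functions return an array with the same shape as the input, matching the statement that $x'$ has the same time scale as $x$.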

Figure 3. The overview of multiscale spatio-temporal convolutional (MSC) network. The input raw data or augmented data undergoes a convolution layer to generate embedding. Then, the spatio-temporal representation is extracted by multiscale convolution.

3.2. Spatio-temporal Contrasting

Physiological signals, as a kind of multichannel time series data, are characterized by spatio-temporal features that are the most important kind of representation. Specific stimulation tasks are usually performed to collect physiological signals. When the participants are handling stimulation tasks, the status of the brain is transformed from a resting state to an activated state. Regarding the time dimension, physiological signals have dynamic changing characteristics. Meanwhile, the prefrontal areas of the individual brain are associated with emotional expression, and different channels have similar but different characteristics. Therefore, we design a spatio-temporal contrasting module, as shown in Figure 1, which utilizes the contrastive loss to minimize the differences between fNIRS and EEG feature representations and maximize complementarity through extracting the spatio-temporal representations of raw data and augmented data. Figure 3 presents the MSC network, which extracts the spatio-temporal representation and dynamic characteristics of physiological signals.

Given an input signal $x$, its dimension is $N_{Channel} \times T$, where $N_{Channel}$ is the number of channels of data and $T$ is the collection duration, which is determined by the collection device and the data type. Then, $x$ is fed into the encoder to get the latent representation. The encoder based on the convolution layer maps $x$ into a latent representation $C = f_{enc}(x)$, $C \in \mathbb{R}^{d}$, where $d$ is the dimension of the feature. Thus, we get $C$ for the feature representation of a physiological signal, which is then fed into the multiscale convolution layers. $C$ is passed to the $N_{Scale}$-layer multiscale convolution to extract high-dimensional representations $C^{enc}$. Then, the representations are fed into $N_{Scale}$ spatio-temporal feature extraction blocks $f_{block}(\cdot)$ to extract the spatio-temporal representation of physiological signals. Finally, we get the spatio-temporal representation $v$ of a physiological signal,

$$v = \mathrm{Concat}(\varphi_1, \varphi_2, \cdots, \varphi_{N_{Scale}}), \tag{1}$$

where

$$\varphi_i = \max\big(\alpha \ast \mathrm{Norm}(f_{block}(C^{enc}_i)),\ \mathrm{Norm}(f_{block}(C^{enc}_i))\big), \tag{2}$$

which simplifies to $v = [\varphi_1, \varphi_2, \cdots, \varphi_{N_{Scale}}]$, $v \in \mathbb{R}^{m}$, where $m = N_{Scale} \times N_{Out}$ is the dimension of the feature, $N_{Out}$ is the output dimension of the spatio-temporal feature extraction blocks, and $\alpha$ is the control weight.
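
Equations (1) and (2) can be illustrated with a small NumPy sketch. Several pieces here are assumptions made only so the example runs: $f_{block}$ is stood in for by a single 1-D convolution followed by temporal averaging, $\mathrm{Norm}$ is taken to be layer normalization, and the kernel sizes are hypothetical. With $0 < \alpha < 1$, Equation (2) behaves like a leaky rectifier applied to the normalized block output.

```python
import numpy as np


def conv1d_valid(x, w):
    """Valid 1-D cross-correlation. x: (c_in, T), w: (c_out, c_in, k) -> (c_out, T - k + 1)."""
    k = w.shape[2]
    windows = np.lib.stride_tricks.sliding_window_view(x, k, axis=1)  # (c_in, T - k + 1, k)
    return np.einsum("ctk,ock->ot", windows, w)


def layer_norm(h, eps=1e-5):
    """Assumed form of Norm(.): zero mean and unit variance over the feature vector."""
    return (h - h.mean()) / np.sqrt(h.var() + eps)


def multiscale_representation(c, weights, alpha=0.01):
    """Build v = Concat(phi_1, ..., phi_{N_Scale}) from one embedding c of shape (c_in, T).

    `weights` holds one convolution kernel per scale, each of shape (N_Out, c_in, k_i).
    """
    phis = []
    for w in weights:
        h = conv1d_valid(c, w).mean(axis=1)    # hypothetical f_block: conv + temporal mean -> (N_Out,)
        h = layer_norm(h)
        phis.append(np.maximum(alpha * h, h))  # Eq. (2)
    return np.concatenate(phis)                # Eq. (1), length N_Scale * N_Out
```

For three scales with $N_{Out} = 8$, the returned vector has $m = 3 \times 8 = 24$ entries, in line with $m = N_{Scale} \times N_{Out}$.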

Through the spatio-temporal contrasting module, the multimodal data generate spatio-temporal representations $v$ and $u$, where $u$ is generated from the other modality or from augmented data. Given a batch of $N$ input samples ($N = batch\_size$), we get $2N$ items from fNIRS and EEG. For each $v$, we denote the corresponding item $u^{+}$ as its positive sample, and thus $(v, u^{+})$ is considered the positive pair. The other $(2N-2)$ items in the same batch are considered negative samples for $v$, so $v$ forms negative pairs with the $(2N-2)$ negative samples. Therefore, we can define the spatio-temporal contrasting loss to maximize the similarity between positive pairs and the difference between negative pairs.

Given the $v$ and $u$ items, we compare the similarity of the positive pair $(v, u^{+})$ with the similarity of the $(2N-2)$ negative pairs. The spatio-temporal contrasting loss $\mathcal{L}_{MSC}$ is defined as follows:

$$\mathcal{L}_{MSC} = -\log \frac{\exp(sim(v, u^{+})/\tau)}{\exp(sim(v, u^{+})/\tau) + \sum_{j=1}^{2N-2} \exp(sim(v, u_j)/\tau)}, \tag{3}$$

where $\tau$ is a temperature parameter and $sim(\cdot)$ denotes cosine similarity,

$$sim(v, u) = \frac{v^{T} u}{\left\| v \right\| \left\| u \right\|}. \tag{4}$$

Through the spatio-temporal contrasting loss $\mathcal{L}_{MSC}$, the differences between the feature representations of fNIRS and EEG can be minimized, which also maximizes the complementarity of these two representations. The spatio-temporal representations are then fed into the semantic consistency module to further learn deep semantic information.
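
A batch version of Equations (3) and (4) can be sketched in NumPy as below. The function name `msc_loss`, the default temperature, and the choice to average the loss over all $2N$ anchors (each $v$ and each $u$ taking a turn as the anchor) are assumptions for illustration; the text defines the loss for a single anchor only.

```python
import numpy as np


def msc_loss(v, u, tau=0.5):
    """Contrastive loss over a batch. v, u: (N, m) arrays, row i of v pairs with row i of u."""
    n = v.shape[0]
    z = np.concatenate([v, u], axis=0)                # the 2N items in the batch
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit norm, so dot product = cosine similarity (Eq. 4)
    logits = z @ z.T / tau                            # (2N, 2N) scaled similarities
    np.fill_diagonal(logits, -np.inf)                 # an item is never compared with itself
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])  # index of each anchor's positive
    # Each row now holds 1 positive and 2N - 2 negatives, as in the denominator of Eq. (3).
    row_max = logits.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    log_numerator = logits[np.arange(2 * n), pos]
    return float(np.mean(log_denominator - log_numerator))
```

When paired rows of `v` and `u` point in similar directions the loss is small, and it grows as the positive pair becomes no more similar than the negatives.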

3.3. Semantic Consistency

Patients with depression are characterized by persistent low mood, pleasure deficit, and cognitive impairment, and they differ from the control group in brain activity level and activation state when performing the stimulation task (Shao et al., 2023). fNIRS and EEG reflect the brain activation state by detecting slight changes in brain activity, so it is necessary to mine deeper semantic information that can reflect the brain activation state. We propose a semantic consistency module to maximize the semantic similarity of multimodal physiological signals and further mine deep semantic information such as the brain activation state.

Figure 4. The architecture of transformer unit in semantic consistency module.

We utilize the transformer unit as the semantic feature extraction model because of its context-awareness. The architecture of the transformer unit is shown in Figure 4, which mainly consists of successive blocks of multi-head attention (MHAttn) and MLP. The MLP block consists of two fully connected layers and a non-linear ReLU. The transformer unit is defined by the following equations:

(5) MHAttn(Q,K,V)=Concat(Head1,Head2,,HeadNHead)𝒲O𝑀𝐻𝐴𝑡𝑡𝑛𝑄𝐾𝑉𝐶𝑜𝑛𝑐𝑎𝑡𝐻𝑒𝑎subscript𝑑1𝐻𝑒𝑎subscript𝑑2𝐻𝑒𝑎subscript𝑑subscript𝑁𝐻𝑒𝑎𝑑superscript𝒲𝑂MHAttn(Q,K,V)=Concat(Head_{1},Head_{2},\cdots,Head_{N_{Head}})\mathcal{W}^{O}italic_M italic_H italic_A italic_t italic_t italic_n ( italic_Q , italic_K , italic_V ) = italic_C italic_o italic_n italic_c italic_a italic_t ( italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_H italic_e italic_a italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) caligraphic_W start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT

where Q𝑄Qitalic_Q represents the input feature vector, K𝐾Kitalic_K represents the key vector, V𝑉Vitalic_V denotes the value vector, NHeadsubscript𝑁𝐻𝑒𝑎𝑑N_{Head}italic_N start_POSTSUBSCRIPT italic_H italic_e italic_a italic_d end_POSTSUBSCRIPT represents the number of heads, and 𝒲Osuperscript𝒲𝑂\mathcal{W}^{O}caligraphic_W start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT denotes the final output weights. Headi𝐻𝑒𝑎subscript𝑑𝑖Head_{i}italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is defined as follows:

(6) Headi=Attention(Q𝒲iQ,K𝒲iK,V𝒲iV)𝐻𝑒𝑎subscript𝑑𝑖𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛𝑄subscriptsuperscript𝒲𝑄𝑖𝐾subscriptsuperscript𝒲𝐾𝑖𝑉subscriptsuperscript𝒲𝑉𝑖Head_{i}=Attention(Q\mathcal{W}^{Q}_{i},K\mathcal{W}^{K}_{i},V\mathcal{W}^{V}_% {i})italic_H italic_e italic_a italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_A italic_t italic_t italic_e italic_n italic_t italic_i italic_o italic_n ( italic_Q caligraphic_W start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_K caligraphic_W start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_V caligraphic_W start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )

where 𝒲iQ,𝒲iK,𝒲iVsubscriptsuperscript𝒲𝑄𝑖subscriptsuperscript𝒲𝐾𝑖subscriptsuperscript𝒲𝑉𝑖\mathcal{W}^{Q}_{i},\mathcal{W}^{K}_{i},\mathcal{W}^{V}_{i}caligraphic_W start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , caligraphic_W start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , caligraphic_W start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denote the weight matrics of Q,K,V𝑄𝐾𝑉Q,K,Vitalic_Q , italic_K , italic_V, respectively. Attention()𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛Attention(\cdot)italic_A italic_t italic_t italic_e italic_n italic_t italic_i italic_o italic_n ( ⋅ ) is define as

(7) Attention(Q,K,V)=softmax(QKTdK)V𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛𝑄𝐾𝑉𝑠𝑜𝑓𝑡𝑚𝑎𝑥𝑄superscript𝐾𝑇subscript𝑑𝐾𝑉Attention(Q,K,V)=softmax(\frac{QK^{T}}{\sqrt{d_{K}}})Vitalic_A italic_t italic_t italic_e italic_n italic_t italic_i italic_o italic_n ( italic_Q , italic_K , italic_V ) = italic_s italic_o italic_f italic_t italic_m italic_a italic_x ( divide start_ARG italic_Q italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_ARG end_ARG ) italic_V

where dKsubscript𝑑𝐾d_{K}italic_d start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT denotes the dimentional size of vector K𝐾Kitalic_K. Given the spatio-temporal representations v𝑣vitalic_v, we pass it through the transformer unit as follows:

(8) ψi=MHAttn(Norm(vi1))+ψi1,1iNTrans,formulae-sequencesubscript𝜓𝑖𝑀𝐻𝐴𝑡𝑡𝑛𝑁𝑜𝑟𝑚subscript𝑣𝑖1subscript𝜓𝑖11𝑖subscript𝑁𝑇𝑟𝑎𝑛𝑠\psi_{i}=MHAttn(Norm(v_{i-1}))+\psi_{i-1},1\leq i\leq N_{Trans},italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_M italic_H italic_A italic_t italic_t italic_n ( italic_N italic_o italic_r italic_m ( italic_v start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT ) ) + italic_ψ start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT , 1 ≤ italic_i ≤ italic_N start_POSTSUBSCRIPT italic_T italic_r italic_a italic_n italic_s end_POSTSUBSCRIPT ,

and then the ψisubscript𝜓𝑖\psi_{i}italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is input to the MLP block:

(9) zi=MLP(Norm(ψi))+ψi,1iNTrans,formulae-sequencesubscript𝑧𝑖𝑀𝐿𝑃𝑁𝑜𝑟𝑚subscript𝜓𝑖subscript𝜓𝑖1𝑖subscript𝑁𝑇𝑟𝑎𝑛𝑠z_{i}=MLP(Norm(\psi_{i}))+\psi_{i},1\leq i\leq N_{Trans},italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_M italic_L italic_P ( italic_N italic_o italic_r italic_m ( italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) + italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , 1 ≤ italic_i ≤ italic_N start_POSTSUBSCRIPT italic_T italic_r italic_a italic_n italic_s end_POSTSUBSCRIPT ,

where $N_{Trans}$ denotes the number of stacked layers used to generate the final feature $z$.
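The attention and residual steps in Eqs. (7)-(9) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it uses a single head, omits the learned projections $\mathcal{W}^{Q}_{i}$, $\mathcal{W}^{K}_{i}$, $\mathcal{W}^{V}_{i}$, uses a layer normalisation without learned scale or shift, and assumes that the normalised input and the residual term in Eq. (8) refer to the same tensor. The helper names (`softmax`, `attention`, `layer_norm`, `add`, `transformer_unit`) and the `mlp` callable are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Eq. (7): softmax(Q K^T / sqrt(d_K)) V, with matrices as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance (assumed form of Norm)."""
    normed = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((a - mean) ** 2 for a in row) / len(row)
        normed.append([(a - mean) / math.sqrt(var + eps) for a in row])
    return normed

def add(x, y):
    """Element-wise sum of two equally shaped matrices (the residual connection)."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def transformer_unit(x, mlp):
    """One pre-norm layer: attention sub-block, then MLP sub-block."""
    h = layer_norm(x)
    psi = add(attention(h, h, h), x)        # Eq. (8), single head, no projections
    return add(mlp(layer_norm(psi)), psi)   # Eq. (9)
```

Stacking `transformer_unit` $N_{Trans}$ times, with each output fed to the next call, would correspond to the layer index $i$ in the equations; the output keeps the shape of the input, which matches the statement that the semantic features have the same dimensions as the spatio-temporal representations.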

Given the multimodal spatio-temporal representations $v$ and $u$, a multilayer stacked transformer unit is used to extract the semantic features $z^{f}$ and $z^{e}$. The dimensions of $z^{f}$ and $z^{e}$ are the same as those of $v$ and $u$. We use cosine similarity as the semantic consistency loss to maximize the semantic similarity of the multimodal physiological signals. The semantic consistency loss is defined as follows:

(10) $\mathcal{L}_{SC}=\mathrm{sim}(z^{f},z^{e}).$
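As a small illustration of the similarity term in Eq. (10), the sketch below computes the cosine similarity between two flattened feature vectors. The function names and the `eps` guard are assumptions. The text states the goal of maximizing similarity but does not give the sign convention used during optimisation, so `consistency_term`, which returns one minus the similarity so that smaller values mean closer features, is one hypothetical way to turn the similarity into a quantity to minimise.

```python
import math

def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)  # eps avoids division by zero

def consistency_term(z_f, z_e):
    """Hypothetical minimisable variant: 0 when aligned, 2 when opposite."""
    return 1.0 - cosine_similarity(z_f, z_e)
```

For example, two features pointing in the same direction give a similarity of 1 regardless of their magnitudes, which is why cosine similarity compares the direction of the fNIRS and EEG features rather than their scale.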

3.4. Depression Recognition

Ultimately, $z^{f}$ and $z^{e}$ are concatenated and fed into two fully connected layers with a ReLU layer for multimodal depression recognition. Datasets collected in real healthcare scenarios suffer from class imbalance, so the focal loss function is used for depression recognition, which is defined as follows:

(11) $\mathcal{L}_{FL}=-\alpha(1-P)^{\gamma}\log(P),$

where $P$ denotes the probability predicted by the model, $\alpha$ is the weighting factor that balances positive and negative samples, and $\gamma$ is an adjustable parameter. The adjustment factor $(1-P)^{\gamma}$ adapts to the difficulty of each sample. For samples that are easy to classify, $P$ is larger, so the adjustment factor tends to zero. Such samples therefore contribute less to the loss, prompting the model to focus on samples that are difficult to classify. The overall loss combines the spatio-temporal contrasting loss, the semantic consistency loss, and the classification loss as follows:

(12) $\mathcal{L}=\lambda_{1}\mathcal{L}_{MSC}+\lambda_{2}\mathcal{L}_{SC}+\mathcal{L}_{FL},$

where $\lambda_{1}$ and $\lambda_{2}$ are fixed scalar hyperparameters denoting the relative weight of each loss.
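The behaviour of the focal loss in Eq. (11) and the weighted sum in Eq. (12) can be checked numerically with the sketch below. The values `alpha=0.25` and `gamma=2.0` are commonly used defaults for focal loss and are assumptions here, as are the default weights `lambda1=1.0` and `lambda2=1.0`; this section does not report the values actually used. `p` is taken to be the probability the model assigns to the true class.

```python
import math

def focal_loss(p, alpha=0.25, gamma=2.0, eps=1e-12):
    """Eq. (11): -alpha * (1 - p)^gamma * log(p) for one sample."""
    p = min(max(p, eps), 1.0)  # keep log(p) finite
    return -alpha * (1.0 - p) ** gamma * math.log(p)

def total_loss(l_msc, l_sc, l_fl, lambda1=1.0, lambda2=1.0):
    """Eq. (12): weighted sum of the three loss terms."""
    return lambda1 * l_msc + lambda2 * l_sc + l_fl

# An easy sample (p close to 1) is down-weighted far more than a hard one.
easy = focal_loss(0.9)
hard = focal_loss(0.1)
```

With `gamma=0` and `alpha=1` the expression reduces to the ordinary cross-entropy term `-log(p)`, which makes the role of the adjustment factor easy to isolate when tuning.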

4. Experiments

This section first presents the datasets and implementation details. We then conduct extensive experiments to validate the effectiveness of the MRLMC framework.

4.1. Datasets Description

To evaluate the performance of our proposed method, we conduct a series of experiments on two datasets.

The MODMA dataset (Li et al., 2018) is publicly available; we use only its event-related EEG data, which cover 53 participants (24 outpatients diagnosed with depression and 29 healthy controls). EEG signals were recorded during a dot-probe stimulation task. The dot-probe task uses facial pictures from the standardized native Chinese Facial Affective Picture System (Lu et al., 2005), which are classified by valence into four sets: fear, sad, happy, and neutral. Any two facial images of different valences appear on the screen together. During the experiment, participants were asked to focus on the screen and view the images freely. When the dot appeared, they were asked to press the button quickly and accurately without moving their body, including the head and legs, and with as few unnecessary eye movements, glances and blinks as possible. Continuous EEG signals were recorded using a 128-channel device at a sampling frequency of 250 Hz.

Figure 5. The channel locations of fNIRS and EEG. Orange marks the 16 NIR emitters, blue the 16 NIR receivers, green the 53 fNIRS channels, and purple the 16 EEG channels.

The fNIRS-EEG dataset is a self-collected multimodal physiological signals dataset that includes fNIRS and EEG signals. We use a verbal fluency stimulation task to record data, including 96 participants (79 depression patients and 17 healthy controls) for fNIRS only and 64 participants (52 depression patients and 12 healthy controls) for both fNIRS and EEG. During data collection, doctors helped participants wear the device and ensured that the probes were tightly attached to the scalp until the channel pass rate reached 80%. The entire stimulation task includes a pre-task silent period, a task period and a post-task silent period. During the silent periods, participants were required to sit up straight in front of the computer, remain calm, and keep their bodies still. During the task period, three questions appear on the computer screen, and participants are asked to name the fruits, appliances and vegetables they associate with the questions. As shown in Figure 5, the near-infrared device used in this study has 16 near-infrared (NIR) emission probes and 16 receiving probes, which form a total of 53 channels. The detector emits near-infrared light at 690 nm and 830 nm. Throughout the test period, the NIR device collected the intensity of the emitted light at the two wavelengths at a sampling frequency of 100 Hz. Each participant therefore has data of size $150\times100\times53\times2$, where 150 is the duration of the test, 100 is the sampling frequency, 53 is the number of channels and 2 is the number of wavelengths. The EEG device used in this study has 16 channels, and the electrodes are placed according to the 10-20 lead system standard. The EEG device collects electrical signals at a sampling frequency of 1000 Hz, giving data of size $150\times1000\times16$ for each participant, where 150 is the duration of the test, 1000 is the sampling frequency, and 16 is the number of channels.
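The per-participant data sizes quoted above follow from duration times sampling frequency times channels. The short sketch below reproduces that arithmetic; it assumes the test duration of 150 is given in seconds, and the function names `recording_shape` and `count_values` are hypothetical.

```python
def recording_shape(duration_s, sampling_hz, n_channels, n_wavelengths=None):
    """Return (time samples, channels[, wavelengths]) for one recording."""
    shape = (duration_s * sampling_hz, n_channels)
    if n_wavelengths is not None:
        shape += (n_wavelengths,)
    return shape

def count_values(shape):
    """Total number of scalar values in an array of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

fnirs_shape = recording_shape(150, 100, 53, 2)  # 15000 samples, 53 channels, 2 wavelengths
eeg_shape = recording_shape(150, 1000, 16)      # 150000 samples, 16 channels
rate_ratio = 1000 // 100                        # EEG is sampled 10 times faster than fNIRS
```

The tenfold difference in sampling rate means the two modalities have different numbers of time samples for the same task window, which is one reason a resampling step is needed before they are processed jointly.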

4.2. Implementation Details

4.2.1. Experimental Setup

The entire dataset is randomly split into a training set, a validation set and a testing set for each training phase. The final model used in testing is the one that exhibits the best performance on the validation set. For the evaluation of depression diagnosis, macro accuracy, precision, recall and F1-score are used as evaluation indicators. Each experiment was repeated multiple times, and the average value of each indicator is reported.
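For readers unfamiliar with macro averaging, the sketch below shows one common way to compute these indicators: precision, recall and F1-score are computed per class and then averaged with equal weight, so the minority class counts as much as the majority class. The function name `macro_metrics` and the convention of returning 0 when a denominator is zero are assumptions, not details taken from the paper.

```python
def macro_metrics(y_true, y_pred, labels=(0, 1)):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0  # assumed zero-division rule
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    n = len(labels)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# A classifier that always predicts the majority class on imbalanced labels:
acc, prec, rec, f1 = macro_metrics([1, 1, 1, 0], [1, 1, 1, 1])
```

In this toy case the accuracy is 0.75 while the macro recall is only 0.5, which illustrates why accuracy alone can be misleading on class-imbalanced datasets.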

4.2.2. Data Preprocessing

For the MODMA dataset, we utilized the EEGLAB toolkit (Delorme and Makeig, 2004) within the MATLAB platform for EEG denoising. We applied a bandpass filter with a frequency range of 1-40 Hz to the raw EEG signal. Then, we utilized the extended ICA algorithm to obtain multiple independent EEG components to eliminate artifacts and noise components, such as electrooculogram (EOG), ECG, EMG, and eye movement. Finally, the ICLabel plugin was used to remove the identified artifacts and noise components. Additionally, we only selected part of the channel data from the prefrontal brain area, which processes emotional expression.

For the fNIRS-EEG dataset, we first used near-infrared data analysis tools to preprocess the fNIRS data. Preprocessing begins by removing motion artifacts from the raw data using the temporal derivative distribution repair method. Subsequently, the light intensity signal was converted into optical density, which was then filtered with a finite impulse response band-pass filter (0.01-0.08 Hz) to eliminate noise caused by physiological fluctuations such as pulse and respiration, as well as baseline drift caused by environmental and temperature changes. Finally, the optical density data were converted to concentration changes of oxygenated hemoglobin (HbO) and deoxyhemoglobin (HbR) using a modified Beer-Lambert method. Following prior fNIRS research (Zhu et al., 2020; Chao et al., 2021; Han et al., 2022), this study focuses on the HbO concentration change data in the subsequent method design. Additionally, we selected only part of the channels, namely those shown in red font in Figure 5. For the EEG signals, the same preprocessing method was implemented. Resampling was applied to both fNIRS and EEG data.

Table 1. Model configuration parameters.
Parameters Values
Learning rate 1e-3
Batch size 16
Dropout 0.1
$N_{Scale}$ 5
$N_{Trans}$ 1
$N_{Head}$ 16

4.2.3. Model Configuration

The model is built with the PyTorch framework and optimized with the RMSprop optimizer. The learning rate, batch size and other parameters are shown in Table 1.

4.3. Experimental Results

Table 2. Comparison of MRLMC model with baseline methods on MODMA dataset.
Model Acc. Prec. Rec. F1.
EEGNet (Lawhern et al., 2018) 0.568 - 0.668 0.600
STGCN (Yu et al., 2018) 0.588 - 0.577 0.596
DGCNN (Song et al., 2018) 0.597 - 0.459 0.552
HGP-SL (Zhang et al., 2019) 0.585 0.536 0.625 0.577
SAGE (Li et al., 2019) 0.679 0.640 0.667 0.653
SST-Emotionnet (Jia et al., 2020) 0.736 0.692 0.750 0.720
CGIPool (Pang et al., 2021) 0.736 0.692 0.750 0.720
SGP-SL (Chen et al., 2022) 0.849 0.808 0.875 0.840
TSception (Ding et al., 2022) 0.544 - 0.445 0.486
CLG (Shen et al., 2023a) 0.765 - 0.757 0.759
dFL (Shen et al., 2023b) 0.750 - 0.614 -
1DEEG-Transformer (Qayyum et al., 2023) 0.782 0.784 0.692 0.749
MRLMC 0.867 0.875 0.875 0.864

4.3.1. EEG Depression Recognition

To demonstrate the effectiveness of the MRLMC model and its applicability to a single modality, we first conducted experiments on the MODMA dataset. The baseline algorithms are EEGNet (Lawhern et al., 2018), STGCN (Yu et al., 2018), DGCNN (Song et al., 2018), HGP-SL (Zhang et al., 2019), SAGE (Li et al., 2019), SST-Emotionnet (Jia et al., 2020), CGIPool (Pang et al., 2021), SGP-SL (Chen et al., 2022), TSception (Ding et al., 2022), CLG (Shen et al., 2023a), dFL (Shen et al., 2023b) and 1DEEG-Transformer (Qayyum et al., 2023). All models utilize the raw EEG signals. Table 2 lists the evaluation indicators for each model. For EEG-based depression recognition, the MRLMC model attains the best performance, with 0.867, 0.875, 0.875, and 0.864 in accuracy, precision, recall, and F1-score, respectively. EEGNet is the classic convolutional neural network for processing EEG signals, which uses temporal and spatial convolution to extract data features. CLG and 1DEEG-Transformer alternately stack convolutional layers and long short-term memory networks to extract temporal and spatial features. In contrast, the MRLMC model uses an MSC network to extract the spatio-temporal representation and learns effective features with the contrastive loss function, thereby achieving the best classification performance. Compared with SGP-SL, the recognition accuracy of the MRLMC model is improved by 2%. Compared with recent methods such as CLG and 1DEEG-Transformer, the recognition accuracy is improved by 11%. In addition, the DGCNN and CGIPool models construct the extracted features into a graph structure and mine the relationships between data channels.
Existing research has shown that the prefrontal area of the brain is involved in emotional expression and is gradually activated when a stimulation task is performed. Therefore, we implemented a channel selection process before feature extraction. Compared to the DGCNN and CGIPool networks, the MRLMC model improves accuracy by 18%, since the proposed MSC module can also extract channel features. In particular, we also mine the deep semantic information of the data, such as brain activation levels, and maximize the semantic representation of multimodal data based on the consistency loss.

4.3.2. fNIRS Depression Recognition

Table 3. Comparison of MRLMC model with baseline methods on fNIRS-EEG dataset (only fNIRS).
Model Acc. Prec. Rec. F1.
LR 0.813 0.300 0.583 0.355
KNN 0.729 0.188 0.219 0.188
SVM (Song et al., 2014) 0.823 0.000 0.000 0.000
AlexNet (Krizhevsky et al., 2012) 0.830 0.790 0.830 0.800
ResNet (He et al., 2016) 0.720 0.670 0.720 0.700
RF (Zhu et al., 2020) 0.833 0.625 0.175 0.267
XGB (Zhu et al., 2020) 0.833 0.525 0.413 0.446
Corr-AlexNet (Wang et al., 2021) 0.900 0.910 0.900 0.880
GCN (Yu et al., 2022) 0.854 0.700 0.488 0.563
Diffpool (Yu et al., 2022) 0.875 0.750 0.475 0.571
MRLMC 0.913 0.827 0.908 0.834

Table 3 shows the performance of the MRLMC model on the fNIRS data in the fNIRS-EEG dataset. The baseline methods are Logistic Regression (LR), K-Nearest Neighbor (KNN), Support Vector Machine (SVM) (Song et al., 2014), AlexNet (Krizhevsky et al., 2012), Residual Network (ResNet) (He et al., 2016), Random Forest (RF) (Zhu et al., 2020), XGB (Zhu et al., 2020), Corr-AlexNet (Wang et al., 2021), GCN (Yu et al., 2022) and Diffpool (Yu et al., 2022). Our proposed method achieved 0.913, 0.827, 0.908 and 0.834 in accuracy, precision, recall and F1-score, respectively. The accuracy of traditional machine learning methods such as LR, KNN and SVM is not satisfactory, while deep learning algorithms such as AlexNet achieve relatively higher accuracy, which highlights the advantage of deep learning algorithms in depression recognition based on physiological signals. Compared with traditional machine learning, the Corr-AlexNet, GCN and Diffpool networks improve the accuracy by about 8%. These methods rely on manually extracted features for learning and lack deep exploration of spatio-temporal representation, dynamic features, and semantic representation. The MRLMC model extracts the spatio-temporal representation and dynamic features of the data through the spatio-temporal contrasting module. Additionally, the main symptoms of patients with depression include low mood and slow thinking, which cause their brains to be activated differently when performing stimulation tasks. The MRLMC model uses the semantic consistency module to mine the semantic representation that reflects brain activation states. Compared with traditional machine learning, the accuracy of MRLMC is improved by about 11%, and compared with methods that rely on manually extracted features, the accuracy is improved by about 1.5%.
Overall, for task-state physiological data, extracting both spatio-temporal and semantic representations yields higher recognition accuracy.

4.3.3. Multimodal Depression Recognition

Table 4. Extensive experiments of MRLMC model on fNIRS-EEG dataset.
fNIRS EEG Aug. Acc. Prec. Rec. F1.
✓ × ✓ 0.907 0.816 0.839 0.802
× ✓ ✓ 0.875 0.834 0.822 0.771
✓ ✓ × 0.907 0.836 0.875 0.816
✓ ✓ ✓ 0.917 0.850 0.881 0.831

Table 4 reports the recognition results of the MRLMC model on the fNIRS-EEG dataset. The best results were achieved with both fNIRS and EEG, with accuracy, precision, recall and F1-score reaching 0.917, 0.850, 0.881 and 0.831, respectively. With only fNIRS or only EEG, the recognition accuracy reaches 0.907 and 0.875, respectively. When a single-modal physiological signal is used for depression recognition, the recognition accuracy is limited by the feature representations available in the data. When utilizing multimodal physiological signals, the classification performance improves by 3%. When the data augmentation method is additionally applied, all evaluation indicators improve. fNIRS measures HbO concentration changes and EEG measures electrical signals, so the two modalities provide complementary features. The MRLMC model uses the spatio-temporal contrasting module to learn the complementary feature representations of multimodal data. Subsequently, the proposed semantic consistency module extracts the semantic features of multimodal physiological signals, such as the degree of brain activation, which are jointly learned through the consistency loss. Considering the challenge of class imbalance in real diagnosis and treatment environments, we use the focal loss function to construct the classification network, which enhances the robustness of the network and achieves higher recognition accuracy. The MRLMC model proved effective even for small-scale datasets.

Figure 6. The visualization of the distribution of features extracted by each module of the proposed model. (a) and (b) are the representations of fNIRS and EEG extracted by the spatio-temporal contrasting module. (c) and (d) are the semantic features of fNIRS and EEG extracted by the semantic consistency module.

To intuitively demonstrate the effectiveness and feature representation capabilities of the modules in the MRLMC model, Figure 6 displays the distribution of features extracted by each module on the fNIRS-EEG dataset. Figure 6 (a) and (b) show the distribution of the representations of fNIRS and EEG extracted by the spatio-temporal contrasting module, which are not yet completely separable. Figure 6 (c) and (d) show the distribution of semantic features extracted by the semantic consistency module, at which point the MRLMC model can accomplish depression recognition.

Table 5. Results of loss terms ablation experiments in each proposed module.
$\mathcal{L}_{MSC}$ $\mathcal{L}_{SC}$ $\mathcal{L}_{FL}$ Acc. Prec. Rec. F1.
× × 0.800 0.527 0.543 0.533
× 0.891 0.777 0.723 0.740
× 0.875 0.770 0.714 0.723
0.917 0.850 0.881 0.831

4.3.4. Ablation Analysis

To verify the effectiveness of the different modules in our proposed model, we conduct additional ablation experiments on the fNIRS-EEG dataset, as shown in Table 5. $\mathcal{L}_{MSC}$ and $\mathcal{L}_{SC}$ are the loss functions applied by the spatio-temporal contrasting and semantic consistency modules, respectively. $\mathcal{L}_{FL}$ is the depression recognition loss, which is used in all experiments. The results indicate that the best performance is obtained when all losses are used. Adding either $\mathcal{L}_{MSC}$ or $\mathcal{L}_{SC}$ alone performs better than using only the recognition loss. This indicates that the proposed $\mathcal{L}_{MSC}$ and $\mathcal{L}_{SC}$ help the model obtain useful spatio-temporal representations and semantic information, and that the spatio-temporal contrasting and semantic consistency modules are effective for depression recognition from multimodal physiological signals.

4.3.5. Parameter Analysis

Table 6. Performance of MRLMC model with different parameters on fNIRS-EEG dataset.
$N_{Scale}$ $N_{Trans}$ $N_{Head}$ Acc. Prec. Rec. F1.
4 1 16 0.907 0.815 0.839 0.806
5 1 16 0.917 0.850 0.881 0.831
6 1 16 0.917 0.838 0.809 0.804
5 2 16 0.891 0.786 0.777 0.764
5 3 16 0.875 0.530 0.571 0.549
5 1 4 0.792 0.661 0.738 0.657
5 1 8 0.900 0.795 0.814 0.787
5 1 32 0.896 0.781 0.798 0.775

To further investigate the MRLMC model, we analyze the impact of several important parameters on performance in this section. Table 6 reports the performance of the MRLMC model with different parameters on the fNIRS-EEG dataset. The first three rows show the effect of the number of spatio-temporal convolution blocks on model performance, the middle two rows show the effect of the number of transformer units, and the last three rows show the effect of the number of attention heads. The results show that different parameters affect the model differently. When the number of convolution blocks is 5 or 6, the recognition accuracy is highest. The number of transformer encoder units has a slightly greater impact on depression recognition performance. As the number of units increases, the network complexity increases, which causes the model to overfit. In addition, information may be lost or confused as it passes through the layers, making it difficult for the network to learn useful semantic information. When the number of attention heads is 8 or 16, the model achieves superior performance.

Figure 7. Performance of the MRLMC model with different numbers of spatio-temporal convolution blocks on the fNIRS-EEG dataset. The shaded area marks the best performance.

Figure 7 shows the performance of the MRLMC model with different numbers of spatio-temporal convolution blocks on the fNIRS-EEG dataset. As the number of convolution blocks increases, the recognition accuracy decreases, which indicates that the spatio-temporal representation has a great impact on recognition performance for small-scale datasets. A larger number of blocks means a more complex model. The main characteristics of multimodal physiological signals are their spatio-temporal representation and dynamic variability, and their key information is often hidden in local sequence patterns and global temporal dependence. Networks with high complexity may fail to capture this key information, making it difficult to learn an effective spatio-temporal representation. Therefore, for the small-scale fNIRS-EEG dataset, 5 or 6 spatio-temporal convolution blocks give the best results. If the MRLMC model is to be transferred to other downstream multimodal time series tasks, the number of spatio-temporal convolution blocks needs to be determined based on the characteristics of the data.

5. Conclusion

In this paper, we propose a multimodal physiological signals representation learning framework via multiscale contrasting for depression recognition. The Siamese network architecture is utilized to maximize the complementarity between multimodal data. We design multiscale spatio-temporal convolution to obtain more discriminative spatio-temporal representations and dynamic features. The spatio-temporal contrasting module aims to minimize the feature representation and maximize the complementarity of fNIRS and EEG. Meanwhile, the semantic consistency module captures contextual information and the deep semantic information of the data to maximize the semantic representation of multimodal data based on the semantic consistency loss. Extensive experiments are conducted on the MODMA and fNIRS-EEG datasets, and our proposed model achieves state-of-the-art performance on both single-modal and multimodal data. Moreover, the analysis of the feature distribution and key parameters shows that each module plays an important role in mining spatio-temporal representations and semantic features. Notably, the proposed model is a generalized architecture based on multichannel physiological signals, which can be extended to the recognition of other mental disorders and of cognitive ability in the future.

Acknowledgements.
This work was supported by the National Natural Science Foundation of China (NSFC) under Grants No. 62176101 and No. 62272178.

References

  • Acharya et al. (2018) U Rajendra Acharya, Shu Lih Oh, Yuki Hagiwara, Jen Hong Tan, Hojjat Adeli, and D Puthankattil Subha. 2018. Automated EEG-based screening of depression using deep convolutional neural network. Computer Methods and Programs in Biomedicine 161 (2018), 103–113.
  • Altaheri et al. (2022) Hamdi Altaheri, Ghulam Muhammad, and Mansour Alsulaiman. 2022. Physics-informed attention temporal convolutional network for EEG-based motor imagery classification. IEEE Transactions on Industrial Informatics 19, 2 (2022), 2249–2258.
  • Chao et al. (2021) **long Chao, Shuzhen Zheng, Hongtong Wu, Dixin Wang, Xuan Zhang, Hong Peng, and Bin Hu. 2021. fNIRS evidence for distinguishing patients with major depression and healthy controls. IEEE Transactions on Neural Systems and Rehabilitation Engineering 29 (2021), 2211–2221.
  • Chen et al. (2022) Tao Chen, Yanrong Guo, Shijie Hao, and Richang Hong. 2022. Exploring self-attention graph pooling with EEG-based topological structure and soft label for depression detection. IEEE Transactions on Affective Computing 13, 4 (2022), 2106–2118.
  • Cicalese et al. (2020) Pietro A Cicalese, Rihui Li, Mohammad B Ahmadi, Chushan Wang, Joseph T Francis, Sudhakar Selvaraj, Paul E Schulz, and Yingchun Zhang. 2020. An EEG-fNIRS hybridization technique in the four-class classification of alzheimer’s disease. Journal of Neuroscience Methods 336 (2020), 108618.
  • Delorme and Makeig (2004) Arnaud Delorme and Scott Makeig. 2004. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods 134, 1 (2004), 9–21.
  • Ding et al. (2022) Yi Ding, Neethu Robinson, Su Zhang, Qiuhao Zeng, and Cuntai Guan. 2022. Tsception: Capturing temporal dynamics and spatial asymmetry from eeg for emotion recognition. IEEE Transactions on Affective Computing (2022).
  • Gao et al. (2023) Yunyuan Gao, Biao Jia, Michael Houston, and Yingchun Zhang. 2023. Hybrid EEG-fNIRS Brain Computer Interface Based on Common Spatial Pattern by Using EEG-Informed General Linear Model. IEEE Transactions on Instrumentation and Measurement 72 (2023), 1–10.
  • Gong et al. (2023) Peiliang Gong, Ziyu Jia, Pengpai Wang, Yueying Zhou, and Daoqiang Zhang. 2023. ASTDF-Net: Attention-Based Spatial-Temporal Dual-Stream Fusion Network for EEG-Based Emotion Recognition. In Proceedings of the 31st ACM International Conference on Multimedia. Association for Computing Machinery, New York, NY, USA, 883–892.
  • Han et al. (2022) Jianda Han, Jiewei Lu, Jianeng Lin, Song Zhang, and Ningbo Yu. 2022. A Functional Region Decomposition Method to Enhance fNIRS Classification of Mental States. IEEE Journal of Biomedical and Health Informatics 26, 11 (2022), 5674–5683.
  • Hashempour et al. (2022) S. Hashempour, R. Boostani, M. Mohammadi, and S. Sanei. 2022. Continuous Scoring of Depression From EEG Signals via a Hybrid of Convolutional Neural Networks. IEEE Transactions on Neural Systems and Rehabilitation Engineering 30 (2022), 176–183.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
  • He et al. (2022b) Lang He, Chenguang Guo, Prayag Tiwari, Hari Mohan Pandey, and Wei Dang. 2022b. Intelligent system for depression scale estimation with facial expressions and case study in industrial intelligence. International Journal of Intelligent Systems 37, 12 (2022), 10140–10156.
  • He et al. (2022a) Qun He, Lufeng Feng, Guoqian Jiang, and ** Xie. 2022a. Multimodal Multitask Neural Network for Motor Imagery Classification With EEG and fNIRS Signals. IEEE Sensors Journal 22, 21 (2022), 20695–20706.
  • Hossain et al. (2022) M Shamim Hossain, Josu Bilbao, Diana P Tobón, Ghulam Muhammad, and Abdulmotaleb El Saddik. 2022. Special issue deep learning for multimedia healthcare. Multimedia Systems 28, 4 (2022), 1147–1150.
  • Hossain et al. (2023) M Shamim Hossain, Josu Bilbao, Diana P Tobón, and Abdulmotaleb El Saddik. 2023. Advances of machine learning in IoT-cloud for healthcare. Computing 105, 4 (2023), 741–742.
  • Jia et al. (2020) Ziyu Jia, Youfang Lin, Xiyang Cai, Haobin Chen, Haijun Gou, and **g Wang. 2020. Sst-emotionnet: Spatial-spectral-temporal based attention 3d dense network for eeg emotion recognition. In ACM International Conference on Multimedia (MM’20). 2909–2917.
  • ** and Li (2023) Ming ** and **peng Li. 2023. Graph to Grid: Learning Deep Representations for Multimodal Emotion Recognition. In 31st ACM International Conference on Multimedia. 5985–5993.
  • Kayalvizhi et al. (2023) S Kayalvizhi, S Nagarajan, J Deepa, and K Hemapriya. 2023. Multi-modal IoT-based medical data processing for disease diagnosis using Heuristic-derived deep learning. Biomedical Signal Processing and Control 85 (2023), 104889.
  • Khazanov et al. (2020) Gabriela K Khazanov, Colin Xu, Barnaby D Dunn, Zachary D Cohen, Robert J DeRubeis, and Steven D Hollon. 2020. Distress and anhedonia as predictors of depression treatment outcome: A secondary analysis of a randomized clinical trial. Behaviour Research and Therapy 125 (2020), 103507.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012).
  • Kroenke and Spitzer (2002) Kurt Kroenke and Robert L Spitzer. 2002. The PHQ-9: a new depression diagnostic and severity measure. 509–515.
  • Lawhern et al. (2018) Vernon J Lawhern, Amelia J Solon, Nicholas R Waytowich, Stephen M Gordon, Chou P Hung, and Brent J Lance. 2018. EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces. Journal of Neural Engineering 15, 5 (2018), 056013.
  • Li et al. (2019) Jia Li, Yu Rong, Hong Cheng, Helen Meng, Wenbing Huang, and Junzhou Huang. 2019. Semi-supervised graph classification: A hierarchical graph perspective. In The World Wide Web Conference. 972–982.
  • Li et al. (2018) Xiaowei Li, Jianxiu Li, Bin Hu, Jing Zhu, Xuemin Zhang, Liuqing Wei, Ning Zhong, Mi Li, Zhijie Ding, Jing Yang, and Lan Zhang. 2018. Attentional bias in MDD: ERP components analysis and classification using a dot-probe task. Computer Methods and Programs in Biomedicine 164 (2018), 169–179.
  • Liu et al. (2021) Jinrui Liu, Ting Song, Zhilin Shu, Jianda Han, and Ningbo Yu. 2021. fNIRS feature extraction and classification in grip-force tasks. In 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 1087–1091.
  • Lu et al. (2005) Bai Lu, MA Hui, and Huang Yu-Xia. 2005. The Development of Native Chinese Affective Picture System–A pretest in 46 College Students. Chinese Mental Health Journal (2005).
  • Malhi and Mann (2018) Gin S Malhi and J John Mann. 2018. Depression. The Lancet 392, 10161 (2018), 2299–2312.
  • Midha et al. (2021) Serena Midha, Horia A Maior, Max L Wilson, and Sarah Sharples. 2021. Measuring mental workload variations in office work tasks using fNIRS. International Journal of Human-Computer Studies 147 (2021), 102580.
  • Muhammad et al. (2021) Ghulam Muhammad, Fatima Alshehri, Fakhri Karray, Abdulmotaleb El Saddik, Mansour Alsulaiman, and Tiago H Falk. 2021. A comprehensive survey on multimodal medical signals fusion for smart healthcare systems. Information Fusion 76 (2021), 355–375.
  • Muhammad et al. (2020) Ghulam Muhammad, M Shamim Hossain, and Neeraj Kumar. 2020. EEG-based pathology detection for home health monitoring. IEEE Journal on Selected Areas in Communications 39, 2 (2020), 603–610.
  • Organization (2022) World Health Organization. 2022. Wake-up call to all countries to step up mental health services and support.
  • Organization (2023) World Health Organization. 2023. Depressive disorder (depression). https://www.who.int/zh/news-room/fact-sheets/detail/depression
  • Pang et al. (2021) Yunsheng Pang, Yunxiang Zhao, and Dongsheng Li. 2021. Graph pooling via coarsened graph infomax. In 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2177–2181.
  • Peng et al. (2023) Dan Peng, Wei Liu, Yun Luo, Ziyu Mao, Wei-Long Zheng, and Bao-Liang Lu. 2023. Deep Depression Detection with Resting-State and Cognitive-Task EEG. In 2023 45th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). 1–4.
  • Pethuraj et al. (2023) Mohamed Shakeel Pethuraj, MA Burhanuddin, and V Brindha Devi. 2023. Improving accuracy of medical data handling and processing using DCAF for IoT-based healthcare scenarios. Biomedical Signal Processing and Control 86 (2023), 105294.
  • Qayyum et al. (2023) Abdul Qayyum, Imran Razzak, M Tanveer, Moona Mazher, and Bandar Alhaqbani. 2023. High-density electroencephalography and speech signal based deep framework for clinical depression diagnosis. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2023).
  • Qiu et al. (2022) Lina Qiu, Yongshi Zhong, Qiuyou Xie, Zhipeng He, Xiaoyun Wang, Yingyue Chen, Chang’an A Zhan, and Jiahui Pan. 2022. Multi-modal integration of EEG-fNIRS for characterization of brain activity evoked by preferred music. Frontiers in Neurorobotics 16 (2022), 823435.
  • Rocco et al. (2021) Giulia Rocco, Jerome Lebrun, Olivier Meste, and M-N Magnie-Mauro. 2021. A Chiral fNIRS Spotlight on Cerebellar Activation in a Finger Tapping Task. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 1018–1021.
  • Ruotsalo et al. (2023) Tuukka Ruotsalo, Kalle Mäkelä, Michiel M. Spapé, and Luis A. Leiva. 2023. Feeling Positive? Predicting Emotional Image Similarity from Brain Signals. In 31st ACM International Conference on Multimedia. 5870–5878.
  • Shah et al. (2019) Dhvani Shah, Grace Y Wang, Maryam Doborjeh, Zohreh Doborjeh, and Nikola Kasabov. 2019. Deep learning of eeg data in the neucube brain-inspired spiking neural network architecture for a better understanding of depression. In 26th International Conference on Neural Information Processing. Springer, 195–206.
  • Shao et al. (2023) Kai Shao, Yixue Hao, Long Hu, Xiaofen Zong, and Min Chen. 2023. Data Augmentation and Pseudo-sequence of fNIRS for Depression Recognition. In 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2223–2226.
  • Shen et al. (2023a) Jian Shen, Jiaying Chen, Yu Ma, Zheyu Cao, Yanan Zhang, and Bin Hu. 2023a. Explainable Depression Recognition from EEG Signals via Graph Convolutional Network. In 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 1406–1412.
  • Shen et al. (2022) Jian Shen, Yanan Zhang, Huajian Liang, Zeguang Zhao, Qunxi Dong, Kun Qian, Xiaowei Zhang, and Bin Hu. 2022. Exploring the intrinsic features of EEG signals via empirical mode decomposition for depression recognition. IEEE Transactions on Neural Systems and Rehabilitation Engineering 31 (2022), 356–365.
  • Shen et al. (2023b) Jian Shen, Yanan Zhang, Huajian Liang, Zeguang Zhao, Kexin Zhu, Kun Qian, Qunxi Dong, Xiaowei Zhang, and Bin Hu. 2023b. Depression recognition from EEG signals using an adaptive channel fusion method via improved focal loss. IEEE Journal of Biomedical and Health Informatics (2023).
  • Shin et al. (2018) Jaeyoung Shin, Jinuk Kwon, and Chang-Hwan Im. 2018. A ternary hybrid EEG-NIRS brain-computer interface for the classification of brain activation patterns during mental arithmetic, motor imagery, and idle state. Frontiers in Neuroinformatics 12 (2018), 5.
  • Song et al. (2014) Hong Song, Weilong Du, Xin Yu, Wentian Dong, Wenxiang Quan, Weimin Dang, Huijun Zhang, Ju Tian, and Tianhang Zhou. 2014. Automatic depression discrimination on FNIRS by using general linear model and SVM. In 7th International Conference on Biomedical Engineering and Informatics. IEEE, 278–282.
  • Song et al. (2018) Tengfei Song, Wenming Zheng, Peng Song, and Zhen Cui. 2018. EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Transactions on Affective Computing 11, 3 (2018), 532–541.
  • Taquet et al. (2021) Maxime Taquet, Emily A Holmes, and Paul J Harrison. 2021. Depression and anxiety disorders during the COVID-19 pandemic: knowns and unknowns. The Lancet 398, 10312 (2021), 1665–1666.
  • Uddin et al. (2022) Md Zia Uddin, Kim Kristoffer Dysthe, Asbjørn Følstad, and Petter Bae Brandtzaeg. 2022. Deep learning for prediction of depressive symptoms in a large textual dataset. Neural Computing and Applications 34, 1 (2022), 721–744.
  • Vai et al. (2020) Benedetta Vai, Lorenzo Parenti, Irene Bollettini, Cristina Cara, Chiara Verga, Elisa Melloni, Elena Mazza, Sara Poletti, Cristina Colombo, and Francesco Benedetti. 2020. Predicting differential diagnosis between bipolar and unipolar depression with multiple kernel learning on multimodal structural neuroimaging. European Neuropsychopharmacology 34 (2020), 28–38.
  • Wang et al. (2023) Guangming Wang, Ning Wu, Yi Tao, Won Hee Lee, Zehong Cao, Xiangguo Yan, and Gang Wang. 2023. The Diagnosis of Major Depressive Disorder Through Wearable fNIRS by Using Wavelet Transform and Parallel-CNN Feature Fusion. IEEE Transactions on Instrumentation and Measurement 72 (2023), 1–11.
  • Wang et al. (2021) Rui Wang, Yixue Hao, Qiao Yu, Min Chen, Iztok Humar, and Giancarlo Fortino. 2021. Depression analysis and recognition based on functional near-infrared spectroscopy. IEEE Journal of Biomedical and Health Informatics 25, 12 (2021), 4289–4299.
  • Wang et al. (2022) Zenghui Wang, Jun Zhang, Xiaochu Zhang, Peng Chen, and Bing Wang. 2022. Transformer model for functional near-infrared spectroscopy classification. IEEE Journal of Biomedical and Health Informatics 26, 6 (2022), 2559–2569.
  • Wei et al. (2021) YanYan Wei, Qi Chen, Adrian Curtin, Li Tu, Xiaochen Tang, YingYing Tang, LiHua Xu, ZhenYing Qian, Jie Zhou, ChaoZhe Zhu, et al. 2021. Functional near-infrared spectroscopy (fNIRS) as a tool to assist the diagnosis of major psychiatric disorders in a Chinese population. European Archives of Psychiatry and Clinical Neuroscience 271 (2021), 745–757.
  • Yu et al. (2018) Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2018. Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3634–3640.
  • Yu et al. (2020) Chi-Lin Yu, Hsin-Chin Chen, Zih-Yun Yang, and Tai-Li Chou. 2020. Multi-time-point analysis: A time course analysis with functional near-infrared spectroscopy. Behavior Research Methods 52 (2020), 1700–1713.
  • Yu et al. (2022) Qiao Yu, Rui Wang, Jia Liu, Long Hu, Min Chen, and Zhongchun Liu. 2022. GNN-Based Depression Recognition Using Spatio-Temporal Information: A fNIRS Study. IEEE Journal of Biomedical and Health Informatics 26, 10 (2022), 4925–4935.
  • Zhang et al. (2023b) Chutian Zhang, Hongjun Yang, Chen-Chen Fan, Sheng Chen, Chenyu Fan, Zeng-Guang Hou, Jingyao Chen, Liang Peng, Kexin Xiang, Yi Wu, et al. 2023b. Comparing Multi-Dimensional fNIRS Features Using Bayesian Optimization-Based Neural Networks for Mild Cognitive Impairment (MCI) Detection. IEEE Transactions on Neural Systems and Rehabilitation Engineering 31 (2023), 1019–1029.
  • Zhang et al. (2023a) Yukun Zhang, Shuang Qiu, and Huiguang He. 2023a. Multimodal motor imagery decoding method based on temporal spatial feature alignment and fusion. Journal of Neural Engineering 20, 2 (2023), 026009.
  • Zhang et al. (2019) Zhen Zhang, Jiajun Bu, Martin Ester, Jianfeng Zhang, Chengwei Yao, Zhi Yu, and Can Wang. 2019. Hierarchical graph pooling with structure learning. arXiv preprint arXiv:1911.05954 (2019).
  • Zheng et al. (2020) Shuzhen Zheng, Chang Lei, Tao Wang, Chunyun Wu, Jieqiong Sun, and Hong Peng. 2020. Feature-level fusion for depression recognition based on fnirs data. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2906–2913.
  • Zhu et al. (2020) Yibo Zhu, Jagadish K Jayagopal, Ranjana K Mehta, Madhav Erraguntla, Joseph Nuamah, Anthony D McDonald, Heather Taylor, and Shuo-Hsiu Chang. 2020. Classifying major depressive disorder using fNIRS during motor rehabilitation. IEEE Transactions on Neural Systems and Rehabilitation Engineering 28, 4 (2020), 961–969.