Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection

Dinuka Sandun Udayantha11, Kavindu Weerasinghe1, Nima Wickramasinghe1, Akila Abeyratne1,
Kithmin Wickremasinghe2, Jithangi Wanigasinghe3, Anjula De Silva1, and Chamira Edussooriya12
1Dept. Electronic and Telecommunication Eng., University of Moratuwa, Sri Lanka
2Dept. Electrical and Computer Eng., University of British Columbia, Canada
3Department of Paediatrics, Faculty of Medicine, University of Colombo, Sri Lanka
email: [email protected] , [email protected]
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Abstract

The neonatal period is the most vulnerable time for the development of seizures. Seizures in the immature brain lead to detrimental consequences and therefore require early diagnosis. The gold standard for neonatal seizure detection currently relies on continuous video-EEG monitoring, which involves recording a multi-channel electroencephalogram (EEG) alongside real-time video monitoring within a neonatal intensive care unit (NICU). However, video-EEG monitoring technology requires clinical expertise and is often limited to technologically advanced and well-resourced settings. Cost-effective new techniques could help the medical fraternity make an accurate diagnosis and advocate treatment without delay. In this work, a novel explainable deep learning model to automate the neonatal seizure detection process with a reduced EEG montage is proposed, which employs convolutional nets, graph attention layers, and fully connected layers. Beyond its ability to detect seizures in real-time with a reduced montage, this model offers the unique advantage of real-time interpretability. Evaluated on the Zenodo dataset with 10-fold cross-validation, the presented model achieves an absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall, respectively, over the state-of-the-art ST-GAT (FL) model retrained on the same reduced montage.

Index Terms:
seizure detection, electroencephalogram (EEG), convolutional neural network (CNN), graph attention (GAT), explainability

I INTRODUCTION

Figure 1: The proposed deep learning model architecture. The first 4 blocks belong to the CNN-based temporal feature extractor. The 12 channels are preserved throughout the CNN encoder, while the temporal features are down-sampled at the end of each block using average pooling. GAT layers 1, 2, and 3 have the output shapes $(12\times 37)$, $(12\times 32)$, and $(12\times 16)$, respectively, and the last multilayer perceptron (MLP) network has 32, 16, and 1 neurons, respectively.

Neonatal seizures are epileptic seizures that occur in infants younger than 4 weeks. In an EEG, they can be identified as the occurrence of a sudden, abnormal, and paroxysmal ictal rhythm with an amplitude of $2\,\mu V$ or higher. This 4-week neonatal period is the most vulnerable time to develop seizures, which are capable of causing significant harm to the developing brain and necessitate prompt diagnosis followed by treatment. According to Kang et al. [1], the risk is greatest during the first 1-2 days. The prevalence and importance of aetiological factors for neonatal seizures are continuously changing and differ between developed and developing countries depending on the available care in NICUs. Among the numerous aetiological factors, hypoxic-ischaemic encephalopathy is the most common, especially among term neonates [2]. The epidemiology of neonatal seizures shows a high incidence rate in low-income settings [3, 4]. For example, Sri Lankan data reveal that 3 per 1000 live births in term neonates and 7.5 per 1000 births in preterm neonates experience seizures. Since there are significantly fewer or no facilities for EEG monitoring in almost all Sri Lankan NICUs, these figures are likely to be under-reported [5].

Detecting neonatal seizures is particularly challenging because they often manifest subtly and can be mistaken for normal physiological behaviors. Therefore, having an objective monitoring method is critical. The gold standard method is video-EEG monitoring, which requires continuously monitoring the infant's brain activity using an EEG during suspected seizure events. This method, while reliable, is resource-intensive and may not always be feasible in resource-limited clinical settings, mainly due to the unavailability of suitable equipment and the lack of experienced neonatologists and neurophysiologists for patient monitoring, both of which contribute to critical delays in the diagnosis process [6]. Therefore, several studies in recent years have sought to replace this monitoring task with machine learning and deep learning. Despite their reported performance, none of these have been integrated into hospital settings.

Among the earliest works, Temko et al. [7] designed a support vector machine classifier for a dataset from Cork University Maternity Hospital. With recent advances in deep learning, such as deep convolutional neural networks (CNNs), recurrent neural networks, and long short-term memory networks, several studies have been carried out to classify EEG signals. In [8, 9], the authors applied 2D convolutions to detect seizures: in [8], the input EEG signal is treated as a 2D image, and in [9], the input is the spectrogram of the EEG epoch. Recently, models such as STATENET [10] and ST-GAT [11] have been introduced, where both temporal and spatial features are considered for the model prediction. The main drawback of these existing models is that they do not scale to a reduced number of channels, which is an important requirement for neonatal seizure detection using low-cost hardware accessible to resource-limited environments. Further, these models converge very slowly and cannot explain their output in terms of the particular EEG channels and time intervals of the input EEG epoch. Michele et al. [12] introduced an explainable deep learning model for blink detection from EEG using the gradient-weighted class activation mapping (Grad-CAM) [13] method, which is capable of showing exactly where the blink occurs in the EEG epoch. In addition, several other studies in self-supervised learning (SSL) [14, 15] were conducted to detect seizures. However, SSL-based methods perform somewhat worse than the state-of-the-art (SOTA), for example with an 8% reduction in AUC.

In this work, we introduce a novel explainable deep learning model architecture that is capable of detecting seizures in EEG signals from a reduced EEG montage and interpreting the results in real-time. As the CNN is a dominant architecture in computer vision tasks [16] and sequence transduction tasks [17], our work leverages a CNN encoder, where 1D convolutions are performed to extract temporal features, a graph attention (GAT) network for spatial feature extraction, and a binary classification head to classify EEG signals into seizure and normal states. With 80% of the data used for training and 20% for testing, the model achieves an absolute improvement of 2.71% and 16.33% in AUC and recall, respectively, over the retrained ST-GAT (FL) baseline. Moreover, when evaluated with 10-fold cross-validation, the model achieves an absolute improvement of 8.31% and 42.86% in AUC and recall, respectively.

II PROPOSED MODEL ARCHITECTURE

This section is divided into two subsections. Section II-A introduces the proposed deep learning architecture in detail, covering the CNN encoder followed by the GAT network. Section II-B discusses model interpretability.

II-A Deep Learning Model Architecture

Figure 2: (a) The proposed reduced montage electrode placement for seizure detection on the international 10-20 system. (b) Graph representation of the selected reduced electrode montage. The graph nodes represent the channels and the edges represent the functional connectivity between channels.

In this section, we introduce the proposed novel deep learning model for real-time seizure detection from neonatal EEG signals. The proposed model employs 1) a CNN encoder (Section II-A1), 2) a GAT network (Section II-A2), and 3) a fully connected classification head (Section II-A3). The CNN encoder is used to extract the temporal features from the EEG epochs, and the graph attention encoder is used to extract spatial features from the output of the CNN encoder. Apart from seizure detection, we integrate interpretability into our model by leveraging a modified Grad-CAM [13] to explain which time ranges in each channel of a given EEG epoch contribute most significantly to the respective binary class of the model output.

II-A1 CNN Encoder

As the EEG signals are time series data, we use 1-D convolutions to extract the temporal features from EEG epochs. The CNN encoder employs four blocks, where each block utilizes convolutional layers with $(1\times 5)$ and $(1\times 7)$ receptive fields and {32, 64, 8, 1} filters, as shown in Fig. 1. After pre-processing the raw EEG data, the input matrix to this CNN encoder has the shape $12\times 384$, where 12 denotes the number of EEG channels and 384 denotes the number of data samples within a time window of 12 s.

Consider $F(x,\{W_{1i}\})$ and $H(x,\{W_{2i}\})$ as two different mapping functions of a set of stacked layers that output two different matrices with the same dimension given the same input matrix $x$. Therefore, we are able to perform the two mappings in parallel and add them together to obtain a new matrix with different features, as in the equation:

$$H(x,\{W_{i}\}) = F(x,\{W_{1i}\}) + H(x,\{W_{2i}\}) \tag{1}$$

This technique is applied in convolutional Block 1, as seen in Fig. 1, to extract different temporal features simultaneously by applying different kernel sizes in two parallel convolutional layers. Block 1 is followed by another 3 convolutional blocks, each of which has a residual learning framework as proposed in [18] for fast convergence of the model. In this CNN encoder, the skip connections simply perform identity mapping to preserve the original input dimensions when adding.
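The parallel mapping of Eq. (1) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random kernels, the use of 32 filters in both branches, the zero "same" padding, and the helper name `conv1d_same` are all assumptions made for illustration.

```python
import numpy as np

def conv1d_same(x, kernels):
    """Per-channel 1-D convolution with zero 'same' padding.

    x: (channels, time); kernels: (filters, k) with odd k.
    Returns an array of shape (channels, time, filters).
    """
    k = kernels.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # Sliding windows of length k along the time axis: (channels, time, k)
    windows = np.lib.stride_tricks.sliding_window_view(xp, k, axis=1)
    return windows @ kernels.T

rng = np.random.default_rng(0)
x = rng.standard_normal((12, 384))        # one EEG epoch: 12 channels x 384 samples
w5 = rng.standard_normal((32, 5)) * 0.1   # hypothetical (1x5) kernels
w7 = rng.standard_normal((32, 7)) * 0.1   # hypothetical (1x7) kernels

def relu(z):
    return np.maximum(z, 0.0)

branch_f = relu(conv1d_same(x, w5))       # plays the role of F(x, {W_1i})
branch_h = relu(conv1d_same(x, w7))       # plays the role of H(x, {W_2i})
block_out = branch_f + branch_h           # Eq. (1): element-wise sum of both branches
```

Because both branches use "same" padding, their outputs have identical shapes and can be added element-wise, which is the property the parallel block relies on.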

Since this CNN encoder is a tiny network with only 8 convolution layers, it proves difficult to achieve good training performance with a simple sequential network. In order to introduce non-linearity in our model, we incorporated the widely used rectified linear unit (ReLU) as the activation of each convolution layer, since other activation functions may cause vanishing gradients in backpropagation or result in higher training time. We further experimented with the Swish activation function; however, it increased the training time by approximately 13 minutes compared to the ReLU function, which did not seem beneficial.

After adding the convolution outputs in each block, average pooling is performed to downsample the feature map by a factor of 2 for better optimization of the model, and batch normalization is applied as a remedy for the exploding-gradient issue, according to [19]. The reason for using average pooling instead of max pooling is to aggregate more temporal information into one feature point rather than depending solely on a single value within a moving window. This CNN encoder is designed such that it reduces overfitting and training time, and stops gradient degradation with the help of residual and parallel connections. The receptive fields, number of filters, and number of layers were selected through a rigorous ablation study.
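The difference between the two pooling choices can be seen in a short NumPy sketch. The helper names are illustrative, and non-overlapping windows of 2 samples are assumed here.

```python
import numpy as np

def avg_pool_time(x, factor=2):
    """Average non-overlapping windows of `factor` samples along the last axis."""
    n_t = x.shape[-1] - x.shape[-1] % factor      # drop a trailing remainder, if any
    return x[..., :n_t].reshape(*x.shape[:-1], n_t // factor, factor).mean(axis=-1)

def max_pool_time(x, factor=2):
    """Keep only the largest value of each window (shown for comparison)."""
    n_t = x.shape[-1] - x.shape[-1] % factor
    return x[..., :n_t].reshape(*x.shape[:-1], n_t // factor, factor).max(axis=-1)

toy = np.arange(8.0).reshape(1, 8)                # [[0, 1, 2, 3, 4, 5, 6, 7]]
avg = avg_pool_time(toy)                          # every sample contributes to the output
mx = max_pool_time(toy)                           # only one sample per window survives

epoch = np.zeros((12, 384))
pooled = avg_pool_time(epoch)                     # temporal axis halves: 384 -> 192
```

With average pooling every sample in the window influences the pooled value, which matches the motivation given above for aggregating temporal information.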

II-A2 GAT Network

Graph Representation: After extracting the temporal features from a signal, we need to extract spatial features from the EEG epochs with the help of inter-channel connectivity. Here, we employ a GAT network [11, 20] to extract the spatial features. The requirements to generate the graph are the vertices that correspond to the 12 channels, their feature vectors, which are the output features from the CNN encoder for each channel, and the adjacency matrix that denotes the functional connectivity between channel pairs. Further, the selected channels given in Section III are capable of modeling brain connectivity, as in the graph in Fig. 2(b).
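As a rough illustration of how such an adjacency matrix could be assembled, the sketch below uses the 12 bipolar channel names listed in Section III. The connectivity rule is an assumption made only for this example: two channels are linked when they share an electrode. The actual edges used in this work are those drawn in Fig. 2(b), which may differ.

```python
import numpy as np

CHANNELS = ["Fp1-T3", "T3-O1", "Fp1-C3", "C3-O1", "Fp2-C4", "C4-O2",
            "Fp2-T4", "T4-O2", "T3-C3", "C3-CZ", "CZ-C4", "C4-T4"]

def build_adjacency(channels):
    """Binary adjacency matrix with self-loops.

    Assumed rule (illustrative only): two bipolar channels are connected
    if they have at least one electrode in common.
    """
    electrodes = [set(ch.split("-")) for ch in channels]
    n = len(channels)
    adj = np.eye(n, dtype=int)                    # self-loops on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            if electrodes[i] & electrodes[j]:
                adj[i, j] = adj[j, i] = 1         # undirected edge
    return adj

ADJ = build_adjacency(CHANNELS)
```

The self-loops matter for the attention layers described next, since each node attends to itself as well as to its neighbors.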

According to Tekgul et al. [21], the localization of EEG seizures with a reduced electrode montage is acceptable and comparable to that of a standard 10-20 EEG system, as most neonatal seizures occur in the central zone of the brain. Hence, not considering global inter-hemisphere connections in the frontal lobe and parietal lobe would not reduce the network efficiency. Additionally, other neonatal seizures occur in the bilateral posterior and anterior regions, which are covered by O1, O2 and Fp1, Fp2, respectively. Also, due to the EEG channels T3-C3, C3-CZ, CZ-C4, and C4-T4, the biological connection between the left and right hemispheres is maintained throughout the process. Therefore, with the designed graph, it is possible to leverage the information passing between the left and right parieto-occipital or fronto-temporal zones. As neonates have very small brains, the proposed graph rarely misses the event of a seizure [22].

Figure 3: Performance comparison between GAT layers and scaled dot product attention layers. Training for dot product attention was terminated after 50 epochs due to low performance.

Attention Layers: It is imperative to pay attention to the connected EEG channels in the selected electrode montage when extracting spatial features. From two widely used approaches for building an attention mechanism, 1) a network built with GAT layers [20] or 2) a network built with scaled dot-product attention [23], we opted for a GAT as it slightly outperforms scaled dot-product attention as shown in Fig. 3. In addition to this, the fact that the brain network could be modeled as a graph motivated us to apply a GAT network to extract spatial features.

In a GAT layer, each node aggregates features from adjacent nodes and constructs a new feature set for itself. Given the feature sets for each node, $H \in \mathbb{R}^{12\times F}$, a learnable weight matrix $W \in \mathbb{R}^{F\times F'}$ is required to linearly transform the input features to high-level features by the simple matrix multiplication $H\times W \in \mathbb{R}^{12\times F'}$. Next, a shared masked self-attention is performed to compute the attention coefficients, with masking done according to the adjacency matrix. Equation (2), from Peter et al. [20], explains how to compute the attention coefficient from the $j^{th}$ node to the $i^{th}$ node ($\alpha_{ij}$):

$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left([\vec{h}_{i}W \,\|\, \vec{h}_{j}W]\,\vec{a}^{T}\right)\right)}{\sum_{k\in\mathcal{N}_{i}} \exp\left(\text{LeakyReLU}\left([\vec{h}_{i}W \,\|\, \vec{h}_{k}W]\,\vec{a}^{T}\right)\right)} \tag{2}$$

Here, $\vec{h}_{i}$ denotes a row of the $H$ matrix, $\mathcal{N}_{i}$ denotes all the neighbor nodes of node $i$ and the node itself, $\vec{a}\in\mathbb{R}^{2F'}$ is a learnable weight vector, and $\|$ represents concatenation. Once these attention coefficients are obtained, the new feature set for each node is calculated by a simple non-linear transformation:

$$H' = ELU(AHW) \tag{3}$$

where $A \in \mathbb{R}^{12\times 12}$ is the masked attention coefficient matrix.

This mask makes sure that a node pays attention only to itself and its first-order neighbors. Hence, we apply 3 GAT layers after the CNN encoder to achieve an optimal spatial receptive field by aggregating features from the $3^{rd}$-order neighbors, which efficiently covers 78% of the brain network. If one or two more layers are applied, the receptive field increases; however, this does not improve the network efficiency or model performance. It was therefore found experimentally that the optimal number of GAT layers is 3. Here, as shown in Fig. 1, the output feature maps of GAT layers 1, 2, and 3 have the shapes $(12\times 37)$, $(12\times 32)$, and $(12\times 16)$, respectively. In their work, Raeis et al. [11] adopted a similar approach for feature map selection, while employing 18 EEG channels.
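Equations (2) and (3) and the stacking of three layers can be sketched in NumPy as follows. This is a single-head sketch under stated assumptions, not the trained model: the LeakyReLU slope of 0.2, the random weights, the ring-shaped placeholder adjacency, and the input feature size of 24 are hypothetical. Only the output sizes 37, 32, and 16 come from the text.

```python
import numpy as np

def leaky_relu(z, slope=0.2):                     # slope is an assumed value
    return np.where(z > 0, z, slope * z)

def elu(z):
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))

def gat_layer(H, W, a, adj):
    """Single-head graph attention layer following Eqs. (2) and (3).

    H: (N, F) node features, W: (F, F') weights, a: (2F',) attention vector,
    adj: (N, N) binary adjacency that includes self-loops.
    """
    HW = H @ W                                    # (N, F')
    f_out = HW.shape[1]
    # [h_i W || h_j W] a^T separates into a term for node i plus a term for node j.
    e = leaky_relu((HW @ a[:f_out])[:, None] + (HW @ a[f_out:])[None, :])
    e = np.where(adj > 0, e, -np.inf)             # mask: only first-order neighbours
    e = e - e.max(axis=1, keepdims=True)          # numerical stability
    A = np.exp(e)
    A = A / A.sum(axis=1, keepdims=True)          # softmax over N_i, Eq. (2)
    return elu(A @ HW), A                         # Eq. (3)

rng = np.random.default_rng(0)
n = 12
# Placeholder graph: self-loops plus a ring, standing in for the graph of Fig. 2(b).
adj = np.eye(n) + np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)

sizes = [24, 37, 32, 16]                          # 24 is hypothetical; the rest follow Fig. 1
H = rng.standard_normal((n, sizes[0]))
for f_in, f_out in zip(sizes[:-1], sizes[1:]):
    W = rng.standard_normal((f_in, f_out)) * 0.1
    a = rng.standard_normal(2 * f_out) * 0.1
    H, A = gat_layer(H, W, a, adj)
```

Each layer mixes information only across directly connected nodes, so after three layers a node's features depend on nodes up to three hops away, which is the receptive-field argument made above.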

II-A3 Classification Head

After extracting temporal and spatial features with the proposed CNN encoder and GAT network, we perform the classification task through a multilayer perceptron (MLP). This network consists of 3 dense layers (fully connected layers) of 32, 16, and 1 neurons, respectively. The first and second dense layers are followed by the ReLU function, while the final layer is followed by the Sigmoid function. A global average pooling layer is applied along the temporal axis before the 3 dense layers in order to reduce the GAT output dimensions to $(12\times 1)$.
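A forward pass through such a head might look like the sketch below. The weights are random placeholders and the parameter names (`W1`, `b1`, and so on) are hypothetical; only the layer sizes, activations, and pooling step come from the description above.

```python
import numpy as np

def classification_head(gat_out, params):
    """gat_out: (12, 16) output of the last GAT layer -> seizure probability."""
    x = gat_out.mean(axis=1)                                   # global average pooling -> (12,)
    x = np.maximum(x @ params["W1"] + params["b1"], 0.0)       # dense layer, 32 units, ReLU
    x = np.maximum(x @ params["W2"] + params["b2"], 0.0)       # dense layer, 16 units, ReLU
    logit = x @ params["W3"] + params["b3"]                    # dense layer, 1 unit
    return float(1.0 / (1.0 + np.exp(-logit[0])))              # sigmoid

rng = np.random.default_rng(0)
params = {
    "W1": rng.standard_normal((12, 32)) * 0.1, "b1": np.zeros(32),
    "W2": rng.standard_normal((32, 16)) * 0.1, "b2": np.zeros(16),
    "W3": rng.standard_normal((16, 1)) * 0.1,  "b3": np.zeros(1),
}
prob = classification_head(rng.standard_normal((12, 16)), params)
```

The pooling step leaves one value per channel, so the first dense layer sees a 12-element vector regardless of the feature width of the last GAT layer.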

II-B Model Interpretability

High transparency is essential in deep learning applications in medicine, which underlines the relevance of explainability in medical AI. To this end, this work proposes a new approach leveraging Grad-CAM [13], in which the gradient of the class activation (logit value) is obtained with respect to the activations of the last GAT layer to generate a heatmap of the shape of the input signal. Once the gradient computation is complete, the final GAT layer outputs are scaled by the mean of the gradients. The resulting values are passed through a ReLU activation function and then normalized using min-max normalization. This generates a heatmap, where 0 represents the least relevance and 1 represents the highest relevance to the output. This heatmap is then mapped to a standard colormap to visualize how relevant specific time periods in each of the 12 channels of a given EEG epoch are to the respective binary class of the model output. This is clearly visualized in Fig. 4 with a blue-white-red (bwr) colormap. Further, this step does not affect the binary class output of the deep learning model, as it runs as a post-classification task.
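The scaling, ReLU, and min-max steps can be sketched in NumPy once the activations and gradients of the last GAT layer are available. Two details below are assumptions, since the text does not specify them: the gradient mean is taken over the channel axis (one weight per feature), and the heatmap is resized to the epoch length by linear interpolation. The gradients here are random stand-ins; in practice they would come from automatic differentiation.

```python
import numpy as np

def relevance_heatmap(activations, gradients, n_samples=384):
    """activations, gradients: (channels, features) from the last GAT layer."""
    # Assumed reduction: average the gradients over channels, one weight per feature.
    weights = gradients.mean(axis=0, keepdims=True)
    cam = np.maximum(activations * weights, 0.0)               # scale, then ReLU
    span = cam.max() - cam.min()
    cam = (cam - cam.min()) / span if span > 0 else np.zeros_like(cam)  # min-max to [0, 1]
    # Assumed resizing: interpolate each channel's row to the input epoch length.
    src = np.linspace(0.0, 1.0, cam.shape[1])
    dst = np.linspace(0.0, 1.0, n_samples)
    return np.stack([np.interp(dst, src, row) for row in cam])

rng = np.random.default_rng(0)
acts = rng.standard_normal((12, 16))      # stand-in for last GAT layer activations
grads = rng.standard_normal((12, 16))     # stand-in for d(logit)/d(activations)
heatmap = relevance_heatmap(acts, grads)
```

The resulting array has one relevance value per channel and time sample, so it can be passed directly to a colormap such as bwr when plotting the epoch.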

III DATASET AND PRE-PROCESSING

In this study, the publicly available Helsinki Zenodo scalp EEG dataset [24] is used to train and test the deep learning model. This open-source dataset contains 74-min (median) long multi-channel EEG recordings, sampled at 256 Hz, from 79 term neonates admitted to the NICU at Helsinki University Hospital, Finland. It consists of 3 annotation files created by 3 independent trained neurologists. As a result, only 39 neonates were identified as having seizures by consensus, while 22 were identified as seizure-free.

This dataset, recorded with respect to a reference point, allows for the construction of a number of EEG channels. According to the American Clinical Neurophysiology Society recommendations, the electrode placement should follow the international 10-20 system modified for neonates. Consequently, most existing methods, including the state-of-the-art, utilize 18 EEG channels for training and evaluation. Even though the full array is recommended, the Minimum Technical Standards for Pediatric Electroencephalography state that it is acceptable to use a reduced array wherever necessary [25, 26]. Hence, our study employs just 12 channels, selected based on Tekgul et al. [21], to model the double banana-shaped reduced electrode montage. The specific channels used are Fp1-T3, T3-O1, Fp1-C3, C3-O1, Fp2-C4, C4-O2, Fp2-T4, T4-O2, T3-C3, C3-CZ, CZ-C4, and C4-T4, as shown in Fig. 2(a).

Signal pre-processing plays a crucial role when it comes to EEG signals due to their added noise and artifacts. Additionally, in this dataset, some EEG signals exhibit flat lines at 0 V, necessitating their removal before applying further processing techniques. We implement an automated procedure that removes these flat lines from the signals only if all selected channels within a specific time range contain flat lines. Subsequently, a bandpass Chebyshev type-II digital filter with cutoff frequencies of 1 Hz and 16 Hz is applied forward and backward to the signals to eliminate baseline drift and high-frequency noise components, including power line interference. As mentioned in Section II-A1, the deep learning model utilizes 12-second EEG epochs, so maintaining the original sampling frequency of 256 Hz would significantly increase model complexity and training and inference times. Therefore, all signals are down-sampled to 32 Hz, taking care to avoid aliasing, as in [7]. Finally, these EEG epochs are normalized to scale the EEG signal amplitudes.
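The filtering and down-sampling steps could be sketched with SciPy as below. The filter order and stopband attenuation are illustrative assumptions, since the text specifies only the 1 Hz and 16 Hz band edges; flat-line removal and normalization are omitted.

```python
import numpy as np
from scipy.signal import cheby2, sosfiltfilt

FS_IN, FS_OUT = 256, 32

def preprocess(eeg, order=4, stop_atten_db=40.0):
    """eeg: (channels, samples) at 256 Hz -> band-limited signal at 32 Hz.

    `order` and `stop_atten_db` are assumed values chosen for illustration.
    """
    # For a Chebyshev type-II design, the critical frequencies mark the stopband edges.
    sos = cheby2(order, stop_atten_db, [1.0, 16.0], btype="bandpass",
                 output="sos", fs=FS_IN)
    filtered = sosfiltfilt(sos, eeg, axis=-1)     # forward-backward pass, zero phase
    step = FS_IN // FS_OUT                        # 256 / 32 = 8
    return filtered[..., ::step]                  # keep every 8th sample

t = np.arange(12 * FS_IN) / FS_IN                 # 12 s of samples at 256 Hz
mains = np.tile(np.sin(2 * np.pi * 50.0 * t), (12, 1))   # synthetic 50 Hz interference
clean = preprocess(mains)
```

Because content above 16 Hz is suppressed before every 8th sample is kept, the new Nyquist frequency of 16 Hz is respected, and a 12-second window yields the 384 samples expected by the CNN encoder.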

We select only the 39 neonates with seizures identified by consensus to train and evaluate our model, as the recordings of the 22 seizure-free neonates would cause a severe class imbalance, and the recordings of the remaining 18 neonates, for which the annotators did not reach consensus, could be mislabelled as actual seizures. Although only neonates with seizures by consensus are selected, there is still a significant difference between the total seizure duration and the non-seizure duration: the total seizure duration represents only 18.14% of the total EEG signal duration. To address this class imbalance issue, we adopted a technique from [11], where consecutive 12-second epochs overlap by 11 seconds for seizure segments and by 10 seconds for non-seizure segments. While this approach does not entirely eliminate the class imbalance, it effectively reduces its impact, yielding a seizure-to-non-seizure epoch ratio of approximately 1:2. The remaining class imbalance is mitigated by applying focal binary cross-entropy loss in training.
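A minimal sketch of the overlapping segmentation follows, assuming signals already down-sampled to 32 Hz. The helper name `segment` and the 60-second toy recordings are illustrative.

```python
import numpy as np

FS = 32          # sampling rate after down-sampling (Hz)
EPOCH_S = 12     # epoch length (s)

def segment(signal, overlap_s):
    """Cut (channels, samples) into 12 s epochs overlapping by `overlap_s` seconds.

    Assumes the signal is at least one epoch long.
    """
    win = EPOCH_S * FS
    stride = (EPOCH_S - overlap_s) * FS           # 1 s stride for seizure, 2 s otherwise
    starts = range(0, signal.shape[-1] - win + 1, stride)
    return np.stack([signal[..., s:s + win] for s in starts])

toy = np.zeros((12, 60 * FS))                     # 60 s of 12-channel signal
seizure_epochs = segment(toy, overlap_s=11)       # shape (49, 12, 384)
non_seizure_epochs = segment(toy, overlap_s=10)   # shape (25, 12, 384)
```

For equal durations, the 1-second stride produces about twice as many epochs as the 2-second stride. Applied to the reported 18.14% seizure share, and ignoring boundary effects, this gives roughly 0.1814/1 seizure epochs for every 0.8186/2 non-seizure epochs, or about 1:2.3, in line with the approximate 1:2 ratio.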

IV MODEL TRAINING

TABLE I: Model performance comparison. CV - Cross Validation

| Number of EEG channels | Method | Accuracy (mean ± std) | AUC, median (IQR) | AUC, mean ± std | Recall | Precision | Kappa |
|---|---|---|---|---|---|---|---|
| 18¹ | MSC-GCNN [27] | - | 99.10 (96.80, 99.60) | 94.70 ± 10.90 | 96.71 | - | 0.80 |
| 18¹ | PLV-GCNN [27] | - | 99.00 (95.20, 99.70) | 94.10 ± 10.50 | 95.30 | - | 0.79 |
| 18¹ | SD-GCNN [27] | - | 97.30 (86.30, 99.60) | 90.09 ± 13.50 | 96.68 | - | 0.71 |
| 18¹ | ST-GAT (FL) [11] | - | 99.30 (96.40, 99.50) | 96.60 ± 8.90 | 98.00 | - | 0.88 |
| 12, 10-fold CV | ST-GAT (FL)² | 80.29 ± 9.48 | 83.98 (77.80, 90.90) | 83.15 ± 8.85 | 39.98 | 94.91 | 0.43 |
| 12, 10-fold CV | Our method | 89.02 ± 2.91 | 91.84 (88.57, 95.21) | 91.46 ± 4.36 | 82.84 | 94.23 | 0.89 |
| 12, (80%-20%) | ST-GAT (FL)² | 88.80 | - | 91.71 | 66.89 | 95.17 | 0.71 |
| 12, (80%-20%) | Our method | 91.56 | - | 94.42 | 83.22 | 88.61 | 0.80 |

¹ These models were trained with 18-channel full montage EEG data. Since no prior study had been conducted using the 12-channel reduced montage, we report these here to demonstrate that our model performs better than SOTA methods in terms of Cohen's kappa while having a reduced number of channels.
² We retrained ST-GAT (FL), the best variant of the ST-GAT, for the 12-channel reduced montage and reported the evaluated results to show a fair comparison between our model and the current best SOTA model.

This section describes our model training and evaluation criteria. The model is trained and evaluated using two approaches: 1) allocating 31 randomly selected subjects to the training dataset (approximately 80%) and the remaining 8 subjects to the test dataset (approximately 20%) from the 39 neonates with seizures by consensus, and 2) performing 10-fold cross-validation on these 39 neonates. The model is trained for 100 epochs with a mini-batch size of 512, and the Adam optimizer is applied with a learning rate of 0.002, with focal binary cross-entropy loss with $\gamma=2$ and $\alpha=0.4$ as the loss function to address the class imbalance between seizure and non-seizure samples. As the proposed model has significantly fewer parameters than other deep learning models, with only 46,612 trainable and 208 non-trainable parameters, one training epoch takes 45 s on average to finish. We apply dropout with a probability of 0.2 at the end of convolutional Blocks 1, 2, and 3, and between every GAT layer and dense layer. Additionally, an $L_{2}$ kernel regularizer with a regularization value of 0.0001 is used. These hyperparameters were selected after a comprehensive ablation study.
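For reference, a common form of the focal binary cross-entropy loss can be written in a few lines of NumPy. The sketch assumes the usual alpha-balanced formulation, in which $\alpha$ weights the positive (seizure) class and $1-\alpha$ the negative class. Whether the training code uses exactly this convention is not stated in the text.

```python
import numpy as np

def focal_bce(y_true, p, gamma=2.0, alpha=0.4, eps=1e-7):
    """Mean focal binary cross-entropy.

    y_true: labels in {0, 1}; p: predicted seizure probabilities.
    Assumed convention: alpha weights the positive class, (1 - alpha) the negative class.
    """
    p = np.clip(p, eps, 1.0 - eps)
    pos = -alpha * y_true * (1.0 - p) ** gamma * np.log(p)
    neg = -(1.0 - alpha) * (1.0 - y_true) * p ** gamma * np.log(1.0 - p)
    return float(np.mean(pos + neg))

y = np.array([1.0, 1.0])
easy = focal_bce(y[:1], np.array([0.9]))   # confidently correct: strongly down-weighted
hard = focal_bce(y[1:], np.array([0.1]))   # confidently wrong: dominates the loss
```

The factor $(1-p)^{\gamma}$ shrinks the contribution of well-classified samples, so training focuses on the harder and typically rarer seizure epochs.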

V RESULTS AND DISCUSSION

Table I compares the performance results of previously published models with our proposed method. A direct comparison with the previously published methods is not like-for-like, because they incorporate 18 EEG channels obtained from the full electrode montage. In order to ensure a fair comparison with the SOTA, we retrained the best performing model among the current SOTA, ST-GAT (FL), which is the best variant of ST-GAT as in [11], for the 12-channel reduced montage and reported the results. Our model performs better when evaluated with 10-fold cross-validation, where it achieves an absolute improvement of 8.31% and 42.86% in mean AUC and recall, respectively. Further, with 80% of the data used for training and 20% for testing, the model achieves an absolute improvement of 2.71% and 16.33% in AUC and recall, respectively. In addition, our model evaluated with 10-fold cross-validation has the highest Cohen's kappa value despite the reduced montage.
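The absolute improvements quoted above are differences in percentage points between the two 12-channel rows of Table I. The short check below reproduces them, with the values copied from the table and a hypothetical helper name.

```python
# (AUC mean, recall) in percent, copied from Table I for the 12-channel montage.
TABLE_I = {
    "10-fold CV": {"ST-GAT (FL)": (83.15, 39.98), "Our method": (91.46, 82.84)},
    "80%-20%":    {"ST-GAT (FL)": (91.71, 66.89), "Our method": (94.42, 83.22)},
}

def absolute_improvement(setting):
    """Percentage-point gain of our method over the retrained baseline."""
    base = TABLE_I[setting]["ST-GAT (FL)"]
    ours = TABLE_I[setting]["Our method"]
    return tuple(round(o - b, 2) for o, b in zip(ours, base))

cv_gain = absolute_improvement("10-fold CV")      # (8.31, 42.86)
split_gain = absolute_improvement("80%-20%")      # (2.71, 16.33)
```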

Our model not only detects seizures in EEG epochs, but also provides interpretable outputs for real-time analysis, highlighting the specific channels and time windows that are most critical for the model's decision. By visualizing these regions using a bwr colormap as in Fig. 4, we can assess the model's ability to understand and interpret the input. The first subplot of Fig. 4 shows the true labels and predicted seizure probability for a 7.5-minute long EEG signal. As shown in the next 12 subplots, our model effectively differentiates between seizure activity (from 4 mins 25 secs onwards) and the seizure-free period (up to 4 mins 25 secs). This demonstrates the model's capability to distinguish between these two crucial elements in EEG seizure analysis, even though the artifacts in the seizure-free region are in the same amplitude range as seizures. Since understanding how the model identifies a given EEG epoch as a seizure epoch is more important, we visualize the heatmaps only for EEG epochs detected as seizure by the model; all other EEG epochs are shown in blue. The last subplot of Fig. 4, a zoomed-in time window of the Fp1-T3 channel, shows the onset of the seizure, which demonstrates how specific channels and time windows of the EEG recording were critical to detecting the occurrence of the seizure.

Figure 4: Top subplot: Comparison between true labels and model prediction probabilities. Next 12 subplots: Visualization of 7.5 minutes EEG, where the recording is observed to be seizure free up to 4 mins 25 secs, and a seizure occurs past this point. Last subplot: A zoomed in version for better visualization of seizure onset and how the relevance changes.

VI CONCLUSION

Although neonatal seizure detection is a challenging task even for experienced professionals, well-defined signal processing techniques and deep learning models can make the task less challenging. By leveraging explainable AI and a seizure detection probability distribution, not only experienced professionals but also less experienced professionals can gain the skillset to diagnose seizure events accurately and provide prompt management. To this end, we have presented an efficient, reliable, and unique deep learning architecture built upon a CNN encoder to extract temporal features, a GAT network to extract spatial features, and a binary classification head. On average, it takes only 62 ms to detect seizures in a 12-second EEG epoch on the CPU, with even faster processing (32 ms) on the GPU. Beyond its ability to detect seizures in real-time with a reduced montage, this model offers the unique advantage of real-time interpretability, which allows for quick and insightful analysis. A modified version of Grad-CAM is employed to explain the model's binary class output, demonstrating which channels and time windows the model has attended to when a seizure is detected. The reduced montage employs only 9 electrodes, making it easy to prepare the subject for testing and increasing patient comfort.

For future work, we see great potential for the medical field in improving the model’s performance with real-time fast artifact removal, and embedded machine learning for real-time seizure detection. An important task is to improve model performance by self-supervised training with a large, unlabeled EEG dataset. To accomplish this, as our next step, we are planning to test this proposed trained model with data collected under the supervision of trained experts from The Lady Ridgeway Hospital in Colombo, Sri Lanka.

ACKNOWLEDGEMENT

The authors would like to extend their gratitude to the department for allowing them to access the university computational resources to carry out this research successfully, which were funded by the Accelerating Higher Education Expansion and Development (AHEAD) Operation of the Ministry of Higher Education of Sri Lanka, funded by the World Bank. The authors would also like to thank the staff of The Lady Ridgeway Hospital in Colombo and Department of Paediatrics, Faculty of Medicine, University of Colombo for their dedication and support to Neonatal Care in Sri Lanka.

References

  • [1] Kang Seok Kyu, and Shilpa D. Kadam. “Neonatal seizures: impact on neurodevelopmental outcomes.” Frontiers in Pediatrics 3 (2015): 101.
  • [2] Panayiotopoulos C. P. “Neonatal seizures and neonatal syndromes.” In The epilepsies: seizures, syndromes and management. Bladon Medical Publishing, 2005.
  • [3] Mwaniki Michael, Ali Mathenge, Samson Gwer, Neema Mturi, Evasius Bauni, Charles RJC Newton, James Berkley, and Richard Idro. “Neonatal seizures in a rural Kenyan District Hospital: aetiology, incidence and outcome of hospitalization.” BMC medicine 8 (2010): 1-8.
  • [4] Pisani Francesco, Carlotta Facini, Elisa Bianchi, Giorgia Giussani, Benedetta Piccolo, and Ettore Beghi. “Incidence of neonatal seizures, perinatal risk factors for epilepsy and mortality after neonatal seizures in the province of Parma, Italy.” Epilepsia 59, no. 9 (2018): 1764-1773.
  • [5] Wanigasinghe J, Kapurubandara R, Arambepola C, Sri Ranganathan S, and Philips J. 2017. “Incidence of neonatal seizures in babies born in two premier maternity hospitals within Colombo city.” Association of Sri Lankan Neurologists 11th Annual Academic Sessions. 43-44.
  • [6] Young G. Bryan. “Continuous EEG monitoring in the ICU: challenges and opportunities.” The Canadian Journal of Neurological Sciences. Le Journal Canadien des Sciences Neurologiques 36 (2009): S89-91.
  • [7] Temko Andriy, Eoin Thomas, William Marnane, Gordon Lightbody, and G. Boylan. “EEG-based neonatal seizure detection with support vector machines.” Clinical Neurophysiology 122, no. 3 (2011): 464-473.
  • [8] Hossain M. Shamim, Syed Umar Amin, Mansour Alsulaiman, and Ghulam Muhammad. “Applying deep learning for epilepsy seizure detection and brain mapping visualization.” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15, no. 1s (2019): 1-17.
  • [9] Thodoroff Pierre, Joelle Pineau, and Andrew Lim. “Learning robust features using deep learning for automatic seizure detection.” In Machine learning for healthcare conference, pp. 178-190. PMLR, 2016.
  • [10] Li Ziyue, Yuchen Fang, You Li, Kan Ren, Yansen Wang, Xufang Luo, Juanyong Duan, Congrui Huang, Dongsheng Li, and Lili Qiu. “Protecting the Future: Neonatal Seizure Detection with Spatial-Temporal Modeling.” In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 196-201. IEEE, 2023.
  • [11] Raeisi Khadijeh, Mohammad Khazaei, Gabriella Tamburro, Pierpaolo Croce, Silvia Comani, and Filippo Zappasodi. “A class-imbalance aware and explainable spatio-temporal graph attention network for neonatal seizure detection.” International Journal of Neural Systems 33, no. 9 (2023): 2350046.
  • [12] Giudice Michele Lo, Nadia Mammone, Cosimo Ieracitano, Maurizio Campolo, Arcangelo Ranieri Bruna, Valeria Tomaselli, and Francesco Carlo Morabito. “Visual explanations of deep convolutional neural network for eye blinks detection in EEG-based BCI applications.” In 2022 International Joint Conference on Neural Networks (IJCNN), pp. 1-8. IEEE, 2022.
  • [13] Selvaraju Ramprasaath R., Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. “Grad-CAM: Visual explanations from deep networks via gradient-based localization.” In Proceedings of the IEEE International Conference on Computer Vision, pp. 618-626. 2017.
  • [14] Das Sudip, Pankaj Pandey, and Krishna Prasad Miyapuram. “Improving self-supervised pretraining models for epileptic seizure detection from EEG data.” arXiv preprint arXiv:2207.06911 (2022).
  • [15] Cai Donghong, Junru Chen, Yang Yang, Teng Liu, and Yafeng Li. “MBrain: A Multi-channel Self-Supervised Learning Framework for Brain Signals.” In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 130-141. 2023.
  • [16] Krizhevsky Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems 25 (2012).
  • [17] Gehring Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. “Convolutional sequence to sequence learning.” In International conference on machine learning, pp. 1243-1252. PMLR, 2017.
  • [18] He Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778. 2016.
  • [19] Ioffe Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” In International conference on machine learning, pp. 448-456. PMLR, 2015.
  • [20] Veličković Petar, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. “Graph Attention Networks.” In International Conference on Learning Representations. 2018.
  • [21] Tekgul Hasan, Blaise FD Bourgeois, Kimberlee Gauvreau, and Ann M. Bergin. “Electroencephalography in neonatal seizures: comparison of a reduced and a full 10/20 montage.” Pediatric neurology 32, no. 3 (2005): 155-161.
  • [22] Stevenson, Nathan J., Leena Lauronen, and Sampsa Vanhatalo. “The effect of reducing EEG electrode number on the visual interpretation of the human expert for neonatal seizure detection.” Clinical Neurophysiology 129, no. 1 (2018): 265-270.
  • [23] Vaswani Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” Advances in neural information processing systems 30 (2017).
  • [24] Stevenson, Nathan J., Karoliina Tapani, Leena Lauronen, and Sampsa Vanhatalo. “A dataset of neonatal EEG recordings with seizure annotations.” Scientific data 6, no. 1 (2019): 1-8.
  • [25] Shellhaas Renée A., Taeun Chang, Tammy Tsuchida, Mark S. Scher, James J. Riviello, Nicholas S. Abend, Sylvie Nguyen, Courtney J. Wusthoff, and Robert R. Clancy. “The American Clinical Neurophysiology Society’s guideline on continuous electroencephalography monitoring in neonates.” Journal of clinical neurophysiology 28, no. 6 (2011): 611-617.
  • [26] Kuratani John, Phillip L. Pearl, Lucy R. Sullivan, Rosario Maria S. Riel-Romero, Janna Cheek, Mark M. Stecker, Daniel San Juan Orta et al. “American clinical neurophysiology society guideline 5: minimum technical standards for pediatric electroencephalography.” The Neurodiagnostic Journal 56, no. 4 (2016): 266-275.
  • [27] Raeisi Khadijeh, Mohammad Khazaei, Pierpaolo Croce, Gabriella Tamburro, Silvia Comani, and Filippo Zappasodi. “A graph convolutional neural network for the automated detection of seizures in the neonatal EEG.” Computer methods and programs in biomedicine 222 (2022): 106950.