Evaluating Fast Adaptability of Neural Networks for Brain-Computer Interface

Anupam Sharma Department of Computer Science & Engineering
Indian Institute of Technology Gandhinagar
Gujarat, India
[email protected]
   Krishna Miyapuram Department of Cognitive & Brain Sciences
Indian Institute of Technology Gandhinagar
Gujarat, India
[email protected]
Abstract

Electroencephalography (EEG) classification is a versatile and portable technique for building non-invasive brain-computer interfaces (BCI). However, classifiers that decode cognitive states from EEG data perform poorly when tested on new domains, such as tasks or individuals absent during model training. Researchers have recently used complex strategies like model-agnostic meta-learning (MAML) for domain adaptation. Nevertheless, there is a need for a strategy to evaluate the fast adaptability of the models, as this characteristic is essential for quick calibration in real-life BCI applications. We used motor movement and imagery signals as input to a convolutional neural network (CNN) based classifier for the experiments. Datasets with EEG signals typically have fewer examples and higher time resolution. Even though batch-normalization is preferred for CNNs, we empirically show that layer-normalization can improve the adaptability of CNN-based EEG classifiers with no more than ten fine-tuning steps. In summary, the present work (i) proposes a simple strategy to evaluate fast adaptability, and (ii) empirically demonstrates fast adaptability across individuals as well as across tasks with simple transfer learning as compared to the MAML approach.

Index Terms:
Electroencephalography, Brain-computer Interface, Convolutional Neural Network, Transfer Learning, Meta Learning

I Introduction

Brain-computer interface (BCI) systems provide an approach for devising efficient assistive technologies in rehabilitation and healthcare. Electroencephalography (EEG), being the most versatile technique for non-invasively recording brain activity, is well suited to building BCI systems. The EEG technique records electrical signals in the brain through electrodes placed on the human scalp. Because of its suitability for BCI systems, efficient EEG classifiers are essential for wide-scale applications. In the present work, we focus on EEG classification with fast domain adaptation for a well-known BCI application: signals recorded while performing body movements (motor movement) and imagining body movements (motor imagery).

Researchers have widely used traditional machine-learning approaches for EEG classification. However, these approaches need extensive feature extraction. With advancements in deep learning, deep neural networks (DNN) are being used widely due to their capability of extracting features implicitly [1, 2]. In this domain, convolutional neural networks (CNN) have been the most widely used neural architecture for EEG classification [1, 2].

A significant challenge in classifying EEG into corresponding cognitive states is the small dataset size and the variation in the signals across individuals. This variation degrades the performance of EEG classifiers [1, 3], necessitating calibration or adaptation when they are applied to other domains such as newer individuals or task activities. However, it is not feasible in real-time BCI applications to perform substantial calibration for each task and individual. Therefore, it is imperative to devise techniques for fast adaptability of EEG classifiers.

Inspired by computer vision, most CNN-based EEG classifiers use batch-norm. However, EEG datasets are smaller, necessitating a tradeoff between batch size and the number of iterations per epoch. In addition, for fast calibration on new individuals or tasks, collecting large samples is not feasible in real-time applications, making batch-norm inappropriate for fast adaptability. However, EEG signals are long sequences in the time domain, which is enough to calculate statistics for normalization, making layer-norm perfectly applicable for such applications. We empirically show that layer-norm helps adapt EEG classifiers notably faster than the classifiers with batch-norm on new individuals as well as on new activities.

In summary, it is important that BCI systems take minimal time for adaptation in practical applications when used for newer individuals or activities with few samples. Therefore, in this work:

  1. We show that changes in normalization techniques following the characteristics of EEG signals can help improve adaptability with vanilla transfer learning.

  2. We propose an effective evaluation strategy where we evaluate the fast adaptability of two popular training strategies, MAML and transfer learning, by limiting the fine-tuning iterations to ten, and we show that transfer learning adapts much faster than MAML. Moreover, we evaluate fast adaptability not only on newer individuals but also on signals from newer tasks or activities.

II Related Work

To tackle the problem of variations in EEG signals, several approaches have been proposed. In [4], the authors proposed a Graph-based Convolutional Recurrent Attention Model (G-CRAM), which uses a graph structure to represent the positioning of EEG electrodes and a convolutional recurrent attention model to learn EEG features. In [5], the authors used weighted feature fusion in a CNN for motor imagery EEG decoding. Researchers have also used feature extraction methods before applying deep neural networks. In [6], the authors combined common spatial patterns with a CNN to develop the decoding model. In addition to generalizing EEG decoders across individuals, researchers have also explored transfer-learning techniques to improve the decoding performance [1, 7]. In [8], the authors introduced the inception module and residual connections in their model architecture and pre-trained the model on a large dataset, which they curated from 280 individuals by collecting the data from electrodes common to multiple datasets. Researchers have also explored meta-learning methods, which observe learning approaches on different tasks and then use this experience to solve a new task faster [9]. One such meta-learning algorithm is model-agnostic meta-learning (MAML) [10], which researchers have started exploring recently [11, 12, 13, 14, 15]. For example, in [11], the authors treated each individual as a different classification task and used MAML for adaptation to newer individuals.

III Methodology

The principal idea of our work is to investigate the fast adaptability of EEG decoding deep neural networks when applied to the signals of newer individuals or signals recorded while performing different activities. Hence, we can have two adaptability tasks:

  • Across individual: Introducing signals of newer individuals, making it the same classification task but on signals of different individuals (In the subsequent section, we often use the term “subject” to refer to an individual).

  • Across Task activity: Introducing signals recorded on newer activities, changing the set of labels to classify, making it a different classification task altogether.

III-A Training strategies

To evaluate fast adaptability, we limited the number of fine-tuning iterations to 10 and compared the classification performance of two training strategies. In this section, we elaborate on the two training strategies used for the domain adaptation experiments for EEG classification.

Model-agnostic meta-learning (MAML) [10]. This strategy is specifically applicable in the few-shot meta-learning setup and is well adapted in the BCI domain [11, 12, 13]. Its goal is to learn parameters so the model can adapt to new tasks using only a few data points and iterations. In this work, we have adapted MAML for EEG classification across subjects as in [11]. Considering the model with parameters $\theta$, the algorithm first samples a batch of $N$ subjects $\mathcal{S}_i$. For each subject $\mathcal{S}_i$, the algorithm samples a support set $\mathcal{D}=\{\mathbf{x}^j,\mathbf{y}^j\}$ and a query set $\mathcal{D}'=\{\mathbf{x}^j,\mathbf{y}^j\}$ representing an $n$-way $k$-shot classification task and creates a copy of subject-specific parameters $\theta_i$. These parameters are learned for each subject $\mathcal{S}_i$ separately on the support set for one or more gradient descent steps. For example, for a subject $\mathcal{S}_i$ and one step of gradient descent, the parameters would be updated as:

$$\phi_i^1 = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{S}_i,\mathcal{D}}(\theta) \tag{1}$$

Finally, the model parameters are updated by evaluating gradients of loss on query-set $\mathcal{D}'$ using updated parameters $\phi_i^n$ across subjects. For example, using gradient descent, the model parameters are updated as:

$$\begin{split}\theta &= \theta - \beta \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \mathcal{L}_{\mathcal{S}_i,\mathcal{D}'}(\phi_i^n) \\ &= \theta - \beta \frac{1}{N}\sum_{i=1}^{N} \nabla_{\phi_i^n} \mathcal{L}_{\mathcal{S}_i,\mathcal{D}'}(\phi_i^n)\, \nabla_\theta \phi_i^n \end{split} \tag{2}$$

Equation 2 is referred to as the meta-update. However, the term $\nabla_\theta \phi_i^n$ involves higher-order derivatives:

$$\begin{split}\nabla_\theta \phi_i^n &= \prod_{m=1}^{n} \nabla_{\phi_i^{m-1}} \phi_i^m \\ &= \prod_{m=1}^{n} \left( \mathbf{I} - \alpha \nabla_{\phi_i^{m-1}}^{2} \mathcal{L}_{\mathcal{S}_i,\mathcal{D}}(\phi_i^{m-1}) \right) \end{split} \tag{3}$$

The terms involving higher-order derivatives are costly to compute. Therefore, we adopt the first-order approximation of MAML, where the algorithm ignores the terms involving higher-order derivatives and eq. 2 reduces to:

$$\theta = \theta - \beta \frac{1}{N}\sum_{i=1}^{N} \nabla_{\phi_i^n} \mathcal{L}_{\mathcal{S}_i,\mathcal{D}'}(\phi_i^n) \tag{4}$$
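As a worked special case (our illustration, not part of the original derivation), consider a single inner step, $n = 1$, with $\phi_i^0 = \theta$. The product in eq. 3 then has one factor, and dropping its Hessian term turns the chain rule in eq. 2 into the first-order gradient used in eq. 4:

```latex
\begin{aligned}
\nabla_\theta \phi_i^1
  &= \mathbf{I} - \alpha \nabla_\theta^{2} \mathcal{L}_{\mathcal{S}_i,\mathcal{D}}(\theta)
   \;\approx\; \mathbf{I}
   && \text{(first-order approximation)} \\
\nabla_\theta \mathcal{L}_{\mathcal{S}_i,\mathcal{D}'}(\phi_i^1)
  &= \nabla_{\phi_i^1} \mathcal{L}_{\mathcal{S}_i,\mathcal{D}'}(\phi_i^1)\, \nabla_\theta \phi_i^1
   \;\approx\; \nabla_{\phi_i^1} \mathcal{L}_{\mathcal{S}_i,\mathcal{D}'}(\phi_i^1)
\end{aligned}
```

The same argument applies factor by factor for $n > 1$, since every factor in eq. 3 has the form $\mathbf{I}$ minus a Hessian term scaled by $\alpha$.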

However, the approximate algorithm is still computationally expensive as it involves nested loops. The model is pre-trained by repeating this process for multiple iterations. The process is outlined in algorithm 1.

Algorithm 1 MAML adapted for EEG classification [10, 11]
Require: $p(\mathcal{S})$: distribution over subjects
Require: $\alpha, \beta$: inner-loop and meta learning rates, respectively
1:  Randomly initialize model parameters $\theta$
2:  while not done do
3:     Sample batch of subjects $\mathcal{S}_i \sim p(\mathcal{S})$ of size $N$
4:     for all $\mathcal{S}_i$ do
5:        Sample support data points $\mathcal{D}=\{\mathbf{x}^j,\mathbf{y}^j\}$ from $\mathcal{S}_i$
6:        Compute adapted parameters $\phi_i^n$ using $n$ gradient descent steps and loss function $\mathcal{L}_{\mathcal{S}_i,\mathcal{D}}(\theta)$
7:        Sample query data points $\mathcal{D}'=\{\mathbf{x}^j,\mathbf{y}^j\}$ from $\mathcal{S}_i$
8:        Compute loss $\mathcal{L}_{\mathcal{S}_i,\mathcal{D}'}(\phi_i^n)$ for meta-update
9:     end for
10:     Perform meta-update using eq. 4
11:  end while
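The loop structure of Algorithm 1 can be illustrated with a minimal, self-contained sketch. It is not the implementation used in our experiments: it assumes a toy one-dimensional linear regression in place of the EEG classifier, uses plain gradient descent for the meta-update exactly as written in eq. 4 (rather than the Adam meta-optimizer), and all names (`mse_grad`, `inner_adapt`, `fomaml_step`, `make_subject`) and learning rates are hypothetical.

```python
import random

def mse_grad(params, data):
    """Gradient of the mean squared error of y ~ w*x + b with respect to [w, b]."""
    w, b = params
    gw = gb = 0.0
    for x, y in data:
        err = w * x + b - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    n = len(data)
    return [gw / n, gb / n]

def inner_adapt(theta, support, alpha, n_steps):
    """Copy theta and take n_steps gradient-descent steps on the support set."""
    phi = list(theta)
    for _ in range(n_steps):
        grad = mse_grad(phi, support)
        phi = [p - alpha * g for p, g in zip(phi, grad)]
    return phi

def fomaml_step(theta, subjects, alpha, beta, n_steps, rng, k_support=10, k_query=11):
    """One first-order meta-update: average the query gradients taken at phi_i^n."""
    meta_grad = [0.0] * len(theta)
    for sample in subjects:
        support = sample(rng, k_support)
        phi = inner_adapt(theta, support, alpha, n_steps)
        query = sample(rng, k_query)
        grad = mse_grad(phi, query)  # gradient w.r.t. phi, as in eq. 4
        meta_grad = [m + g for m, g in zip(meta_grad, grad)]
    n = len(subjects)
    return [t - beta * m / n for t, m in zip(theta, meta_grad)]

def make_subject(slope, offset):
    """Hypothetical 'subject': a sampler for noise-free points on its own line."""
    def sample(rng, k):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        return [(x, slope * x + offset) for x in xs]
    return sample

rng = random.Random(0)
all_subjects = [make_subject(rng.uniform(1.0, 3.0), rng.uniform(-0.5, 0.5))
                for _ in range(20)]

theta = [0.0, 0.0]
for _ in range(200):                      # "while not done"
    batch = rng.sample(all_subjects, 4)   # batch of N = 4 subjects
    theta = fomaml_step(theta, batch, alpha=0.01, beta=0.05, n_steps=5, rng=rng)
```

The nested loops (over meta-iterations, subjects, and inner steps) are visible directly in the sketch, which is why even the first-order variant remains comparatively expensive.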

Transfer learning. Research communities from different domains have used this strategy for various purposes. Considering the example of the classification task, in transfer learning, we first train the model to perform $n$-way classification on a large dataset, including samples from multiple subjects. Given a new subject representing a new $m$-way classification task, we fine-tune the pre-trained model on the new task for a few iterations by updating all the parameters on samples of the new subject.

III-B Model architecture

We used a variant of the EEGNet [16] architecture for our experiments. EEGNet is a compact convolutional neural network (CNN) for decoding EEG signals. We selected a CNN as it has been widely used for EEG classification [1, 2]. Recent works have also used transformer-based models [17, 18, 19], but these are not suitable when data is limited, as in our case.

Inspired by computer vision, most CNN-based EEG classifiers use batch-norm. However, EEG datasets are smaller, necessitating a tradeoff between batch size and the number of iterations per epoch, and batch-norm is not preferable for small batch sizes. EEG signals, on the other hand, are long sequences, which makes them well suited to layer-norm. Moreover, layer-norm is independent of batch size. Furthermore, as our objective is to adapt to newer tasks quickly, we should refrain from using the statistics calculated during the pre-training phase. We used an architecture similar to EEGNet, except that we replaced batch-norm with layer-norm. We empirically show that a model with layer-norm works better than a model with batch-norm in our experiments. The model architecture is described in Table I and remained fixed in all the experiments.
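The dependence on batch composition can be made concrete with a small sketch. It is a simplified illustration under stated assumptions, not the layers used in our model: the signals are synthetic one-dimensional sequences, the normalizers have no learnable scale or shift, and the function names are hypothetical.

```python
import math

def layer_norm(sample, eps=1e-5):
    """Normalize one sample with the mean and variance of its own time points."""
    mean = sum(sample) / len(sample)
    var = sum((v - mean) ** 2 for v in sample) / len(sample)
    return [(v - mean) / math.sqrt(var + eps) for v in sample]

def batch_norm(batch, eps=1e-5):
    """Normalize every time point with statistics computed across the batch."""
    n, n_feat = len(batch), len(batch[0])
    means = [sum(s[f] for s in batch) / n for f in range(n_feat)]
    variances = [sum((s[f] - means[f]) ** 2 for s in batch) / n for f in range(n_feat)]
    return [[(s[f] - means[f]) / math.sqrt(variances[f] + eps) for f in range(n_feat)]
            for s in batch]

# One synthetic "recording" and two different batch companions.
x = [math.sin(0.1 * t) for t in range(320)]
other_a = [2.0 * math.cos(0.1 * t) + 1.0 for t in range(320)]
other_b = [0.5 * math.sin(0.3 * t) - 2.0 for t in range(320)]

# Layer-norm only ever sees x, so its output cannot depend on the batch.
ln_out = layer_norm(x)

# Batch-norm output for the same x changes with whatever else is in the batch.
bn_with_a = batch_norm([x, other_a])[0]
bn_with_b = batch_norm([x, other_b])[0]
bn_shift = max(abs(p - q) for p, q in zip(bn_with_a, bn_with_b))
```

With only two samples in the batch, the batch-norm output for `x` differs substantially between the two batches, whereas the layer-norm output is computed from the long sequence itself.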

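TABLE I: Model Architecture

| Block | Layer | #filters | Size | Others |
| --- | --- | --- | --- | --- |
| 1 | Conv2D | 8 | 1,64 | padding=same |
|   | LayerNorm | | | |
|   | DepthWiseConv2D | 16 | 64,1 | padding=valid |
|   | LayerNorm | | | |
|   | Elu Activation | | | |
|   | AveragePool2D | | 1,4 | strides=1,4 |
|   | DropOut | | | p=0.25 |
| 2 | SeparableConv2D | 16 | 1,16 | padding=same |
|   | LayerNorm | | | |
|   | Elu Activation | | | |
|   | AveragePool2D | | 1,8 | |
|   | DropOut | | | p=0.25 |
|   | Flatten | | | |
| 3 | Dense | | 160 | |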
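As a consistency check on Table I, the following sketch traces the tensor shape through the listed layers for one 64 × 321 input. It assumes the usual floor convention for unpadded pooling and a pooling stride equal to the kernel size in block 2 (the table gives no stride there); normalization, activation, and dropout layers are omitted because they do not change the shape. The helper names are hypothetical.

```python
def pool_out(size, kernel, stride):
    """Output length of an unpadded pooling layer (floor convention)."""
    return (size - kernel) // stride + 1

def trace_shapes(n_electrodes=64, n_times=321):
    """Follow (feature maps, height, width) through the layers of Table I."""
    trace = []
    h, w = n_electrodes, n_times
    # Block 1: temporal convolution with 'same' padding keeps height and width.
    trace.append(("Conv2D", (8, h, w)))
    # Depthwise convolution with a (64, 1) kernel and 'valid' padding
    # collapses the electrode axis to a single row.
    h = h - 64 + 1
    trace.append(("DepthWiseConv2D", (16, h, w)))
    # Average pooling with kernel (1, 4) and stride (1, 4).
    w = pool_out(w, 4, 4)
    trace.append(("AveragePool2D", (16, h, w)))
    # Block 2: separable convolution with 'same' padding keeps the shape.
    trace.append(("SeparableConv2D", (16, h, w)))
    # Average pooling with kernel (1, 8); stride assumed equal to the kernel.
    w = pool_out(w, 8, 8)
    trace.append(("AveragePool2D", (16, h, w)))
    # Flatten produces the feature vector that feeds the dense layer.
    trace.append(("Flatten", (16 * h * w,)))
    return trace

shape_trace = trace_shapes()
for name, shape in shape_trace:
    print(f"{name:16s} {shape}")
```

Under these assumptions the flattened vector has 16 × 1 × 10 = 160 features, which agrees with the value 160 listed for the Dense layer.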

IV Experiments

In this section, we discuss the effect of components of the training pipeline, such as normalization layers and training approach, on the adaptability of EEG classifiers. Our implementation is based on PyTorch [20] for Python. The MAML implementation is highly inspired by the work of Long [21]. The code for experiments is available online at github.com/anp-scp/fast_bci/.

IV-A Dataset and pre-processing

We used Physionet’s EEG Motor Movement/Imagery Dataset [22] for our experiments. The dataset includes EEG signals of 109 individuals recorded while performing the following body activities (specified as Task in the dataset):

  • Activity 1 (A1): Open and close left or right fist

  • Activity 2 (A2): Imagine opening and closing left or right fist

  • Activity 3 (A3): Open and close both fists or both feet

  • Activity 4 (A4): Imagine opening and closing both fists or both feet

The signals were recorded using an EEG device with 64 electrodes returning 64 time-domain EEG signals with a sampling frequency of 160 Hz. For pre-processing, we used MNE-python [23], a Python library for EEG and MEG analysis. Moreover, the dataset is readily available within the MNE-python library and can be used directly via the library's interface. We performed the following pre-processing steps:

  1. We applied a “firwin” band-stop filter allowing frequencies outside 7-30 Hz following [24].

  2. We segmented all the time-domain signals such that each segment starts 1 second before and ends 4 seconds after stimulus onset.

  3. We used the first 2 seconds of the segments, giving a matrix of shape 64 × 321. Here, 64 is the number of channels and 321 the number of time points (segments with fewer than 321 time points were dropped).
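The segment lengths in the steps above follow from the sampling frequency. The sketch below reproduces the arithmetic under the assumption that both end points of a time window are kept, which is what yields 321 rather than 320 points for 2 seconds at 160 Hz. The helper names and the zero-filled segments are hypothetical stand-ins for real EEG data.

```python
SFREQ = 160  # sampling frequency in Hz, as stated in the text

def n_time_points(duration_s, sfreq=SFREQ):
    """Samples in a window of duration_s seconds when both end points are kept."""
    return int(round(duration_s * sfreq)) + 1

def keep_first(segment, n_keep):
    """Keep the first n_keep time points of every channel; None if too short."""
    if any(len(channel) < n_keep for channel in segment):
        return None
    return [channel[:n_keep] for channel in segment]

full_len = n_time_points(5.0)  # from 1 s before to 4 s after stimulus onset
kept_len = n_time_points(2.0)  # first 2 seconds of each segment

# Hypothetical segments: 64 channels of zeros standing in for real recordings.
full_segment = [[0.0] * full_len for _ in range(64)]
short_segment = [[0.0] * 300 for _ in range(64)]

cropped = keep_first(full_segment, kept_len)    # 64 x 321 matrix
dropped = keep_first(short_segment, kept_len)   # None: fewer than 321 points
```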

For all the experiments, we used the data of subjects 1-87 as the training set and 88-98 as the validation set while pre-training. We evaluated fast adaptability on the data of subjects 99-109.
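For reference, the subject-wise split can be written down explicitly. This is a minimal sketch with hypothetical variable names; it only encodes the subject ranges given above and makes the split sizes visible.

```python
# Subject identifiers are 1-based, as in the dataset.
train_subjects = list(range(1, 88))    # subjects 1-87, used for pre-training
val_subjects = list(range(88, 99))     # subjects 88-98, used for validation
test_subjects = list(range(99, 110))   # subjects 99-109, used for fast adaptation

split_sizes = {
    "train": len(train_subjects),
    "validation": len(val_subjects),
    "adaptation": len(test_subjects),
}
```

The split is by subject, so no individual used for evaluating adaptability contributes data to pre-training or validation.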

IV-B Batch-norm vs layer-norm

In this experiment, we investigated the fast adaptability of a model trained using the transfer-learning strategy with batch-norm vs layer-norm. For each model architecture (i.e., with batch-norm and with layer-norm), we performed binary classification between feet and fist for activity 4 as described in Section IV-A. For the hyperparameter search, we tried learning rates [0.01, 0.001] and batch sizes [16, 32, 64], and we selected the combination with the best validation accuracy. For both architectures, the selected learning rate was 0.001 and the batch size 16. All the other hyperparameters remained fixed, as described in Table III.

IV-C Adaptability across individuals

In this experiment, we investigate the adaptability of models on newer individuals. We compare two training strategies: MAML and transfer learning. For MAML, we considered data for each individual as a different task as in [11] and performed binary classification for each activity separately. For the hyperparameter search, we tried 5 and 10 adapt steps for the inner loop in MAML and 0.01 and 0.001 for learning rates, out of which we selected the one with the best validation accuracy. The best hyperparameters are reported in Table II and Table III. We then fine-tune the pre-trained models on the data of subjects 99-109 separately to evaluate the fast adaptability of the models.

TABLE II: Best pre-training hyperparameters involving MAML for experiments evaluating adaptability across individuals and activities. The values inside brackets denote the activity. For example, 0.001 (A4) denotes that the value 0.001 is for Activity 4.

| Parameter | Value |
| --- | --- |
| Dropout | 0.25 |
| Number of samples per class in support set | 10 |
| Number of samples per class in query set | 11 |
| Number of subjects in a batch per epoch | 4 |
| Loss function | Cross entropy loss |
| Optimizer in inner loop | Gradient Descent |
| Meta-optimizer | Adam |
| Inner loop learning rate | 0.01 (A1, A2, A3), 0.001 (A4) |
| Meta-learning rate | 0.001 (A1), 0.01 (A2, A3, A4) |
| Inner loop adapt steps | 10 (A1, A2), 5 (A3, A4) |

TABLE III: Best pre-training hyperparameters involving transfer learning for experiments evaluating adaptability across individuals and activities. The values inside brackets denote the activity. For example, 0.001 (A4) denotes that the value 0.001 is for Activity 4.

| Parameter | Value |
| --- | --- |
| Dropout | 0.25 |
| Number of samples per class | 21 |
| Loss function | Cross entropy loss |
| Optimizer | Adam |
| Learning rate | 0.001 (A1, A2, A3, A4) |
| Batch size | 32 (A1, A3), 16 (A2, A4) |

IV-D Adaptability across task activities

In this experiment, we investigate the adaptability of EEG decoding models on signals of newer individuals recorded on newer activities. For example, a model trained on data from Activity 1 is adapted to the data from Activity 4. For this experiment, we used the models pre-trained in the experiment described in Section IV-C and fine-tuned them on the data from another activity.

IV-E Evaluation metrics

We evaluated the adaptability of the models by fine-tuning the pre-trained model on the data of subjects 99-109 separately. We did not tune hyperparameters while fine-tuning.

In the case of experiments involving vanilla transfer learning, we considered 10 samples per class during fine-tuning and 11 per class to test the model for each subject. All the model parameters were fine-tuned with a learning rate of 0.001 for 10 iterations optimized using the Adam optimizer. We used a similar paradigm for evaluation in MAML experiments, except that Gradient Descent was used to fine-tune the model parameters. In the case of MAML, the learning rate was the same as the inner loop learning rate during pre-training for experiments evaluating adaptability across individuals and 0.001 for experiments evaluating adaptability across task activities. For this, we iterated the inner loop without iterating over the outer loop in algorithm 1, avoiding any meta-update. The number of samples in the support set was set to 10 per class, and the number of samples in the query set to 11.
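The per-subject split into fine-tuning (support) and test (query) samples can be sketched as follows. This is an illustrative helper under the assumption that each subject provides at least 21 samples per class; the function name, the class labels, and the integer stand-ins for EEG segments are hypothetical.

```python
import random

def split_support_query(samples_by_class, k_support=10, k_query=11, rng=None):
    """Draw disjoint support and query sets with a fixed number of samples per class."""
    rng = rng or random.Random()
    support, query = [], []
    for label, samples in samples_by_class.items():
        if len(samples) < k_support + k_query:
            raise ValueError(f"class {label!r} has too few samples")
        shuffled = list(samples)
        rng.shuffle(shuffled)
        support += [(x, label) for x in shuffled[:k_support]]
        query += [(x, label) for x in shuffled[k_support:k_support + k_query]]
    return support, query

# Hypothetical subject with 21 segments per class (integers stand in for segments).
toy_subject = {"fist": list(range(21)), "feet": list(range(100, 121))}
support, query = split_support_query(toy_subject, rng=random.Random(0))
```

For a binary task this gives 20 fine-tuning samples and 22 test samples per subject, with no sample shared between the two sets.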

Finally, we reported the accuracy on the test samples (query set for MAML) averaged across subjects 99-109. As the amount of data we had was small, we performed the fine-tuning for 100 runs and reported the mean and standard deviation of the accuracy.
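The evaluation protocol, test accuracy recorded at iteration 0 and after each of at most 10 fine-tuning steps and then aggregated over repeated runs, can be sketched as below. This is a toy illustration, not our experimental code: it assumes a one-dimensional logistic classifier with plain gradient descent in place of the CNN, synthetic Gaussian data in place of EEG, and hypothetical names and learning rate throughout.

```python
import math
import random
import statistics

def accuracy(params, data):
    """Fraction of (x, y) pairs classified correctly by the sign of w*x + b."""
    w, b = params
    return sum((1 if w * x + b > 0 else 0) == y for x, y in data) / len(data)

def gd_step(params, data, lr):
    """One full-batch gradient-descent step on the logistic loss."""
    w, b = params
    gw = gb = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        gw += (p - y) * x
        gb += p - y
    n = len(data)
    return [w - lr * gw / n, b - lr * gb / n]

def adaptation_curve(init_params, support, query, lr=0.5, max_steps=10):
    """Query accuracy before adaptation and after each fine-tuning iteration."""
    params = list(init_params)
    curve = [accuracy(params, query)]
    for _ in range(max_steps):
        params = gd_step(params, support, lr)
        curve.append(accuracy(params, query))
    return curve

def sample_set(rng, shift, n_per_class):
    """Synthetic new 'subject': two Gaussian classes shifted away from zero."""
    return [(rng.gauss(shift + (1.0 if y else -1.0), 0.5), y)
            for y in (0, 1) for _ in range(n_per_class)]

rng = random.Random(0)
pretrained = [1.0, 0.0]  # stand-in for pre-trained parameters
curves = []
for _ in range(100):     # 100 fine-tuning runs
    support = sample_set(rng, shift=1.5, n_per_class=10)
    query = sample_set(rng, shift=1.5, n_per_class=11)
    curves.append(adaptation_curve(pretrained, support, query))

initial = [c[0] for c in curves]
final = [c[-1] for c in curves]
mean_initial, mean_final = statistics.mean(initial), statistics.mean(final)
std_final = statistics.stdev(final)
```

Keeping the whole curve, rather than only the final value, is what allows the trend plots of accuracy against fine-tuning iteration to be drawn.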

Figure 1: Performance analysis (in accuracy) of models with batch-norm and with layer-norm. Panels: (a) trend of training accuracy during fine-tuning; (b) trend of testing accuracy during fine-tuning.

Figure 2: Performance analysis (in accuracy) of training strategies when adapted to newer individuals. Panels: (a) trend of test accuracy during fine-tuning on activity 1; (b) on activity 2; (c) on activity 3; (d) on activity 4.

Figure 3: Performance analysis (test accuracy) of training strategies when a model trained on one activity is adapted to other activities. Panels: (a) adapting from Activity 1 to 2; (b) 1 to 3; (c) 1 to 4; (d) 2 to 1; (e) 2 to 3; (f) 2 to 4; (g) 3 to 1; (h) 3 to 2; (i) 3 to 4; (j) 4 to 1; (k) 4 to 2; (l) 4 to 3.

V Results and Discussion

In this paper, we evaluated the fast adaptability of motor movement and imagery EEG decoding models from the perspective of normalization technique and training strategy. In the following subsections, we discuss the results of all the experiments we performed.

V-A Batch-norm vs layer-norm

We compared the performance of models with batch-norm and layer-norm when adapted to the EEG signals of newer individuals. Though batch-norm is widely used with CNN models, we can observe in Figure 1 that the model with batch-norm layer cannot adapt to the signals of newer individuals, whereas the model with layer-norm layer adapts well within 10 iterations. Moreover, the training and testing accuracy degrades as we continue fine-tuning the model with the batch-norm. Such decline can be attributed to the problem of variations in EEG signals across each individual or activity [1, 3]. Hence, it is essential that we calculate data statistics for normalization from the new data and not from the training data. Even though the statistics change during the fine-tuning phase, it is not favorable for fast adaptability, especially when we have fewer samples. For practical applications, we cannot have large samples for quick calibration, and these results show that layer-norm is a better choice for normalization than batch-norm for EEG classification in such scenarios.

V-B Adaptability across individuals

We further compared the performance of models trained with two popular knowledge transfer strategies, MAML and transfer learning, when adapted to signals of newer individuals. Even though MAML was developed for adaptation, our results show that it does not necessarily adapt faster for EEG decoding. In Figure 2, we can observe that as we fine-tune the models on newer individuals, the model trained with transfer learning adapts considerably faster than the model trained with MAML for all the activities mentioned in Section IV-A. Notably, for activity 2, the model trained with transfer learning attains its maximum at iteration 2. Also, the model’s initial parameters (at iteration 0 during fine-tuning) perform better when trained with transfer learning than with MAML. These results are opposite to those reported by Li et al. [11], where MAML was found to perform better. Moreover, our approach attains similar performance without excluding a significant portion of the dataset, as was done in [11]. We also found that when models are adapted to newer individuals for the same set of labels, the model trained with transfer learning performs reasonably well within 5 iterations, as evident in Figure 2.

Finally, in Table IV, we compare the performance of our approach with that reported by other recent state-of-the-art approaches. A few cells in the table are left blank as the authors did not perform the corresponding experiments. None of the other works performed experiments on Activity 1 and Activity 3. For activity 2, our approach performed the best when tested without adaptation. Moreover, after adaptation, our approach performed slightly lower than EEGSym [8], even though the amount of pre-training data for EEGSym [8] is notably higher than ours. For activity 4, our approach performed very close to the approach of Li et al. [11] without filtering any samples. Li et al. [11] reached an accuracy of 79.7 after filtering nearly 80% of samples. In summary, our approach, with a simple model and training strategy, is highly competitive with the existing state-of-the-art models on Physionet’s EEG Motor Movement/Imagery Dataset [22], specifically in a low-data environment.

TABLE IV: Comparison of our performance (test accuracy) with the performance reported by other recent state-of-the-art approaches on Physionet's EEG Motor Movement/Imagery Dataset [22]. Certain cells are blank, as the corresponding authors did not report them.
| Activity | Model | Before adaptation | After adaptation |
|---|---|---|---|
| 1 | Ours | 83.18 ± 0 | 85.88 ± 0.6 |
| 2 | DG-CRAM [4] | 74.71 ± 4.19 | - |
| 2 | EEGSym [8] | - | 88.6 ± 9.0 |
| 2 | Li et al. [11] | - | 80.6 |
| 2 | s-CTrans [17] | 83.31 | - |
| 2 | Ours | 85.91 ± 0 | 86.28 ± 0.63 |
| 3 | Ours | 72.73 ± 0 | 79.81 ± 1.07 |
| 4 | Li et al. [11] | - | 79.7 |
| 4 | Ours | 67.73 ± 0 | 71.22 ± 1.2 |

V-C Adaptability across task activities

Figure 3 compares the models' performance with MAML and transfer learning when the model trained on signals of activity i is fine-tuned to activity j (where i ≠ j). We can observe that in all the cases, the model trained with transfer learning performs better. However, the performance is low when the activities are not similar (the activity set {1, 2} differs from the activity set {3, 4}, as they involve different body movements). Nevertheless, it is interesting that in such cases, models trained with MAML either cannot cross the 50% chance level or remain very near it, as seen in Figure 3(b), Figure 3(c), Figure 3(e), Figure 3(f), Figure 3(g) and Figure 3(h). Only when the models trained on activity 4 are adapted to other activities do the initial parameters of the MAML model provide better performance, well above the chance level (Figure 3(j) and Figure 3(k)). However, as we fine-tune the models for 10 steps, transfer learning adapts quickly and beats MAML. The better adaptability of EEG decoding models trained with transfer learning, as seen in our experiments, indicates that transfer learning remains the better option for cross-individual, cross-activity motor movement and imagery EEG decoding.
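The set of cross-activity settings discussed above can be enumerated with a short sketch. It only lists the ordered (source, target) pairs with i ≠ j and separates them using the grouping stated in the text ({1, 2} versus {3, 4}); the names `ACTIVITIES`, `GROUPS`, and `same_group` are hypothetical, and no mapping from pairs to the panels of Figure 3 is assumed.

```python
from itertools import permutations

ACTIVITIES = (1, 2, 3, 4)
GROUPS = ({1, 2}, {3, 4})  # similar-activity sets, as stated in the text


def same_group(i, j):
    """True if source activity i and target activity j belong to the same set."""
    return any(i in g and j in g for g in GROUPS)


# All ordered (source, target) pairs with i != j
pairs = list(permutations(ACTIVITIES, 2))
similar = [(i, j) for i, j in pairs if same_group(i, j)]
dissimilar = [(i, j) for i, j in pairs if not same_group(i, j)]
```

Under this grouping there are 12 ordered pairs in total, of which 4 transfer between similar activities and 8 between dissimilar ones.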

VI Conclusion

This work provides insights into evaluating models and training strategies for fast adaptability in across-individual and across-activity motor movement and imagery EEG decoding. We also suggest an architectural change to the models for better adaptability. Our proposed approach, i.e., a CNN with layer normalization trained with transfer learning, shows faster adaptation than the recently proposed MAML approach and should therefore be suitable for fast calibration of real-time BCI systems.

References

  • [1] H. Altaheri, G. Muhammad, M. Alsulaiman, S. U. Amin, G. A. Altuwaijri, W. Abdul, M. A. Bencherif, and M. Faisal, “Deep learning techniques for classification of electroencephalogram (eeg) motor imagery (mi) signals: a review,” Neural Computing and Applications, vol. 35, no. 20, pp. 14 681–14 722, Jul 2023. [Online]. Available: https://doi.org/10.1007/s00521-021-06352-5
  • [2] A. Craik, Y. He, and J. L. Contreras-Vidal, “Deep learning for electroencephalogram (eeg) classification tasks: a review,” Journal of Neural Engineering, vol. 16, no. 3, p. 031001, Apr 2019. [Online]. Available: https://dx.doi.org/10.1088/1741-2552/ab0ab5
  • [3] V. Jayaram, M. Alamgir, Y. Altun, B. Schölkopf, and M. Grosse-Wentrup, “Transfer learning in brain-computer interfaces,” IEEE Computational Intelligence Magazine, vol. 11, no. 1, pp. 20–31, 2016.
  • [4] D. Zhang, K. Chen, D. Jian, and L. Yao, “Motor imagery classification via temporal attention cues of graph embedded eeg signals,” IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 9, pp. 2570–2579, 2020.
  • [5] S. U. Amin, M. Alsulaiman, G. Muhammad, M. A. Bencherif, and M. S. Hossain, “Multilevel weighted feature fusion using convolutional neural networks for eeg motor imagery classification,” IEEE Access, vol. 7, pp. 18 940–18 950, 2019.
  • [6] X. Zhu, P. Li, C. Li, D. Yao, R. Zhang, and P. Xu, “Separated channel convolutional neural network to realize the training free motor imagery bci systems,” Biomedical Signal Processing and Control, vol. 49, pp. 396–403, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1746809418303264
  • [7] K. Zhang, N. Robinson, S.-W. Lee, and C. Guan, “Adaptive transfer learning for eeg motor imagery classification with deep convolutional neural network,” Neural Networks, vol. 136, pp. 1–10, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0893608020304305
  • [8] S. Pérez-Velasco, E. Santamaría-Vázquez, V. Martínez-Cagigal, D. Marcos-Martínez, and R. Hornero, “Eegsym: Overcoming inter-subject variability in motor imagery based bcis with deep learning,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 30, pp. 1766–1775, 2022.
  • [9] J. Vanschoren, Meta-Learning.   Cham: Springer International Publishing, 2019, pp. 35–61. [Online]. Available: https://doi.org/10.1007/978-3-030-05318-5_2
  • [10] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML’17.   JMLR.org, 2017, pp. 1126–1135.
  • [11] D. Li, P. Ortega, X. Wei, and A. Faisal, “Model-agnostic meta-learning for eeg motor imagery decoding in brain-computer-interfacing,” in 2021 10th International IEEE/EMBS Conference on Neural Engineering (NER), 2021, pp. 527–530.
  • [12] K. Miyamoto, H. Tanaka, and S. Nakamura, “Meta-learning for emotion prediction from eeg while listening to music,” in Companion Publication of the 2021 International Conference on Multimodal Interaction, ser. ICMI ’21 Companion.   New York, NY, USA: Association for Computing Machinery, 2021, pp. 324–328. [Online]. Available: https://doi.org/10.1145/3461615.3486569
  • [13] N. Banluesombatkul, P. Ouppaphan, P. Leelaarporn, P. Lakhan, B. Chaitusaney, N. Jaimchariyatam, E. Chuangsuwanich, W. Chen, H. Phan, N. Dilokthanakul, and T. Wilaiprasitporn, “Metasleeplearner: A pilot study on fast adaptation of bio-signals-based sleep stage classifier to new individual subject using meta-learning,” IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 6, pp. 1949–1963, 2021.
  • [14] L. Chen, Z. Yu, and J. Yang, “Spd-cnn: A plain cnn-based model using the symmetric positive definite matrices for cross-subject eeg classification with meta-transfer-learning,” Frontiers in Neurorobotics, vol. 16, 2022. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fnbot.2022.958052
  • [15] J. Li, F. Wang, H. Huang, F. Qi, and J. Pan, “A novel semi-supervised meta learning method for subject-transfer brain–computer interface,” Neural Networks, vol. 163, pp. 195–204, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0893608023001740
  • [16] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance, “Eegnet: a compact convolutional neural network for eeg-based brain–computer interfaces,” Journal of Neural Engineering, vol. 15, no. 5, p. 056013, Jul 2018. [Online]. Available: https://dx.doi.org/10.1088/1741-2552/aace8c
  • [17] J. Xie, J. Zhang, J. Sun, Z. Ma, L. Qin, G. Li, H. Zhou, and Y. Zhan, “A transformer-based approach combining deep learning network and spatial-temporal information for raw eeg classification,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 30, pp. 2126–2136, 2022.
  • [18] D. Kostas, S. Aroca-Ouellette, and F. Rudzicz, “Bendr: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of eeg data,” Frontiers in Human Neuroscience, vol. 15, 2021. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fnhum.2021.653659
  • [19] H.-Y. S. Chien, H. Goh, C. M. Sandino, and J. Y. Cheng, “Maeeg: Masked auto-encoder for eeg representation learning,” in NeurIPS Workshop, 2022. [Online]. Available: https://arxiv.longhoe.net/abs/2211.02625
  • [20] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32.   Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  • [21] L. Long, “Maml-pytorch implementation,” https://github.com/dragen1860/MAML-Pytorch, 2018.
  • [22] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000.
  • [23] A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, R. Goj, M. Jas, T. Brooks, L. Parkkonen, and M. S. Hämäläinen, “MEG and EEG data analysis with MNE-Python,” Frontiers in Neuroscience, vol. 7, no. 267, pp. 1–13, 2013.
  • [24] J. L. Ulloa, “The control of movements via motor gamma oscillations,” Frontiers in Human Neuroscience, vol. 15, 2022. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fnhum.2021.787157