Adaptive Modality Balanced Online Knowledge Distillation for Brain-Eye-Computer based
Dim Object Detection

Zixing Li, Chao Yan, Zhen Lan, Dengqing Tang, Xiaojia Xiang, Han Zhou, Jun Lai

Zixing Li, Zhen Lan, Dengqing Tang, Xiaojia Xiang, Han Zhou, and Jun Lai are affiliated with the College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China (e-mail: {lizixing16, lanzhen19, xiangxiaojia, tangdengqing09, zhouhan, laijun}@nudt.edu.cn). Chao Yan is affiliated with the College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China (e-mail: {[email protected]). Zixing Li and Chao Yan contributed equally to this work. (Corresponding author: Dengqing Tang.)

Abstract

Advanced cognition can be extracted from the human brain using brain-computer interfaces. Integrating these interfaces with computer vision techniques, which possess efficient feature extraction capabilities, can achieve more robust and accurate detection of dim targets in aerial images. However, existing target detection methods primarily concentrate on homogeneous data, lacking efficient and versatile processing capabilities for heterogeneous multimodal data. In this paper, we first build a brain-eye-computer based object detection system for aerial images under few-shot conditions. This system detects suspicious targets using region proposal networks, evokes the event-related potential (ERP) signal in electroencephalogram (EEG) through the eye-tracking-based slow serial visual presentation (ESSVP) paradigm, and constructs the EEG-image data pairs with eye movement data. Then, an adaptive modality balanced online knowledge distillation (AMBOKD) method is proposed to recognize dim objects with the EEG-image data. AMBOKD fuses EEG and image features using a multi-head attention module, establishing a new modality with comprehensive features. To enhance the performance and robust capability of the fusion modality, simultaneous training and mutual learning between modalities are enabled by end-to-end online knowledge distillation. During the learning process, an adaptive modality balancing module is proposed to ensure multimodal equilibrium by dynamically adjusting the weights of the importance and the training gradients across various modalities. The effectiveness and superiority of our method are demonstrated by comparing it with existing state-of-the-art methods. Additionally, experiments conducted on public datasets and system validations in real-world scenarios demonstrate the reliability and practicality of the proposed system and the designed method. The dataset and the source code can be found at: https://github.com/lizixing23/AMBOKD.

Index Terms:

object detection, EEG, multimodal learning, online knowledge distillation, adaptive modality balancing

I Introduction

In recent years, considerable progress has been made in the field of computer vision, primarily due to rapid advancements in deep learning [1, 2]. Remarkable results in different visual tasks, such as object recognition, image generation, and visual localization, have been achieved by combining deep learning with high-performance computers, well-designed neural networks, and large datasets [3]. However, accurately and robustly detecting dim objects in aerial images remains a difficult task because of factors such as cluttered backgrounds, varying viewing angles, and small object scales [4]. Moreover, the performance of dim object detectors is further restricted by the limited number of training samples of sensitive objects in aerial images.

Figure 1: Data extraction and processing procedure based on the adaptive modality balanced online knowledge distillation (AMBOKD) method.

A promising avenue to address the above limitations is offered by brain-computer interfaces (BCIs). This technique is able to decode brain activity in response to events, providing insights into human cognitive processes [5]. Leveraging the prior knowledge and advanced cognitive abilities of humans, BCIs can compensate for the shortcomings of computer vision systems and achieve more robust and accurate recognition. Existing studies have demonstrated the potential of using event-related potential (ERP) detection for target recognition tasks [6, 7, 8]. By detecting and interpreting ERPs elicited in electroencephalogram (EEG) signals, these works have shown effectiveness in target recognition tasks across different scenarios [6]. However, due to the low signal-to-noise ratio of EEG signals, the accuracy of EEG-based target recognition approaches still needs to be improved.

To overcome the inherent shortcomings of computer vision-based approaches and BCIs, a straightforward idea is to combine these two approaches. Recently, multimodal learning has emerged as a powerful tool to incorporate the advantages of the computer vision and EEG modalities by integrating diverse data and essential information [9]. Researchers have investigated multimodal learning through diverse strategies using various structures, such as early fusion [10] at the feature level and late fusion [11] employing logistic regression. However, most current studies mainly focus on feature fusion or loss function design without thoroughly considering the relationships between modalities, which impedes mutual learning and synergistic effects among modalities [12]. Thus, extracting critical information from multimodal data (i.e., visual and EEG data) and improving recognition performance remains an ongoing challenge.

Knowledge distillation (KD), a widely used technique for knowledge transfer, offers a solution to the aforementioned challenge. KD provides a paradigm in which the student model can learn more information from the teacher model, rather than relying solely on the true labels. This kind of method resembles human learning and has given rise to various algorithms in the field of multimodal learning, such as cross-modal [13] and multi-teacher [14] KD, which improve performance and generalization while reducing the number of model parameters [15]. To simplify the distillation procedure and circumvent the disadvantages of conventional KD in terms of training cost and pipeline complexity, online knowledge distillation (OKD) has been proposed to facilitate mutual learning between models in an end-to-end training process [16, 17]. Nevertheless, most current OKD approaches depend on a single modality and are restricted to homogeneous networks. These approaches fail to address the imbalance arising from modal differences during online learning, and they lack the ability to extract and integrate multimodal information.

In this study, a brain-eye-computer based object detection system is built for detecting dim objects in aerial images under few-shot conditions, leveraging the strengths of humans in rapid cognition and of computer vision in data processing. This system uses region proposal networks to detect suspicious object regions, evokes the ERP signal with the eye-tracking-based slow serial visual presentation (ESSVP) paradigm, constructs EEG-image data pairs with eye movement data, and recognizes them. In particular, an adaptive modality balanced online knowledge distillation (AMBOKD) method is proposed to process the EEG-image data and recognize the target image. As shown in Fig. 1, the experimental procedure begins with the acquisition of EEG and image data, which are input into their respective encoders. The outputs of these encoders are fused by the fusion module, and the fused output is treated as a third modality. Subsequently, the visual, EEG, and fusion modalities exchange roles between teacher and student models. For example, when the visual modality acts as the student model, the EEG and fusion modalities act as the teacher models. This setup facilitates mutual learning and collective optimization using the OKD method. During mutual learning, the adaptive modality balancing (AMB) module facilitates parameter optimization for all modalities by dynamically balancing the influence weight and the training optimization level of each modality. Consequently, the proposed approach is well suited for multimodal heterogeneous tasks, enabling the fusion of 2D EEG signals and 3D visual image data and leading to robust and efficient performance in dim object recognition.

To the best of our knowledge, this study is the pioneering effort in addressing the challenges of dim object detection in aerial images using heterogeneous multimodal data. The main contributions of this study can be summarized as follows:

• A brain-eye-computer based object detection system is established to obtain the subject’s attention region image and EEG data. This system is able to fully utilize the advantages of multiple modalities and to detect dim targets in aerial images under few-shot conditions.

• An AMBOKD method is proposed to fuse the multimodal data and enable simultaneous end-to-end mutual learning for target recognition. Within this method, an AMB module is introduced to adaptively balance the influence weights and training gradients of each modality, so as to ensure comprehensive parameter optimization.

• A multimodal ESSVP dataset is built with RGB images of size 224×224 and 59-channel EEG data. This dataset contains more than 13,000 paired samples of EEG and images, presenting promising prospects for the further development of multimodal learning methods in target detection tasks.

• The superiority, robustness, and generalizability of our method are demonstrated by comparing it with state-of-the-art methods in a series of experiments. Our method not only improves the performance of the fusion model but also enhances the capabilities of both the EEG and visual models, highlighting its potential in the field of multimodal learning.

This paper is structured as follows: Section II discusses the related work. Section III presents the design of the system and the ESSVP dataset in detail. Section IV describes the proposed target recognition method. Section V presents the experimental results and analyzes the performance. Section VI draws conclusions and outlines possible directions for future work.

II Related Work

II-A Brain-Computer Based Object Recognition

Computer-based target recognition approaches have been extensively developed owing to the rapid progress of deep learning. The well-known neural network algorithms commonly employed in this domain include ResNet [18], MobileNetV2 [19], VGG [20], DenseNet [21], and EfficientNet [22]. Notably, ResNet [18], which learns residual mappings, enables the network to effectively capture and represent complex features, leading to high-accuracy classification performance. EfficientNet [22] focuses on achieving high performance with computational efficiency through a meticulously designed network architecture and scaling method. However, the existing algorithms still suffer from false alarms and missed detections. These algorithms are highly dependent on the dataset because they are influenced by factors such as the target environment, training data, and noise.

In recent years, EEG-based object recognition approaches have attracted considerable attention because of their ability to extract human cognition [23, 24]. These approaches depend on the analysis of ERP signals generated when subjects identify a target, providing enhanced robustness in intricate environments and minimizing the need for large amounts of image data [25, 7, 26]. The rapid serial visual presentation (RSVP) paradigm evokes ERP signals for target recognition by presenting visual image sequences as stimuli. For instance, Bigdely-Shamlo et al. [27] conducted an RSVP experiment in which subjects were instructed to identify images containing airplanes at a frame rate of 12 Hz. They achieved high accuracies on 128-channel EEG data using independent component analysis (ICA) and reported promising results for single-trial classification. Lan et al. [7] proposed a multi-attention convolutional recurrent model for ERP detection, facilitating the identification of target images within RSVP sequences [28]. In addition, Fan et al. [26] introduced an asynchronous visual evoked paradigm (AVEP) based on RSVP and proposed a deep learning method for detecting dim objects in satellite images. However, EEG signals vary across individual subjects and are prone to noise interference during signal extraction, so further improvements are needed.

To harness the benefits of EEG and computer vision approaches, the fusion of EEG and image features has been investigated to accomplish efficient and robust object recognition. Barngrover et al. [29] developed a specific brain-computer interface (BCI) system that addresses the challenge of mine target recognition in side-scan sonar images by fusing image and EEG signal features. Huang et al. [30] proposed a Bayesian HV-CV retrieval framework (BHCR) that combines human and computer vision using Bayesian approaches. The RSVP experimental paradigm was used to recognize targets, leading to effective retrieval from image databases. Minor et al. [31] introduced a multimodal neural network that combines EEG and image features. By fusing these modalities, they achieved high classification results in object recognition tasks. These studies indicate the potential of integrating EEG and visual modalities to enhance the efficiency and robustness of recognition systems.

In line with this, our study aims to achieve dim object detection in aerial images by constructing a brain-eye-computer based object detection system with the ESSVP paradigm, which is built upon the RSVP and AVEP paradigms. This system detects suspicious object regions with region proposal networks and employs eye-tracking technology to simultaneously extract the subject's attention region from the image and the corresponding EEG signal fragment. These two data modalities then serve as inputs to the proposed multimodal fusion approach.

II-B Multimodal Learning

Modeling and analyzing data from various sensory modalities, including images, speech, text, and EEG signals, are key aspects of multimodal learning [32, 33, 34]. Significant attention has been garnered by multimodal learning in the fields of artificial intelligence and machine learning in recent years, resulting in remarkable progress [9, 35]. Currently, different multimodal applications, such as human-computer interaction [33, 36], natural language translation [37, 38], and computer vision [39, 40, 34], are being applied in our lives.

Due to the development of deep learning, researchers propose novel loss functions or training approaches to optimize multimodal models. For instance, Nagrani et al. [41] introduced a transformer-based architecture that employs fusion bottlenecks for modality fusion at multiple layers. This method enables each modality to capture crucial information while efficiently sharing necessary information, leading to enhanced fusion performance and reduced computational cost. Du et al. [32] modeled the relationships between brain, visual, and linguistic features using multimodal deep generative models. They maximized both intra-modality and inter-modality mutual information regularization terms. Their approach addresses limitations such as the under-exploitation of multimodal knowledge and the scarcity of training data. Yang et al. [40] proposed a multimodal fusion approach for remote sensing image-audio retrieval tasks. They converted audio inputs to text and fused them with the text information to obtain a fusion representation. They optimized the common retrieval space using triplet loss, semantic loss, and consistency loss. Experimental findings on multiple datasets indicate the effectiveness of their approach.

Most of the current methods in this field primarily focus on feature fusion and loss function design when developing multimodal fusion algorithms, thereby overlooking the potential benefits of knowledge transfer and mutual learning between different modalities. In contrast, we propose a dynamic hybrid fusion approach that integrates multimodal features and dynamically trains all branches in the model using an OKD method. This mutual learning approach for multimodal fusion aims to fully leverage the benefits of the visual and cognitive domains and extract crucial information effectively.

II-C Knowledge Distillation

KD has attracted considerable attention because of its capacity to compress models. This technique involves training a small student model by imitating the output distribution of a large teacher model [15]. KD has been extensively employed in different fields, including computer vision [42, 43], natural language processing [44, 45], and emotion recognition [36]. These approaches can be classified into offline distillation and online distillation according to the distillation scheme [46]. Offline distillation typically follows a two-stage training process in which a trained teacher model guides the training of a student model. However, this approach is expensive, and knowledge transfer is unidirectional. In contrast, online approaches enable end-to-end training, allowing teacher and student models to learn simultaneously while reducing training cost [47].
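For reference, the teacher-student objective sketched in this paragraph is commonly written in a temperature-scaled form. The block below is a generic illustration and not necessarily the loss used later in this paper; the symbols $z_{s}$, $z_{t}$, $T$, and $\alpha$ are notation assumed here.

```latex
\mathcal{L}_{\mathrm{student}}
  = (1-\alpha)\,\mathrm{CE}\!\left(y,\ \sigma(z_{s})\right)
  + \alpha\,T^{2}\,\mathrm{KL}\!\left(\sigma(z_{t}/T)\ \middle\|\ \sigma(z_{s}/T)\right)
```

Here $\sigma$ denotes the softmax, $z_{s}$ and $z_{t}$ are the student and teacher logits, $T$ is a temperature that softens both distributions, and $\alpha$ trades off the label term against the teacher term. In the online setting, $z_{t}$ comes from a peer that is trained at the same time rather than from a frozen, pretrained teacher.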

In the field of multimodal learning, attention has been drawn to offline knowledge distillation due to its knowledge transfer capabilities. For instance, Zhang et al. [48] proposed a visual-to-EEG cross-modal KD method that enhances continuous EEG prediction using dark knowledge from visual modality. Multi-teacher KD methods, including AvgMKD [49], CA-MKD [50], and EMKD [51], have been proposed to combine knowledge from various modalities and train the student model.

Current online knowledge distillation approaches have achieved significant progress with multimodal homogeneous data. For instance, Zhang et al. [47] introduced DML, demonstrating that student models can learn from each other through their predictions in deep mutual learning. Guo et al. [52] presented an OKD approach using collaborative learning with a weighted ensemble logit distribution. Li et al. [53] proposed an innovative embedded OKD approach that surpasses existing OKD approaches in image classification tasks. This approach leverages ensemble information, overall feature representations from peer networks, and logits to fully exploit the potential of networks.

However, there remains a research gap in online knowledge distillation methods for multimodal heterogeneous data, owing to the challenges in extracting and fusing heterogeneous data features. Thus, we propose the AMBOKD method, which extracts features from both 2D EEG data and 3D image data and treats the features output by the fusion model as a separate modality for mutual learning. This method adaptively adjusts the influence weight and the optimization level of each modality, leading to full optimization of the parameters and superior performance of the fusion model, thereby opening up new application scenarios for multimodal OKD.

III System Design and Data Collection

III-A System Design

To address the problem of dim object detection in aerial images under few-shot conditions, we propose a brain-eye-computer based object detection system. As shown in Fig. 2, the system is designed according to EEG, vision, and eye movement properties. First, the computer processes the image data through the feature extractor and the region proposal network (RPN) to obtain the image with pre-detection boxes (see Section III-B). Subsequently, this image is presented on the display screen through the eye-tracking-based slow serial visual presentation (ESSVP) paradigm (see Section III-C), which in turn induces the subject to generate the corresponding EEG signals. Using the real-time eye movement data recorded by the eye tracker and the time-stamp synchronization signal of the trigger box, the computer simultaneously extracts the EEG data recorded by the EEG cap and the image of the subject’s attention area at the corresponding time. After that, the paired data are preprocessed and recognized by the proposed AMBOKD method (see Section IV). Finally, according to the eye movement data, the position of the target image in the original image is found for target localization, so as to realize the target detection task.

The hardware of the system primarily serves the roles of sensor data acquisition and algorithm execution, comprising a 30-inch display with a 2K resolution, a computer equipped with an NVIDIA GeForce GTX 2070s GPU, a 64-channel EEG cap, a signal amplifier, a trigger box, a router, an eye tracker, and various data cables. The display and computer are key components, used for the playback of the ESSVP paradigm and the real-time storage and processing of EEG, eye movement, and image data. The display is positioned 50 $\sim$ 70 cm directly in front of the subjects to ensure maximal elicitation of ERP signals and more accurate collection of eye movement data. Positioned directly below the display at a 15-degree angle towards the user’s eyes, the eye tracker captures gaze positions and transmits them to the computer at a frequency of 50 Hz. EEG data are collected via 64 wet electrodes in the EEG cap, with the signals amplified by a connected amplifier before being transmitted to the computer. The trigger box synchronizes various types of event data (e.g., auditory, visual, and program outputs) with neurophysiological data with high event precision ( $<1$ ms), serving as a cornerstone for subsequent data analysis by synchronizing EEG data with event-triggered label signals.

III-B Suspicious Region Detection

The RPN is an integral component in deep learning and computer vision, particularly in object detection tasks [54]. Operating on a feature map, the RPN slides a window across locations and, at each location, evaluates anchor boxes of various scales and aspect ratios. The network estimates the likelihood that each anchor encloses an object and adjusts the anchor's position and dimensions accordingly, generating precise region proposals.

As illustrated in Fig. 2, the designed system uses a pretrained ResNet50 feature encoder to obtain feature maps of image data, and employs RPN to generate region proposals, which are denoted with black boxes. The anchors used have scales of 16, 32, 48, 64, and 80, with aspect ratios of 0.5, 0.75, 1.0, 1.5, and 2.0. We employ UAVs to capture images of toy models in open outdoor settings, forming the training set. Meanwhile, the test set comprises images taken in complex real-world environments. The effectiveness of the employed feature extraction method and the RPN network is validated by the results, as depicted in Fig. 3. It is evident that, despite variations in scene environments and targets, the RPN network is capable of precisely delineating suspicious objects, thereby effectively supporting the continuation of subsequent experiments.
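The five scales and five aspect ratios listed above give 25 anchor shapes per feature-map location. The following minimal sketch enumerates them; it assumes that the aspect ratio means height divided by width and that each anchor keeps an area of scale squared, a common convention that the text does not state. The function names are hypothetical.

```python
from itertools import product
from math import sqrt

# Values quoted in the text.
SCALES = (16, 32, 48, 64, 80)
ASPECT_RATIOS = (0.5, 0.75, 1.0, 1.5, 2.0)


def make_anchor_shapes(scales=SCALES, ratios=ASPECT_RATIOS):
    """Return (width, height) for every scale/ratio combination.

    Assumption: ratio = height / width and the anchor area equals scale**2.
    """
    shapes = []
    for scale, ratio in product(scales, ratios):
        width = scale / sqrt(ratio)
        height = scale * sqrt(ratio)
        shapes.append((width, height))
    return shapes


def anchors_at(cx, cy, shapes):
    """Center every anchor shape on (cx, cy); boxes are (x1, y1, x2, y2)."""
    return [(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2) for w, h in shapes]


shapes = make_anchor_shapes()
boxes = anchors_at(112.0, 112.0, shapes)
print(len(shapes))  # 25 anchors per feature-map location
```

An RPN would score and regress each of these boxes at every sliding-window position; the scoring head itself is omitted here.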

III-C ESSVP Paradigm

To elicit ERPs more efficiently and obtain the EEG-image data, we propose the ESSVP paradigm according to the RSVP [28] and the AVEP [26] experimental paradigm. In this paradigm, visual stimuli are presented at a slow and controlled pace, providing the subjects sufficient time to search for specific targets. By integrating eye movement technology, we can accurately track the observer’s gaze and determine the specific areas of interest during the dim target recognition task.

The ESSVP paradigm firstly presents the experiment guidance for 1 min, and then the target example is shown to the subject for 3 $\sim$ 4 s, as illustrated in Fig. 2. In this formal experiment, participants view 16 sequences of stimuli, each containing 50 images. These sequences are evenly divided into two sessions according to the region marking method. In each session, the first six sequences display images of toy models such as armored vehicles and airplanes, taken in simple scenes, with each image shown for 3 s. The last two sequences in each session present real images of real models, taken in complex scenes, with each image displayed for 4 s. The target in the first six sequences is the armored model, whereas in the last two sequences, the target is the vehicle. The different types of images used in the ESSVP paradigm are illustrated in Fig. 4.

The probability of the target image being included in each sequence is 40%. Between consecutive target images, 1 $\sim$ 2 nontarget images are randomly inserted to avoid the attentional blink caused by successive targets. During the stimulus presentation, we instruct subjects to actively search for the target among the candidate regions in the image, which are generated by the region proposal network in advance. Each candidate region must be fixated for 0.5 s. To prevent excessive visual fatigue due to prolonged task engagement, subjects are allowed to rest after each sequence, and the duration of the rest is determined by the subjects themselves.
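One way to generate a stimulus order that satisfies these constraints is sketched below. It is illustrative only and rests on assumptions not stated in the text: the 40% figure is read as exactly 20 targets in a 50-image sequence, and any nontarget images left over after filling the gaps are placed before the first target or after the last one. The function name `make_essvp_sequence` is hypothetical.

```python
import random


def make_essvp_sequence(n_images=50, target_ratio=0.4, seed=None):
    """Return a list of 'T' (target) and 'N' (nontarget) labels.

    Assumptions: the target count is round(n_images * target_ratio), gaps
    between consecutive targets hold 1 or 2 nontargets, and spare nontargets
    go to the start or end of the sequence.
    """
    rng = random.Random(seed)
    n_targets = round(n_images * target_ratio)
    n_nontargets = n_images - n_targets
    n_gaps = max(n_targets - 1, 0)
    if n_gaps > n_nontargets:
        raise ValueError("not enough nontarget images to separate the targets")

    # Draw gap sizes until they fit inside the nontarget budget.
    while True:
        gaps = [rng.choice((1, 2)) for _ in range(n_gaps)]
        if sum(gaps) <= n_nontargets:
            break

    spare = n_nontargets - sum(gaps)
    lead = rng.randint(0, spare)
    trail = spare - lead

    sequence = ["N"] * lead
    for i in range(n_targets):
        sequence.append("T")
        if i < n_gaps:
            sequence.extend(["N"] * gaps[i])
    sequence.extend(["N"] * trail)
    return sequence


seq = make_essvp_sequence(seed=0)
print("".join(seq))
```

Because every gap holds at least one nontarget, two targets never appear back to back, which is the property the paradigm relies on to avoid the attentional blink.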

The eye tracker continuously records the eye movement data of the subjects throughout the experiment. Specifically, the candidate box and image sequence data are recorded if the subject fixates on a candidate area for more than 0.3 s. In addition, to record the EEG signals of subjects during the corresponding time, the fixation event is sent to the EEG signal acquisition system. Finally, the EEG-image data for the dim object task are generated by collecting the EEG signals captured during the fixation periods and the image data from the subject’s attention area.
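The dwell rule above can be sketched as a run-length check over the gaze stream. This is a minimal illustration, assuming gaze samples arrive as (x, y) screen coordinates at the 50 Hz rate given in Section III-A and that candidate boxes are axis-aligned (x1, y1, x2, y2) rectangles; the function names are hypothetical and the actual system may detect fixations differently.

```python
GAZE_RATE_HZ = 50    # eye-tracker rate quoted in Section III-A
MIN_DWELL_S = 0.3    # dwell threshold quoted in the text


def box_index(point, boxes):
    """Return the index of the first box containing the point, else None."""
    x, y = point
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if x1 <= x <= x2 and y1 <= y <= y2:
            return i
    return None


def detect_fixations(gaze_points, boxes, rate_hz=GAZE_RATE_HZ,
                     min_dwell_s=MIN_DWELL_S):
    """Return (box_index, onset_sample) for each dwell longer than min_dwell_s.

    Each uninterrupted stay inside one box is reported at most once.
    """
    events = []
    current, run_start, reported = None, 0, False
    for n, point in enumerate(gaze_points):
        hit = box_index(point, boxes)
        if hit != current:
            # Gaze moved to another box (or off all boxes): restart the run.
            current, run_start, reported = hit, n, False
        if current is not None and not reported:
            dwell_s = (n - run_start + 1) / rate_hz
            if dwell_s > min_dwell_s:
                events.append((current, run_start))
                reported = True
    return events


boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
gaze = [(5, 5)] * 16 + [(15, 15)] * 10 + [(25, 25)] * 15
print(detect_fixations(gaze, boxes))
```

In this example the first box is held for 16 samples (0.32 s) and produces an event, while the second is held for exactly 15 samples (0.30 s) and does not. The onset sample is what would be forwarded to the EEG acquisition system as the fixation event.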

The experimental environment for ESSVP is illustrated in Fig. 5. Specifically, the EEG signals are sampled at a rate of 1000 Hz and collected through 64 wet electrodes on the EEG cap, placed according to the 10-20 standard [55]. The impedance of these electrodes is maintained below 10 k$\Omega$. Throughout the experiment, subjects are seated comfortably in a quiet environment, positioned approximately 60 cm away from the display screen and facing the displayed image. To ensure signal quality, we instruct participants to remain still and minimize body and head movements while observing the visual presentation paradigm. This directive aims to minimize potential noise interference in the EEG signals.

III-D Data Collection and Preprocessing

The ESSVP dataset used in this study includes EEG data from 10 subjects, all of whom are college students aged between 22 and 26, with normal vision and no history of mental illness. All subjects were provided with a clear understanding of the experimental procedure, task requirements, and brief information about the target characteristics before the experiment. In addition, the subjects signed an informed consent form. Ethical approval for the experiment was obtained from the relevant committee.

This dataset consists of 13405 samples, including 3880 positive samples with the target image and 9525 negative samples without the target image. Each sample includes EEG data from 59 electrode channels, as 5 of the 64 electrodes are redundant and therefore excluded. The EEG data are filtered using a 2 $\sim$ 30 Hz bandpass filter and downsampled from the original 1000 Hz to 250 Hz. The data are then segmented into 1.2 s (-500 $\sim$ 700 ms) samples based on the trigger. Baseline correction is applied by subtracting the mean activity of the first 200 ms of each segment. The image data have a minimum size of $150\times 150$ pixels and are centered on the fixated candidate box. All images are resized to $224\times 224$ pixels to standardize the input for the image network. The images in the validation set are corrupted with noise to simulate real-world conditions and evaluate the algorithm’s robustness, as shown in Fig. 6. Specifically, one-third of the images are augmented with 0.2 Gaussian noise and another third are augmented with 0.2 salt and pepper noise.
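The validation-set corruption can be sketched as follows. The text does not say what the value 0.2 parameterizes, so this example assumes images scaled to [0, 1], treats 0.2 as the standard deviation of the Gaussian noise, and treats 0.2 as the fraction of pixels hit by salt and pepper noise. The random choice of which images fall in each third is also an assumption, and all function names are hypothetical.

```python
import numpy as np


def add_gaussian_noise(img, sigma=0.2, rng=None):
    """Add zero-mean Gaussian noise (assumed std = sigma) and clip to [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = img + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)


def add_salt_pepper_noise(img, amount=0.2, rng=None):
    """Set an assumed fraction `amount` of pixels to 0 (pepper) or 1 (salt)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = img.copy()
    u = rng.random(img.shape[1:])        # one draw per pixel, shared by channels
    noisy[:, u < amount / 2] = 0.0       # pepper
    noisy[:, u > 1 - amount / 2] = 1.0   # salt
    return noisy


def corrupt_validation_set(images, seed=0):
    """images: (N, 3, H, W) array in [0, 1]; returns a corrupted copy.

    One third gets Gaussian noise, one third salt and pepper noise, and the
    remainder is left clean.
    """
    rng = np.random.default_rng(seed)
    out = images.copy()
    order = rng.permutation(len(images))
    third = len(images) // 3
    for i in order[:third]:
        out[i] = add_gaussian_noise(images[i], rng=rng)
    for i in order[third:2 * third]:
        out[i] = add_salt_pepper_noise(images[i], rng=rng)
    return out


images = np.full((6, 3, 16, 16), 0.5)
noisy = corrupt_validation_set(images)
print(noisy.shape, float(noisy.min()), float(noisy.max()))
```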

The EEG data are represented as a set of samples $\left\{\left(M_{e}^{i}\in\mathbb{R}^{59\times 300}\right)\mid i=1,2,\ldots,N\right\}$ after the preprocessing steps. Each sample is a matrix with dimensions 59 (number of electrode channels) by 300 (number of time points), where the 300 time points correspond to a 1.2 s segment at 250 Hz that has undergone bandpass filtering and baseline correction.
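The shape $59\times 300$ follows directly from the preprocessing steps: 1.2 s at 250 Hz gives 300 time points. The sketch below reproduces the pipeline with NumPy only. The FFT-based brick-wall filter is a stand-in chosen for brevity, since the filter design actually used is not specified, and the function names are hypothetical.

```python
import numpy as np

FS_RAW = 1000            # Hz, acquisition rate
FS_OUT = 250             # Hz, rate after downsampling
BAND = (2.0, 30.0)       # Hz, passband
EPOCH_MS = (-500, 700)   # window relative to the trigger
BASELINE_MS = 200        # leading interval used for baseline correction


def fft_bandpass(x, fs, low, high):
    """Zero-phase brick-wall band-pass along the last axis (stand-in filter)."""
    spectrum = np.fft.rfft(x, axis=-1)
    freqs = np.fft.rfftfreq(x.shape[-1], d=1.0 / fs)
    spectrum[..., (freqs < low) | (freqs > high)] = 0.0
    return np.fft.irfft(spectrum, n=x.shape[-1], axis=-1)


def extract_epoch(raw, trigger_idx):
    """raw: (channels, samples) at FS_RAW. Returns a (channels, 300) epoch."""
    filtered = fft_bandpass(raw, FS_RAW, *BAND)
    start = trigger_idx + EPOCH_MS[0] * FS_RAW // 1000
    stop = trigger_idx + EPOCH_MS[1] * FS_RAW // 1000
    if start < 0 or stop > raw.shape[-1]:
        raise ValueError("epoch window falls outside the recording")
    # The 30 Hz cutoff is far below the new Nyquist frequency (125 Hz),
    # so keeping every 4th sample does not alias.
    step = FS_RAW // FS_OUT
    epoch = filtered[:, start:stop:step]
    n_base = BASELINE_MS * FS_OUT // 1000
    baseline = epoch[:, :n_base].mean(axis=1, keepdims=True)
    return epoch - baseline


rng = np.random.default_rng(0)
raw = rng.standard_normal((59, 5000))    # synthetic 5 s recording
epoch = extract_epoch(raw, trigger_idx=2000)
print(epoch.shape)  # (59, 300)
```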

The image data are represented as a set of samples $\left\{\left(M_{v}^{i}\in\mathbb{R}^{3\times 224\times 224}\right)\mid i=1,2,\ldots,N\right\}$ . Each sample comprises a tensor with dimensions 3 (number of color channels) by 224 (height) by 224 (width), corresponding to the size of the input images in the dataset.

Both the labeled EEG and image modality samples were collected as paired samples, forming sample pairs $X=\left\{\left(M_{v}^{i},M_{e}^{i},y_{i}\right)\mid i=1,2,\ldots,N\right\}$ in the constructed ESSVP dataset, where $N$ is the total number of paired data, and $y_{i}$ is the corresponding ground truth label. These sample pairs serve as the foundation for the analysis and assessment of the proposed approach for dim target recognition in aerial images.

IV Methodology

The framework of the adaptive modality balanced online knowledge distillation (AMBOKD) method is illustrated in Fig. 7. Initially, AMBOKD receives the paired data $X$ and extracts visual and EEG features through the visual encoder and the EEG encoder. These features are subsequently passed through the multi-head self-attention module to generate the fusion features, and the features from each domain are sent to their respective classifiers. The features extracted from the visual, EEG, and fusion domains are classified and used to compute the losses of the three modalities with the cross-entropy (CE) and KD loss functions. Subsequently, the three modalities take turns as students and teachers to compute the total loss, the weight of the teacher influence, and the dynamic student gradient for parameter updating. The processes of feature extraction, feature fusion, online knowledge distillation, and adaptive modality balancing are detailed as follows.
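The role rotation described above can be illustrated with a small NumPy sketch in which each modality in turn acts as the student and the other two act as teachers. It is not the AMBOKD loss: the teachers are weighted equally as a placeholder for the adaptive weights, the temperature and the `kd_weight` factor are assumptions, and all function names are hypothetical.

```python
import numpy as np


def softmax(z, T=1.0):
    """Numerically stable softmax over the last axis with temperature T."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean cross-entropy between logits (batch, classes) and integer labels."""
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())


def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return float(T * T * kl.mean())


def rotate_roles(logits, labels, kd_weight=1.0):
    """logits: dict mapping modality name -> (batch, classes) array.

    Returns one total loss per modality, computed with that modality as the
    student and every other modality as a teacher.
    """
    losses = {}
    for student, z_s in logits.items():
        teachers = [m for m in logits if m != student]
        # Placeholder: equal teacher weights. The AMB module adapts these
        # weights during training; that rule is not reproduced here.
        distill = sum(kd_loss(z_s, logits[t]) for t in teachers) / len(teachers)
        losses[student] = cross_entropy(z_s, labels) + kd_weight * distill
    return losses


rng = np.random.default_rng(0)
labels = np.array([0, 1, 1, 0])
logits = {name: rng.standard_normal((4, 2)) for name in ("visual", "eeg", "fusion")}
print(rotate_roles(logits, labels))
```

In a training framework, the teacher logits would typically be excluded from the student's gradient computation so that each branch is pulled toward its peers without the peers being pulled back within the same loss term.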

IV-A Extraction of The Modality Representation

The EfficientNet architecture [22] is employed as the visual encoder and MCGRAM [8] as the EEG encoder to extract the specific information of the visual images and EEG signals in dim target recognition. The focus of this initial extraction step is to capture the pertinent information from each modality effectively. The proposed approach maximizes the extraction of valuable information from visual images and EEG signals by designing dedicated networks for each modality. In addition, the logits obtained from the EEG and visual domains denote the independent recognition capability of the uni-encoder model.

IV-A1 Visual Encoder

EfficientNet [22], a well-designed convolutional neural network, has demonstrated outstanding advancements in accuracy and computational efficiency. This achievement is attributed to the effective balance of network depth, width, and resolution, resulting in enhanced recognition performance. In this study, EfficientNet-B0 was used because of its high efficiency, achieving comparable accuracy to ResNet-50 while requiring fewer parameters and floating point operations for the same input size.

Three main components constitute EfficientNet-B0: convolutional layers, batch normalization functions, and mobile inverted bottleneck convolution (MBConv) blocks, inspired by the MobileNet [19] concept. MBConv consists of three types of layers: convolutional layers, depth-wise convolutional layers, and squeeze-and-excitation networks (SENet) [56]. The convolutional layers expand and compress channels, whereas the depth-wise convolutional layer reduces the parameter count. In addition, SENet models channel relationships by assigning importance weights to the channels.
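The parameter saving from depth-wise convolution can be checked with simple arithmetic. The sketch below compares a standard convolution with a depth-wise convolution followed by a 1×1 point-wise convolution; the channel count of 96 and the 3×3 kernel are illustrative values, not taken from the EfficientNet-B0 configuration.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Depth-wise k x k convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = k * k * c_in     # one k x k filter per input channel
    pointwise = c_in * c_out     # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


standard = conv_params(96, 96, 3)                   # 82944
separable = depthwise_separable_params(96, 96, 3)   # 864 + 9216 = 10080
print(standard, separable, round(standard / separable, 1))
```

For these illustrative values the separable form needs roughly one eighth of the weights, which is the effect the paragraph attributes to the depth-wise layer.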

The parameters of EfficientNet in AMBOKD were initialized using pre-training on the ImageNet dataset to ensure optimal performance and efficiency. This enables the leverage of knowledge gained from the ImageNet dataset for improved results in our specific task.

The effective representation of visual images from the visual encoder can be easily extracted with the help of EefficientNet. The representation of the visual domain is computed as follows:

F_{v}={\operatorname{EfficientNet}}\left({M_{v}}\right).

(1)

IV-A2 EEG Encoder

MCGRAM [8] is a compact convolutional neural network specifically designed for EEG signals, which comprises three primary components: the frequency encoder, spatial encoder, and temporal encoder.

The frequency encoder can effectively learn frequency-related features and capture the spectral characteristics of EEG signals by employing specific kernels in the multi-scale convolution module. Spatial representations are learned by the spatial encoder using a graph convolution module, which exploits the inherent spatial relationships between the electrode channels, enabling MCGRAM to capture the spatial patterns present in the EEG data. To extract temporal features and obtain the final global representation of the EEG modality, the temporal encoder uses a two-layer long short-term memory network and a self-attention module. This combination allows the model to capture temporal dependencies and crucial temporal patterns in the EEG signals.

In summary, MCGRAM captures and encodes relevant information from EEG signals by cascading frequency, spatial, and temporal blocks. This results in a comprehensive and informative representation for further analysis and classification tasks. The representation of the EEG modality is computed as:

F_{e}={\operatorname{MCGRAM}}\left({M_{e}}\right).

(2)

It should be noted that alternative algorithms can be employed as substitutes for the visual and EEG encoders. For example, algorithms such as ResNet-50 [18] and MobileNetV2 [19] can be used in the visual encoder, whereas EEGNet [57] and TSception [58] are viable options in the EEG encoder. The superiority and generalization of the AMBOKD approach proposed in this study can be effectively demonstrated by incorporating these alternative algorithms and conducting subsequent experimental validations (see Section V-E).

IV-B Fusion of Modality Representation

We develop a fusion model to integrate information from both EEG and visual modalities. This model is treated as a new modality that collaborates with the original modalities in the mutual learning process. To focus on the most important information across various representation subspaces [59], a multi-head self-attention module is employed within the fusion model. This module facilitates the creation of a global representation that captures the combined knowledge.

The multi-head self-attention module is designed to extract and align crucial information from EEG and visual modality features, as illustrated in Fig. 8. Dynamically learned weights are used to weigh and fuse the features of both modalities to obtain global fusion features.

Specifically, a fully connected (FC) layer is used to align the EEG feature $F_{e}\in\mathbb{R}^{f^{e}}$ and visual feature $F_{v}\in\mathbb{R}^{f^{v}}$ of each pair of data. The aligned features are then concatenated to obtain the preliminary fused feature $F_{c}\in\mathbb{R}^{f^{c}}$. Three sets of FC layers are used to independently transform the feature $F_{c}$ into queries $Q$, keys $K$, and values $V$ for the attention layer of the $j$-th head. $Q$ and $K$ are multiplied, scaled, and normalized with the softmax function to learn the attention scores $A\in\mathbb{R}^{f^{c}\times f^{c}}$ between the two modalities:

{A}=\operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_{l}}}\right),

(3)

where $\sqrt{d_{l}}$ is employed to scale the matrix multiplication result. The output of the single-head attention is obtained by multiplying the attention scores $A$ with the values $V$:

{H}_{j}=A_{j}V_{j}.

(4)

Finally, to obtain the final fusion features $F_{f}$ , the outputs of all heads are concatenated and passed through an FC layer with a softmax function:

F_{f}=\operatorname{softmax}\left(\operatorname{FC}\left({||}_{j}^{J}\left(H_{1},H_{2},\dots,H_{J}\right)\right)\right),

(5)

where $||$ denotes the concat function and $J$ represents the number of heads.
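As an illustration of Eqs. (3)-(5), the following minimal sketch computes scaled dot-product attention for one head and concatenates several head outputs. It is not the authors' implementation: the learned FC projections that produce $Q$, $K$, and $V$ and the final FC layer with softmax are omitted, the scaling term $d_{l}$ is assumed to be the key dimension, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def single_head_attention(q, k, v):
    """Eq. (3) and Eq. (4): A = softmax(Q K^T / sqrt(d_l)), H = A V.

    Assumption: d_l is taken as the number of columns of Q and K.
    """
    d_l = len(q[0])
    scores = matmul(q, transpose(k))
    attn = [softmax([s / math.sqrt(d_l) for s in row]) for row in scores]
    return attn, matmul(attn, v)


def concat_heads(heads):
    """The concatenation in Eq. (5): join head outputs along the feature axis."""
    return [sum((h[i] for h in heads), []) for i in range(len(heads[0]))]
```

Each row of the attention matrix sums to one, so every output row is a convex combination of the rows of $V$.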

The features obtained from the three domains are classified separately using their respective classifiers. For example, considering the EEG domain, the classification process can be described as follows:

\text{G}_{e}=W_{e}{F_{e}}+b_{e},

(6)

where $W_{e}$ and $b_{e}$ denote the weights and biases of the EEG classifier, respectively.

IV-C Mutual Learning with Online Knowledge Distillation

The OKD approach enables the teacher model and the student model to influence and adapt to each other during the training process. This dynamic interaction allows the teacher model to adjust the knowledge it transfers, thereby improving the student model’s performance. By employing the OKD approach to facilitate mutual learning between the modalities, different modalities can learn from each other and progress together throughout the training process, fully leveraging the strengths and improving the performance of each modality. Through iterative optimization of each modality, the optimal fusion model for the entire framework is obtained by incorporating the knowledge from all three modalities.

The training process starts by inputting the logits of the three domains, as illustrated in Fig. 9. Each modality is then treated as the student model in turn. For each student modality, its CE loss $\mathcal{L}_{CE}^{S}$ and the KD losses $\mathcal{L}_{KD}^{T_{a}}$ and $\mathcal{L}_{KD}^{T_{b}}$ from the two teacher modalities are calculated. The CE loss $\mathcal{L}_{CE}^{S}$ quantifies the discrepancy between the predictions of the student model and the true labels, and is computed as follows:

\mathcal{L}_{CE}^{S}=-\frac{1}{N}\sum_{i}\left[y_{i}\log\left(p_{i}\right)+\left(1-y_{i}\right)\log\left(1-p_{i}\right)\right],

(7)

where $N$ is the number of samples in a batch, $y_{i}$ is the true label of sample $i$, and $p_{i}$ represents the probability assigned by the student model to sample $i$ belonging to the positive class. Although the CE loss provides valuable learning signals, it is insufficient for enabling the modalities to learn from each other. To address this limitation, the KD loss quantifies the dissimilarity between the teacher and student models, thereby facilitating the distillation of knowledge from the teacher to the student. Given the teacher model, the KD loss is computed using the Kullback-Leibler (KL) divergence:

\mathcal{L}_{KD}^{T}=\frac{1}{N}\sum_{i}q_{i}\log\frac{q_{i}}{p_{i}},

(8)

where $q_{i}$ denotes the probability assigned by the teacher model for sample $i$ belonging to the positive class.

By considering the CE loss and the KD loss, the overall loss $\mathcal{L}_{total}$ for each modality is obtained as follows:

\mathcal{L}_{total}=\mathcal{L}_{CE}^{S}+\alpha{\tau}^{2}\mathcal{L}_{KD}^{T_{a}}+\beta{\tau}^{2}\mathcal{L}_{KD}^{T_{b}},

(9)

where $\tau$ is the distillation temperature in knowledge distillation. $\alpha$ and $\beta$ represent the interaction weights that determine the influence contribution of distillation from the two teacher models and balance the magnitude difference between the CE loss and KD loss.
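The loss terms in Eqs. (7)-(9) can be sketched for the binary case as follows. This is a minimal illustration under stated assumptions, not the released implementation: the probabilities are assumed to be already softened by the temperature $\tau$, the KD term is assumed to sum over both the positive and the negative class so that it is a proper KL divergence, and all function names are hypothetical.

```python
import math


def ce_loss(p, y):
    """Eq. (7): binary cross-entropy averaged over a batch.

    p: student probabilities for the positive class, y: labels in {0, 1}.
    """
    n = len(p)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for pi, yi in zip(p, y)) / n


def kd_loss(q, p):
    """KL divergence from teacher (q) to student (p), averaged over a batch.

    Assumption: both classes are included, i.e. the term for the negative
    class (1 - q) * log((1 - q) / (1 - p)) is added to the term in Eq. (8).
    """
    n = len(q)
    return sum(qi * math.log(qi / pi)
               + (1 - qi) * math.log((1 - qi) / (1 - pi))
               for qi, pi in zip(q, p)) / n


def total_loss(ce, kd_a, kd_b, alpha, beta, tau):
    """Eq. (9): CE loss plus the two temperature-scaled KD losses."""
    return ce + alpha * tau ** 2 * kd_a + beta * tau ** 2 * kd_b
```

With $\alpha=\beta=1$ and $\tau=4$, each KD term is multiplied by 16, which illustrates why the interaction weights also have to balance the magnitudes of the CE and KD losses.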

IV-D Mutual Learning with Adaptive Modality Balancing

To address imbalanced optimization [60] in the learning process of each modality, an AMB module for mutual learning with OKD is proposed. The module consists of two blocks: dynamic weights for KD losses and dynamic ratios for backward gradients, detailed as follows.

IV-D1 Dynamic Weights for KD Losses

In the online knowledge distillation training process, the KD loss from the teacher model encapsulates the distilled knowledge. Most existing methods [49, 61, 53] compute the total training loss by adding the cross-entropy loss to the knowledge distillation loss, typically setting $\alpha$ and $\beta$ to 1 in Eq. (9). However, due to modality differences, such equal weighting fails to effectively capture the superior knowledge from the teacher model. Furthermore, manually setting a fixed distillation loss weight often entails extensive experimentation and significant time costs.

To address the challenges mentioned above, we introduce a weight modulation block that dynamically calculates the weights for KD losses from different teacher models. This block efficiently extracts knowledge during multimodal mutual learning. The entire workflow is described in Algorithm 1 in detail. The saturation function $\text{Sat}\left(\cdot\right)$ is defined as follows for $a<b$: $\text{Sat}\left(x,a,b\right)=a$ if $x\leq a$, $\text{Sat}\left(x,a,b\right)=x$ if $a\leq x\leq b$, and $\text{Sat}\left(x,a,b\right)=b$ if $x\geq b$. The parameters $\alpha_{\min}$ and $\alpha_{\max}$ denote the lower and upper bounds of $\alpha$, and $\beta_{\min}$ and $\beta_{\max}$ play the same role for $\beta$.

Algorithm 1 Dynamic weights modulation process

Require: Training dataset $X$, iteration number $T$, batch size $B_{m}$, knowledge distillation temperature $\tau$

for $t=0,1,\ldots,T$ do
    Sample minibatch $B_{t}$ from $X$ with batch size $B_{m}$
    Feed-forward the batched data $B_{t}$ to the model
    Compute the CE losses $\mathcal{L}_{CE_{(t)}}^{S}$, $\mathcal{L}_{CE_{(t)}}^{T_{a}}$ and $\mathcal{L}_{CE_{(t)}}^{T_{b}}$
    Compute the KD losses $\mathcal{L}_{KD_{(t)}}^{T_{a}}$ and $\mathcal{L}_{KD_{(t)}}^{T_{b}}$
    Compute $\alpha=\text{Sat}\bigg(\frac{\mathcal{L}_{CE_{(t)}}^{S}}{\mathcal{L}_{CE_{(t)}}^{T_{a}}},{\alpha}_{\min},{\alpha}_{\max}\bigg)$
    Compute $\beta=\text{Sat}\bigg(\frac{\mathcal{L}_{CE_{(t)}}^{S}}{\mathcal{L}_{CE_{(t)}}^{T_{b}}},{\beta}_{\min},{\beta}_{\max}\bigg)$
    Compute the total loss $\mathcal{L}_{{total}_{(t)}}^{S}=\mathcal{L}_{CE_{(t)}}^{S}+\alpha{\tau^{2}}\mathcal{L}_{KD_{(t)}}^{T_{a}}+\beta{\tau^{2}}\mathcal{L}_{KD_{(t)}}^{T_{b}}$
end for
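The weight computation in Algorithm 1 can be sketched in a few lines. The bounds passed in the usage example are hypothetical placeholders, since the values of $\alpha_{\min}$, $\alpha_{\max}$, $\beta_{\min}$, and $\beta_{\max}$ are not specified here; the function names are likewise illustrative.

```python
def sat(x, lower, upper):
    """Saturation function: clamp x to the interval [lower, upper]."""
    return max(lower, min(x, upper))


def dynamic_kd_weights(ce_student, ce_teacher_a, ce_teacher_b,
                       alpha_bounds, beta_bounds):
    """Weights for the two KD losses as in Algorithm 1.

    Each weight is the ratio of the student CE loss to the teacher CE loss,
    so a teacher that currently fits the data better than the student
    (lower CE loss) receives a weight above 1.
    """
    alpha = sat(ce_student / ce_teacher_a, *alpha_bounds)
    beta = sat(ce_student / ce_teacher_b, *beta_bounds)
    return alpha, beta


# Hypothetical bounds, chosen only for illustration.
alpha, beta = dynamic_kd_weights(1.0, 0.5, 2.0, (0.1, 10.0), (0.1, 10.0))
```

In this example teacher $T_{a}$ has half the student's CE loss and teacher $T_{b}$ has twice the student's CE loss, so the first KD loss is up-weighted and the second is down-weighted.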

IV-D2 Dynamic Ratios for Backward Gradients

Current multimodal methods do not account for imbalanced optimization across modalities. Thus, we propose a dynamic gradient modulation block that dynamically balances the optimization level of each modality in the multimodal mutual learning process, ensuring that the different modalities are fully optimized.

We further analyze the imbalanced optimization phenomenon during mutual learning by calculating the backward gradient of each modality. In the backward phase of the training process, when the fusion modality acts as the student, the gradient of the CE loss can be calculated as $\frac{\partial\mathcal{L}_{CE}^{f}}{\partial G_{f}^{i}}$ using the following formulas. The CE loss of the fusion modality, denoted by $\mathcal{L}_{CE}^{f}$, is given by

\mathcal{L}_{CE}^{f}=-\frac{1}{N}\sum_{i}^{N}\log\frac{e^{G_{f\left(y_{i}=c\right)}^{i}}}{\sum_{k=1}^{M}e^{G_{f\left(y_{i}=k\right)}^{i}}},

(10)

where $G_{f\left(y_{i}=c\right)}^{i}$ is the predicted value when the true label $y_{i}$ equals class $c$ , and $M$ is the total number of categories. $\operatorname{G}_{f}^{i}$ is the $i$ -th logit output of the fusion model, calculated as

\operatorname{G}_{f}^{i}=W_{f}\left(W_{f}^{v}\phi_{v}(\theta_{v},F_{v}^{i})+W_{f}^{e}\phi_{e}(\theta_{e},F_{e}^{i})\right)+b_{f},

(11)

where $\phi_{v}(\theta_{v},\cdot)$ and $\phi_{e}(\theta_{e},\cdot)$ represent the visual and EEG modality encoders, respectively, with $\theta_{v}$ and $\theta_{e}$ as their parameters. $W_{f}^{v}$ and $W_{f}^{e}$ are the weight matrices that determine the relative significance of the features extracted from the visual and EEG modalities, respectively. $W_{f}$ and $b_{f}$ are the parameters of the fusion process.

Combined with Eq. (10) and Eq. (11), the gradient can be expressed as

\frac{\partial\mathcal{L}_{CE}^{f}}{\partial G_{f\left(y_{i}=c\right)}^{i}}=\frac{e^{{\left(W_{f}W_{f}^{v}\phi_{v}^{i}+W_{f}W_{f}^{e}\phi_{e}^{i}+b_{f}\right)}_{y_{i}=c}}}{\sum_{k=1}^{M}e^{{\left(W_{f}W_{f}^{v}\phi_{v}^{i}+W_{f}W_{f}^{e}\phi_{e}^{i}+b_{f}\right)}_{y_{i}=k}}}-1_{y_{i}=c},

(12)

where $\phi_{v}(\theta_{v},F_{v}^{i})$ and $\phi_{e}(\theta_{e},F_{e}^{i})$ are abbreviated as $\phi_{v}^{i}$ and $\phi_{e}^{i}$ for convenience. When a particular modality, such as the visual modality, exhibits better classification performance, it contributes more to the gradient $\frac{\partial\mathcal{L}_{CE}^{f}}{\partial G_{f}^{i}}$ through the term $W_{f}W_{f}^{v}\phi_{v}^{i}$, thereby leading to a lower global loss. Consequently, the EEG modality, characterized by lower confidence in its correct predictions, receives limited optimization during parameter updates via backpropagation. This phenomenon shows that a unimodal model might be overtrained or under-optimized when the training of the multimodal model is about to converge.

In this imbalanced training phenomenon, the disparate learning efficiencies of the unimodal models hinder the performance of the fusion model, resulting in under-optimized or overfitted representations that constrain the overall model performance. Thus, our proposed dynamic modulation block aims to dynamically balance the learning levels of the modalities until they reach their optimal performance.
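The gradient in Eq. (12) is the familiar "softmax minus one-hot" form. The sketch below checks it against a finite-difference estimate for a single sample (the $1/N$ batch average is omitted) and shows that a confidently correct logit vector yields a much smaller gradient than an uncertain one, which is the mechanism behind the imbalance described above. The logit values are hypothetical and the function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce(logits, label):
    """Per-sample cross-entropy on logits."""
    return -math.log(softmax(logits)[label])


def analytic_grad(logits, label):
    """Eq. (12) for one sample: softmax(G)_k minus the one-hot label."""
    probs = softmax(logits)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]


def numeric_grad(logits, label, h=1e-6):
    """Central finite-difference estimate of d CE / d logits."""
    grads = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += h
        down[k] -= h
        grads.append((ce(up, label) - ce(down, label)) / (2 * h))
    return grads


# Hypothetical fused logits: one confident sample, one uncertain sample.
confident = analytic_grad([4.0, 0.0], 0)
uncertain = analytic_grad([0.2, 0.0], 0)
```

Once a dominant modality pushes the fused logits toward a confident correct prediction, the gradient that flows back to every encoder shrinks, so the weaker encoder receives little further training signal.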

The updating process is presented in Algorithm 2. Specifically, the ratios $R^{S}_{(t)}$, $R^{T_{a}}_{(t)}$ and $R^{T_{b}}_{(t)}$ are computed to indicate the current optimization level of each modality. To keep the optimization progress consistent and to maximize the completeness and effectiveness of the optimization in multimodal mutual learning, the dynamic learning progress ratio in the $t$-th iteration step is computed as:

R_{(t)}^{DG}=\begin{cases}1&\text{if }E_{n}=1\\ \text{Sat}\left(\big(\frac{R_{(t)}^{T_{a}}+R_{(t)}^{T_{b}}}{2\times R_{(t)}^{S}}\big)^{\gamma},R_{\min},R_{\max}\right)&\text{if }E_{n}>1\end{cases}

(13)

where $R_{\min}$ and $R_{\max}$ represent the lower and upper bounds of the saturation function, $E_{n}$ is the current epoch, and $\gamma$ is a hyper-parameter that controls the degree of modulation.
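Eq. (13) can be sketched as follows, using the progress ratios defined in Algorithm 2 (relative decrease of the CE loss with respect to its first-epoch average) and the settings reported in Section V-A1 ($\gamma=3$, $R_{\min}=0.1$, $R_{\max}=10$). The function names are hypothetical, and the sketch assumes a strictly positive student ratio; how a zero or negative ratio is handled is not specified here.

```python
def sat(x, lower, upper):
    """Saturation function: clamp x to the interval [lower, upper]."""
    return max(lower, min(x, upper))


def progress_ratio(base_loss, current_loss):
    """Relative decrease of the CE loss with respect to its first-epoch average."""
    return (base_loss - current_loss) / base_loss


def dynamic_gradient_ratio(r_student, r_teacher_a, r_teacher_b, epoch,
                           gamma=3.0, r_min=0.1, r_max=10.0):
    """Eq. (13): modulation ratio for the student's parameter update.

    Assumption: r_student > 0 whenever epoch > 1.
    """
    if epoch == 1:
        return 1.0
    mean_teacher = (r_teacher_a + r_teacher_b) / 2.0
    return sat((mean_teacher / r_student) ** gamma, r_min, r_max)
```

A student that lags behind its teachers (smaller progress ratio) obtains a ratio above 1 and therefore a larger update, whereas a student that is ahead of its teachers is slowed down.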

The extensively used adaptive moment estimation (Adam) optimization approach is employed in the backpropagation process. The primary idea underlying the Adam algorithm is to maintain a running average of the first-order moment (mean) and the second-order moment (variance) of the gradients. This facilitates the estimation of adaptive learning rates for various parameters during optimization. Using the dynamic modulation ratio $R_{(t)}^{DG}$ , the parameters $\theta^{S}$ in the student model are optimized adaptively based on the modality learning progress. The updating process is expressed as

{\theta}^{S}_{(t+1)}\leftarrow{\theta}^{S}_{(t)}-R_{(t)}^{DG}\frac{\eta}{\sqrt{D^{\rm{var}}_{(t)}}+\epsilon}\cdot D^{\rm{mean}}_{(t)}

(14)

where $\eta$ is the learning rate and $\epsilon$ represents a small constant added to the denominator for numerical stability. $D^{\rm{mean}}_{(t)}$ and $D^{\rm{var}}_{(t)}$ represent the exponentially decaying average of the past gradients’ first moment estimate (mean) and second moment estimate (variance) at time $t$ .

Incorporating the AMB module into the loss computation and the optimization process addresses the challenge of modality imbalance in multimodal mutual learning. This mechanism ensures that all modalities learn effective knowledge and are optimized at a comparable pace during mutual learning. Consequently, the fusion model achieves enhanced completeness, integration, and efficiency.

Algorithm 2 Dynamic gradients modulation process

Require: Training dataset $X$, iteration number $T$, batch size $B_{m}$, model parameters $\theta^{S}$, learning rate $\eta$, hyper-parameter $\gamma$, current epoch $E_{n}$

for $t=0,1,\ldots,T$ do
    Sample minibatch $B_{t}$ from $X$ with batch size $B_{m}$
    Feed-forward the batched data $B_{t}$ to the model
    Compute the losses $\mathcal{L}_{CE_{(t)}}^{S}$, $\mathcal{L}_{CE_{(t)}}^{T_{a}}$, $\mathcal{L}_{CE_{(t)}}^{T_{b}}$ and $\ell_{total}\left(x;\theta_{(t)}^{S}\right)$
    if $E_{n}==1$ then
        Set $R_{(t)}^{DG}\leftarrow 1$
    else
        Update $R^{S}_{(t)}\leftarrow\frac{\mathcal{L}_{\rm{base}}^{S}-\mathcal{L}_{CE_{(t)}}^{S}}{\mathcal{L}_{\rm{base}}^{S}}$, $R_{(t)}^{T_{a}}\leftarrow\frac{\mathcal{L}_{\rm{base}}^{T_{a}}-\mathcal{L}_{CE_{(t)}}^{T_{a}}}{\mathcal{L}_{\rm{base}}^{T_{a}}}$, $R_{(t)}^{T_{b}}\leftarrow\frac{\mathcal{L}_{\rm{base}}^{T_{b}}-\mathcal{L}_{CE_{(t)}}^{T_{b}}}{\mathcal{L}_{\rm{base}}^{T_{b}}}$
        Compute $R_{(t)}^{DG}=\text{Sat}\big(\big(\frac{R_{(t)}^{T_{a}}+R_{(t)}^{T_{b}}}{2\times R_{(t)}^{S}}\big)^{\gamma},R_{\min},R_{\max}\big)$
    end if
    Compute $\tilde{g}\left(\theta_{(t)}^{S}\right)=\frac{1}{B_{m}}\sum_{x\in B_{(t)}}\nabla_{\theta^{S}}\ell_{total}\left(x;\theta_{(t)}^{S}\right)$
    Compute $D^{\rm{mean}}_{(t)}=\beta_{1}\cdot D^{\rm{mean}}_{(t-1)}+\left(1-\beta_{1}\right)\cdot\tilde{g}\big(\theta_{(t)}^{S}\big)$
    Compute $D^{\rm{var}}_{(t)}=\beta_{2}\cdot D^{\rm{var}}_{(t-1)}+\left(1-\beta_{2}\right)\cdot\tilde{g}^{2}\big(\theta_{(t)}^{S}\big)$
    Update ${\theta}^{S}_{(t+1)}\leftarrow{\theta}^{S}_{(t)}-R_{(t)}^{DG}\frac{\eta}{\sqrt{D^{\rm{var}}_{(t)}}+\epsilon}\cdot D^{\rm{mean}}_{(t)}$
end for
if $E_{n}==1$ then
    Calculate average CE losses $\mathcal{L}_{\rm{base}}^{S},\mathcal{L}_{\rm{base}}^{T_{a}},\mathcal{L}_{\rm{base}}^{T_{b}}$
end if
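The parameter update at the end of the loop in Algorithm 2 can be sketched as a single modulated Adam-style step on a flat list of parameters. As in Eq. (14), no bias correction is applied. The learning rate follows Section V-A1, whereas the values of $\beta_{1}$, $\beta_{2}$, and $\epsilon$ are common Adam defaults assumed for illustration; the function name is hypothetical.

```python
import math


def modulated_adam_step(theta, grad, d_mean, d_var, ratio,
                        lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update of Eq. (14) without bias correction.

    theta, grad, d_mean, d_var: lists of floats of equal length.
    ratio: the dynamic modulation ratio R^DG for the current step.
    beta1, beta2, eps: assumed defaults, not taken from the paper.
    """
    new_mean = [beta1 * m + (1 - beta1) * g for m, g in zip(d_mean, grad)]
    new_var = [beta2 * v + (1 - beta2) * g * g for v, g in zip(d_var, grad)]
    new_theta = [t - ratio * lr / (math.sqrt(v) + eps) * m
                 for t, m, v in zip(theta, new_mean, new_var)]
    return new_theta, new_mean, new_var
```

Because the ratio multiplies the whole Adam step, it scales the effective learning rate of the student modality without changing the moment estimates themselves.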

TABLE I: Comparison of the baseline and state-of-the-art methods in subject-independent experiments

Method	Session 1					Session 2
Method	AUC/Std (%)	F1/Std (%)	ACC/Std (%)	Precision/Std (%)	p-value (AUC)	AUC/Std (%)	F1/Std (%)	ACC/Std (%)	Precision/Std (%)	p-value (AUC)
EEGNet	70.65/6.55	69.12/6.07	71.68/4.98	69.84/6.14	$<10^{-3}$	68.86/5.46	67.44/5.76	69.51/5.71	68.00/5.20	$<10^{-3}$
MCGRAM	82.47/7.11	75.18/8.47	76.69/6.55	76.04/9.48	$<10^{-3}$	80.44/5.32	72.33/6.12	73.59/5.81	74.34/4.95	$<10^{-3}$
AMBOKD-E^*	83.56/6.61	77.66/8.21	78.59/7.19	78.25/8.31	-	81.85/5.71	75.95/5.75	76.71/5.09	77.05/5.07	-
ResNet-50	84.55/7.54	75.45/11.26	77.37/9.93	79.84/5.21	$<10^{-3}$	84.42/6.63	76.04/7.43	77.04/7.36	79.53/6.03	$<10^{-3}$
EfficientNet	78.81/8.89	81.82/2.79	83.96/2.40	87.02/1.47	$<10^{-3}$	84.13/3.92	80.68/4.63	82.81/4.15	86.53/2.34	$<10^{-3}$
AMBOKD-V^*	88.54/3.50	83.08/2.52	84.88/2.25	87.54/1.49	-	87.88/5.22	82.22/3.84	83.96/3.59	87.18/2.15	-
MLB	86.56/3.69	82.31/2.73	84.21/2.40	86.66/1.62	$<10^{-3}$	84.74/4.42	80.95/7.85	82.74/6.13	85.20/3.64	$<10^{-3}$
MKD	90.21/4.45	82.87/2.45	84.73/2.21	87.54/1.39	$<10^{-3}$	89.35/3.54	81.80/3.97	83.65/3.68	87.04/2.12	$<10^{-3}$
EMKD	90.51/4.22	82.65/2.42	84.57/2.18	87.42/1.34	$<10^{-3}$	88.93/3.50	80.95/4.00	83.00/3.68	86.62/2.06	$<10^{-3}$
CA-MKD	90.52/4.19	82.78/2.52	84.67/2.24	87.46/1.42	$<10^{-3}$	89.10/3.39	81.45/4.03	83.39/3.70	86.83/2.20	$<10^{-3}$
DML^*	90.46/3.68	79.73/4.97	82.44/3.85	85.89/2.38	$<10^{-3}$	88.87/3.21	77.66/5.50	80.60/4.52	84.84/2.87	$<10^{-3}$
KDCL^*	89.91/4.22	79.05/4.72	81.74/3.71	84.55/2.25	$<10^{-3}$	89.12/3.11	78.70/4.63	81.34/4.02	85.42/2.30	$<10^{-3}$
EML^*	89.21/4.02	80.25/2.92	82.52/2.54	84.74/1.92	$<10^{-3}$	90.64/3.06	80.89/4.06	82.93/3.74	86.40/2.16	$<10^{-3}$
AMBOKD (ours)^*	93.66/2.95	83.33/2.51	85.08/2.24	87.82/1.38	-	93.30/2.64	82.32/3.66	84.05/3.42	87.27/2.02	-

^* Online knowledge distillation method

V Experimental Results

In this section, the experimental settings with the parameter configurations of our proposed approach are first presented. Subsequently, a comprehensive comparison and ablation study are conducted to evaluate the performance of the proposed AMBOKD approach in the dim object recognition task. Then, the effectiveness and generalisability of the OKD and AMB modules incorporated in AMBOKD are evaluated. In addition, the transfer experiments and system verifications by simulating real scenarios are designed to further verify the generalization performance of the proposed method and system under few-shot conditions.

V-A Experimental Settings

V-A1 Configurations

The visual encoder uses EfficientNet [22] as the basic approach for feature extraction from the input images. Furthermore, for ablation analysis, ResNet-50 [18] and MobileNetV2 [19] are used as alternative visual encoders. The EEG encoder is based on our previous study called MCGRAM [8], which is employed to extract features from the EEG data. In addition, TSception [58] and EEGNet [57] are used as alternative EEG encoders for the ablation analysis. The parameters of the visual and the EEG encoders are set to follow the configurations as described in the previous study, providing consistency and comparability with the existing studies. To align the features obtained from the visual and EEG encoders, FC layers are used to scale the lengths of both sets of features to a common length of 64. This ensures compatibility for subsequent processing and fusion of the multimodal features.

The number of heads is set to 2 in the multi-head self-attention module of the fusion model. The weight matrices of the FC layers for alignment have sizes of $1280\times 64$ and $256\times 64$ for the visual and EEG features, respectively. The FC layers used to compute the Q, K, and V matrices have a size of $64\times 64$. The FC layer in the classifier of the fusion domain has a size of $128\times 2$.

The temperature coefficient $\tau$ is set to 4 in the OKD approach to control the softness of the KD targets. The hyperparameters $\alpha$ and $\beta$ are dynamically adjusted by the AMB module, which contributes to the regularization component of the overall loss function. The hyperparameter $\gamma$ is set to 3 to fine-tune the sensitivity of the AMB module. In addition, we impose lower and upper limits, $R_{\min}$ and $R_{\max}$ at 0.1 and 10, respectively, to confine the range of dynamic modulation ratio $R^{DG}_{(t)}$ during the training process.

The PyTorch framework is used to implement the AMBOKD model. The Adam optimizer with a learning rate of $10^{-4}$ is used to optimize all network parameters. The model is trained for 15 epochs with a batch size of 64. The optimization loss function comprises the CE loss and the KD loss. A leave-one-subject-out cross-validation approach is used to assess the performance of the AMBOKD model: the data of each subject are chosen in turn as the validation set, whereas the data of the remaining nine subjects are employed as the training set. We repeat this process for each fold with five different seeds and obtain the final result by averaging over all folds and seeds.
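The evaluation protocol described above can be sketched as a generator over folds and seeds. The subject identifiers and seed values are hypothetical placeholders, and the function names are illustrative.

```python
def leave_one_subject_out(subjects, seeds):
    """Yield (seed, validation_subject, training_subjects) for every run."""
    for seed in seeds:
        for held_out in subjects:
            train = [s for s in subjects if s != held_out]
            yield seed, held_out, train


def mean(values):
    return sum(values) / len(values)


subjects = list(range(1, 11))   # ten subjects
seeds = [0, 1, 2, 3, 4]         # five seeds, values assumed for illustration
runs = list(leave_one_subject_out(subjects, seeds))
```

With ten subjects and five seeds this yields 50 runs, each trained on nine subjects; the reported score is the mean of the per-run metric over these runs.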

V-A2 Performance Metrics

A comprehensive set of metrics, including the area under the receiver operating characteristic curve (AUC), accuracy (ACC), F1 score (F1), and Precision, is used to evaluate the performance of the classification models. Each metric is reported as mean and standard deviation, offering insight into the reliability of the model and its ability to generalize across various datasets or conditions. Specifically, the AUC is emphasized as the main metric because of its effectiveness in measuring the discriminative ability of a model, especially in scenarios with unbalanced class distributions. To evaluate the object detection performance of the brain-eye-computer based system, AP@50:95 and AR@50:95 are used to measure the average precision and average recall over Intersection over Union (IoU) thresholds ranging from 0.50 to 0.95, while AP@50 and AR@50 are used for an IoU threshold of 0.50. In addition, F1@50:95 and F1@50 provide a balanced view of detection performance by considering both precision and recall at these thresholds.
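For reference, the two quantities underlying these metrics can be computed as follows. The AUC is evaluated through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one), and the IoU is computed for axis-aligned boxes given as `(x1, y1, x2, y2)`; this box format and the function names are assumptions made for illustration.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Assumes both classes are present.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A detection counts as correct at a given threshold when its IoU with a ground-truth box is at least that threshold.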

TABLE II: Subject-dependent results of the baseline and state-of-the-art methods in comparison studies.

Session	Method	AUC/Std (%)
Session	Method	Subject 1	Subject 2	Subject 3	Subject 4	Subject 5	Subject 6	Subject 7	Subject 8	Subject 9	Subject 10	Average
1	EEGNet	63.57/6.44	64.12/6.68	61.60/7.72	54.55/7.16	60.14/8.58	61.49/7.17	54.22/5.27	61.66/6.21	69.59/7.25	56.30/6.83	60.72/6.93
	MCGRAM	81.67/7.49	87.10/5.54	86.36/6.09	78.20/9.01	80.22/6.75	82.31/5.68	77.47/6.90	79.59/6.15	91.79/4.71	78.29/5.65	82.30/6.40
	AMBOKD-E^*	86.98/6.35	90.57/4.70	88.78/6.01	81.64/7.71	87.30/4.85	88.33/4.83	81.41/5.78	86.18/4.50	93.66/3.80	82.56/4.63	86.74/5.32
	ResNet-50	83.22/6.03	81.73/6.00	81.67/5.53	79.56/6.87	80.75/6.48	77.86/7.12	83.37/6.43	79.29/6.69	81.68/5.75	80.85/6.42	81.00/6.33
	EfficientNet	76.28/9.97	74.91/10.87	79.74/8.73	79.27/8.30	76.99/8.40	73.59/10.40	75.54/8.19	76.70/8.96	78.75/8.27	73.09/9.43	76.49/9.15
	AMBOKD-V^*	78.28/10.18	82.63/8.81	86.31/6.56	82.39/7.88	83.74/5.84	81.64/7.09	80.86/8.42	82.18/6.91	79.97/7.04	78.67/8.89	81.67/7.76
	MLB	89.82/4.73	87.44/5.40	92.07/5.05	84.22/8.33	86.28/6.54	75.79/9.03	83.75/6.79	83.18/7.33	92.33/4.46	83.86/7.21	85.87/6.49
	MKD	89.31/5.74	89.51/5.12	92.87/4.38	85.81/7.00	88.46/5.32	82.62/6.12	87.41/6.28	87.59/5.12	93.46/3.45	85.32/6.81	88.24/5.53
	EMKD	90.38/4.99	89.56/5.15	92.93/4.30	85.73/6.72	88.70/5.15	82.20/6.06	86.62/6.62	86.38/5.89	94.06/3.24	85.28/6.92	88.18/5.50
	CA-MKD	90.23/4.97	89.62/5.12	92.45/4.43	85.93/7.29	87.52/5.81	81.72/7.05	86.17/6.43	87.72/4.90	93.11/3.98	85.62/6.31	88.01/5.63
	DML^*	91.02/4.73	89.68/4.61	92.44/4.35	86.63/7.18	88.35/5.49	89.14/4.49	87.64/5.70	89.45/4.22	94.07/3.28	87.73/5.43	89.62/4.95
	KDCL^*	91.04/5.00	89.45/4.97	91.51/5.11	86.35/7.20	87.30/6.23	87.82/5.71	86.91/6.12	87.43/4.60	93.83/3.82	85.23/7.05	88.69/5.58
	EML^*	88.50/5.88	83.49/7.64	87.76/5.82	83.82/7.27	84.06/7.61	85.95/6.25	81.47/8.63	84.48/5.73	90.63/4.44	80.23/7.66	85.04/6.69
	AMBOKD (ours)^*	94.29/3.89	94.29/3.63	95.37/3.30	89.56/5.53	93.99/3.49	94.50/3.30	92.34/5.25	93.80/3.52	96.69/2.64	91.78/3.56	93.63/3.81
2	EEGNet	58.40/6.78	66.72/6.18	55.52/5.99	59.68/7.43	60.74/8.02	57.46/6.14	58.15/8.59	67.20/6.64	62.72/7.64	57.44/6.60	60.40/7.00
	MCGRAM	69.29/8.21	86.73/5.49	72.86/6.83	79.39/7.15	79.54/7.29	78.40/7.25	74.39/7.99	86.32/4.77	90.01/4.01	77.03/5.56	79.40/6.46
	AMBOKD-E^*	74.87/7.51	90.37/4.73	78.00/6.94	87.15/4.67	85.99/6.60	82.78/7.19	79.37/6.62	90.21/3.56	92.50/3.17	83.13/5.11	88.39/5.61
	ResNet-50	83.49/6.72	80.59/6.89	82.76/6.37	79.72/5.42	84.70/5.41	79.40/6.48	80.23/6.51	82.58/6.40	78.43/5.51	80.55/6.37	81.25/6.21
	EfficientNet	73.77/9.28	80.58/8.90	73.05/10.96	73.54/11.58	74.76/10.13	75.96/10.85	74.12/8.66	74.42/8.60	75.33/10.74	72.65/10.40	74.82/10.01
	AMBOKD-V^*	85.02/6.04	86.06/6.74	86.33/72.74	83.79/6.95	82.37/7.33	83.66/7.08	78.47/72.58	74.87/7.78	79.45/7.52	77.37/8.52	81.74/7.25
	MLB	78.24/8.62	90.62/6.12	80.23/6.53	85.27/5.69	82.99/7.49	77.27/8.72	85.00/5.92	83.19/5.93	89.99/5.39	85.18/5.83	83.80/6.62
	MKD	81.98/7.78	92.82/3.91	79.38/7.58	84.04/6.68	85.08/6.22	81.48/7.88	85.15/6.48	87.82/5.19	91.60/4.86	85.29/6.52	85.46/6.31
	EMKD	82.38/8.14	93.69/3.66	80.53/6.96	84.86/6.43	85.53/6.13	78.71/8.58	85.12/6.50	87.12/5.97	91.88/4.50	84.75/6.69	85.46/6.36
	CA-MKD	86.01/8.41	94.19/3.82	85.09/5.96	84.16/6.64	84.05/6.99	79.97/8.15	85.35/6.22	87.40/5.16	93.72/3.69	84.40/7.17	86.43/6.22
	DML^*	81.09/8.21	89.87/4.85	81.62/6.94	85.18/5.94	88.33/5.25	87.97/4.57	85.89/6.29	89.82/3.82	92.58/4.01	83.76/6.28	86.61/5.62
	KDCL^*	79.15/8.96	86.05/6.04	78.99/6.65	83.08/7.13	86.93/5.31	87.12/4.45	84.96/6.85	88.41/3.94	92.74/4.12	85.20/6.32	85.26/5.98
	EML^*	81.09/8.18	83.99/7.24	79.19/6.87	81.25/7.49	87.51/4.86	85.89/6.38	83.90/6.09	87.89/4.76	89.21/5.30	83.53/6.52	84.34/6.37
	AMBOKD (ours)^*	89.56/5.20	96.97/2.40	90.37/4.76	93.94/3.61	94.09/4.00	93.08/3.99	90.54/4.64	95.33/2.44	96.58/2.99	91.83/3.96	93.23/3.80

^* Online knowledge distillation method

V-B Comparison Results

V-B1 Comparison results on ESSVP dataset

To verify the effectiveness of the proposed AMBOKD approach, we conduct a series of comparison experiments with baseline and state-of-the-art models, including the unimodal approaches EEGNet [57], MCGRAM [8], ResNet-50 [18], and EfficientNet [22], the multimodal fusion approach MLB [62], the multimodal knowledge distillation approaches MKD [49], EMKD [51], and CA-MKD [50], and the multimodal online knowledge distillation approaches DML [47], KDCL [52], and EML [53]. In addition, the EEG and visual modality models trained within AMBOKD are named AMBOKD-E and AMBOKD-V, respectively.

As indicated in Table I, our proposed AMBOKD achieves the best results in the dim object recognition task of the subject-independent experiments, surpassing the baseline and state-of-the-art approaches. Meanwhile, the mutual learning mode with the AMB module enhances the performance of the EEG and visual modality models: all metrics of AMBOKD-E and AMBOKD-V surpass those of the original unimodal models MCGRAM and EfficientNet. The AUC-based ANOVA statistical test results demonstrate that our proposed AMBOKD, AMBOKD-E, and AMBOKD-V significantly outperform the corresponding comparison approaches in subject-independent experiments ( $p<0.05$ ).

Table II shows the comparison results obtained from subject-dependent experiments. Notably, optimal results are achieved across all subjects using our approach, indicating significant advantages. This finding serves as a compelling demonstration of the effectiveness and adaptability of our method when applied to subject-specific experiments within practical scenarios.

These results lead to the conclusion that the AMBOKD method effectively combines the cognitive and visual domains, extracting fused global representations. Furthermore, it enables dynamic equilibrium mutual learning between unimodality models and the fusion modality during the training process, leading to improved performance in the fusion, EEG, and visual models.

V-B2 Comparison results on CIFAR-100 dataset

We conduct experiments on the CIFAR-100 dataset [63] following previous work [64]. The baseline methods include offline knowledge distillation methods (Vanilla [15], PKT [65]), representation knowledge distillation methods (CRD [66], RKD [67]), and online knowledge distillation methods (DML [47], KDCL [52], FT-KD [64], PESF-KD [64]). In the experiment, the teacher-student network pairs include ResNet56-ResNet20, ResNet110-ResNet32, ResNet56-VGG8, and VGG13-VGG8.

We use the same training setup as described in [64], employing an SGD optimizer with a momentum of 0.9, a batch size of 64, and a weight decay of $5\times 10^{-4}$. The initial learning rate is set to 0.05 and is divided by 10 every 30 epochs starting from epoch 150. For the AMB module, we specify a learning rate of $1\times 10^{-4}$. The comparison results with three different random seeds, presented as top-1 accuracy (mean/standard deviation), are shown in Table III. Our AMBOKD achieves the best performance among all compared methods, demonstrating its robustness and practicability.
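The step schedule described above can be written as a small function. This is a sketch under two assumptions that are not stated in the text: epochs are counted from zero, and the first decay is applied at epoch 150 itself. The function name is hypothetical.

```python
def step_lr(epoch, base_lr=0.05, first_decay=150, interval=30, factor=10.0):
    """Learning rate at a given epoch for the step-decay schedule.

    Assumptions: zero-based epochs; the rate is divided by `factor` at
    `first_decay` and again every `interval` epochs afterwards.
    """
    if epoch < first_decay:
        return base_lr
    num_decays = (epoch - first_decay) // interval + 1
    return base_lr / factor ** num_decays
```

Under these assumptions the learning rate is 0.05 up to epoch 149, 0.005 for epochs 150 to 179, and 0.0005 for epochs 180 to 209.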

TABLE III: Comparison results on CIFAR-100 Test set

Method	ACC/Std (%)
Method	ResNet56-20	ResNet110-32	VGG13-8	ResNet56-VGG8
Vanilla [15]	70.95/0.51	73.08/0.42	73.36/0.24	73.98/0.33
PKT [65]	71.27/-	73.67/-	73.40/-	74.10/-
CRD [66]	71.44/-	73.62/-	73.31/-	74.06/-
RKD [67]	71.47/-	73.53/-	74.15/-	73.35/-
KDCL [52]	70.11/-	72.87/-	73.99/-	73.16/-
DML [47]	71.40/-	72.21/-	74.18/-	73.86/-
FT-KD [64]	71.65/0.11	73.90/0.22	73.52/0.14	74.40/0.20
PESF-KD [64]	71.84/0.27	74.23/0.26	74.74/0.39	74.67/0.28
AMBOKD (ours)	72.26/0.16	74.35/0.27	74.95/0.26	74.84/0.27

V-C Ablation Analysis

We design ablation experiments and analyze the results both quantitatively and through visualization.

V-C1 Data Analysis

The impact of incorporating OKD with mutual learning and AMB in multimodal learning is investigated through a detailed ablation analysis. Specifically, the role of each component is examined by progressively simplifying the proposed AMBOKD approach. MMOKD-DG and MMOKD-DK are the varients of our method, which keep the part of dynamic weights block and the dynamic gradients block in the AMB module, seprately. Then, MMOKD omits the entire AMB component, serving as a baseline to evaluate the underlying OKD and mutual learning mechanism. MKD further eliminates the online training method, representing a multi-teacher KD framework. The V-KD and E-KD approaches are further simplifications, using single-teacher KD from the visual and EEG modalities, respectively. Lastly, AMM is the most basic form of our model, which depends solely on a multi-head self-attention mechanism for modality fusion, without any KD. In addition, our analysis incorporates the unimodality encoders MCGRAM for EEG and EfficientNet for the visual domain.

The ablation results are presented in Table IV. As can be seen, the unimodality encoders MCGRAM and EfficientNet are outperformed by the AMM model, which leverages multi-head self-attention for fusion, thereby validating the capacity of the multi-head self-attention mechanism to integrate multimodal data effectively. The results of E-KD, V-KD, and MKD further verify that the knowledge distillation mechanism improves the performance of the multimodal fusion model. MMOKD outperforms MKD by 1.37% and 1.90% in AUC in sessions 1 and 2, respectively, which demonstrates the effectiveness of the mutual learning method in achieving full modal interaction and improving multimodal model performance. The results of the MMOKD-DG and MMOKD-DK methods are further improved compared with the MMOKD method, which indicates that adaptively balancing the gradients and the interaction between the models of different modalities provides a more effective training scheme. Our full method, AMBOKD, combines the two dynamic balancing schemes and achieves the best results, 2.08% and 2.05% higher than MMOKD, demonstrating the significant potential of modal balancing in multimodal mutual learning tasks.

TABLE IV: The ablation studies of different modules in subject-independent experiments

Method	Session 1		Session 2
Method	AUC/Std (%)	p-value (AUC)	AUC/Std (%)	p-value (AUC)
MCGRAM	82.47/7.11	$<10^{-3}$	80.44/5.32	$<10^{-3}$
EfficientNet	78.81/8.89	$<10^{-3}$	84.13/3.92	$<10^{-3}$
AMM	87.16/3.62	$<10^{-3}$	87.75/3.64	$<10^{-3}$
EKD	88.48/4.28	$<10^{-3}$	88.68/3.73	$<10^{-3}$
VKD	88.08/4.20	$<10^{-3}$	87.53/4.76	$<10^{-3}$
MKD	90.21/4.45	$<10^{-3}$	89.35/3.54	$<10^{-3}$
MMOKD	91.58/3.89	$<10^{-3}$	91.25/2.47	$<10^{-3}$
MMOKD-DK	91.90/3.48	$<10^{-3}$	91.61/2.92	$<10^{-3}$
MMOKD-DG	92.31/3.54	$<10^{-3}$	92.59/2.21	$<10^{-3}$
AMBOKD (ours)	93.66/2.95	-	93.30/2.64	-
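The AUC gains quoted in the ablation discussion follow directly from Table IV. The short check below recomputes them as differences in percentage points; the dictionary `auc` and the helper `gain` are names introduced here for illustration only.

```python
# AUC (%) from Table IV, stored as (session 1, session 2).
auc = {
    "MKD": (90.21, 89.35),
    "MMOKD": (91.58, 91.25),
    "AMBOKD": (93.66, 93.30),
}


def gain(method, baseline):
    """Per-session AUC difference in percentage points, rounded to 2 decimals."""
    return tuple(round(a - b, 2) for a, b in zip(auc[method], auc[baseline]))


print(gain("MMOKD", "MKD"))     # (1.37, 1.9)
print(gain("AMBOKD", "MMOKD"))  # (2.08, 2.05)
```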

V-C2 Visualization Analysis

The latent representations of the models in the ablation analysis are visualized using t-distributed stochastic neighbor embedding (t-SNE) to further verify the feature extraction and fusion capability of our proposed approach. As depicted in Fig. 10, the feature distributions of all subjects are displayed as dots of various colors, with red dots representing the features of the target class and dots in other colors representing the features of the nontarget class.

Specifically, unimodality models such as EfficientNet and MCGRAM exhibit distinct visual and EEG modality characteristics in their feature representations, as illustrated in Fig. 10. The visual-based EfficientNet model exhibits a more dispersed feature representation, with a large intra-class distance, indicating strong feature extraction performance but poor generalization of the classifier. Conversely, the feature representation of the EEG-based MCGRAM model is more clustered, with a small intra-class distance, indicating weaker feature extraction ability but stronger classifier generalization. As a multimodal fusion model, AMM effectively combines the benefits of the vision and EEG modalities, thereby further reducing the inter-class distance while maintaining the feature representation capability. By integrating single-teacher guidance mechanisms into the multimodal fusion model, the V-KD and E-KD models exhibit enhanced classification performance and generalization, as reflected in larger inter-class and smaller intra-class differences in feature representation. In addition, with the introduction of the multi-teacher mechanism, the OKD approach, and the multimodal AMB approach, the feature representations extracted by the MKD, MMOKD, and AMBOKD models become more discriminative, and the distances between feature representations within the same class become smaller. This indicates that these mechanisms enhance the capability of multimodal fusion and effectively improve the classifier’s accuracy and generalization.
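The intra-class and inter-class distances discussed above are read qualitatively from the t-SNE plots. As a hedged illustration of what these two quantities mean, the sketch below computes them for toy 2-D embeddings; the helper names and the sample points are made up and are not data from the paper. Because t-SNE does not preserve distances faithfully, such measurements are more meaningful on the original latent features than on the 2-D projection.

```python
import math


def centroid(points):
    """Mean point of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def intra_class_distance(points):
    """Mean Euclidean distance from each point to its class centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


def inter_class_distance(class_a, class_b):
    """Euclidean distance between the centroids of two classes."""
    return math.dist(centroid(class_a), centroid(class_b))


# Toy embeddings (made-up values): a spread-out target class and a tight nontarget class.
target = [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]]
nontarget = [[10.0, 1.0], [12.0, 1.0]]

print(intra_class_distance(target))
print(intra_class_distance(nontarget))
print(inter_class_distance(target, nontarget))
```

A representation that separates the classes well has a small intra-class distance relative to its inter-class distance.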

In conclusion, the ablation study verifies the effectiveness of the AMBOKD framework and its modules. The proposed AMBOKD approach outperforms the other methods in the multimodal fusion task, exhibiting robust discrimination between positive and negative samples, an increase in inter-class distance, a reduction in intra-class distance, and statistically significant experimental findings ( $p<0.05$ ).

V-D Adaptive Modality Balancing Analysis

We then analyze the effectiveness of the AMB module and examine its sensitivity setting.

V-D1 The effectiveness of the AMB

We use the dynamic learning progress ratio $R^{DG}_{(t)}$ to assess the effectiveness of the modality equilibrium. The training data of the first fold of the 10-fold cross-validation, with a random seed of 1, are used.

As illustrated in Fig. 11, we plot the changes of $R^{DG}_{(t)}$ for each modality within MMOKD, MMOKD-DG, and AMBOKD. In the standard MMOKD approach, despite the intermodal learning capabilities, optimization is impeded by the inherent characteristics of each modality. Notably, the EEG modality progresses more slowly than the Visual and Fusion modalities, which are optimized faster. This discrepancy in training speeds results in a suboptimal mutual learning effect among the modalities. The Dynamic Gradient (DG) block addresses this issue by dynamically adjusting the training gradients according to each modality’s learning progress. With this adjustment, MMOKD-DG improves the optimization speed and the final optimization degree of each modality, especially the EEG modality. AMBOKD further employs a dynamic weight modulation module tailored to the KL loss, optimizing the faster-learning Visual and Fusion modalities initially while also supporting sustained development in the EEG modality. This strategy ensures that all three modalities are optimally and uniformly enhanced throughout the training process, achieving superior overall performance.
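The general idea of balancing modalities by their learning progress can be illustrated with a minimal sketch. It is not the paper's definition of $R^{DG}_{(t)}$ or of the DG block. It assumes a GradNorm-style relative training rate, in which progress is measured as the ratio of the current loss to the initial loss, and it uses a hypothetical exponent `gamma` to control how strongly the imbalance is corrected. The function name and the loss values are made up for illustration.

```python
def gradient_scales(initial_losses, current_losses, gamma=1.0):
    """Per-modality gradient multipliers from relative learning progress.

    Hypothetical rule (not the paper's formula): a modality whose loss has
    dropped less than average gets a multiplier above 1, and vice versa.
    `gamma` controls the strength of the correction; gamma=0 disables it.
    """
    # Remaining-loss ratio: 1.0 means no progress, smaller means faster learning.
    ratios = {m: current_losses[m] / initial_losses[m] for m in initial_losses}
    mean_ratio = sum(ratios.values()) / len(ratios)
    raw = {m: (r / mean_ratio) ** gamma for m, r in ratios.items()}
    # Normalize so the multipliers average to 1 and the overall step size is kept.
    mean_raw = sum(raw.values()) / len(raw)
    return {m: v / mean_raw for m, v in raw.items()}


# Made-up loss values in which the EEG branch lags behind the other two.
initial = {"eeg": 0.70, "visual": 0.70, "fusion": 0.70}
current = {"eeg": 0.60, "visual": 0.30, "fusion": 0.30}
scales = gradient_scales(initial, current, gamma=1.0)
print(scales)
```

In a training loop, such multipliers would be applied to the gradients of each modality branch before the optimizer step, so that the slower EEG branch receives relatively larger updates.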

V-D2 The sensitivity of the AMB

The sensitivity of the AMB module is controlled by the hyperparameter $\gamma$ , which is defined in Eq. (14). As shown in Fig. 12, the AMBOKD method with $\gamma=3$ achieves the highest AUC in session 1 and the second-highest AUC in session 2, resulting in improved classification performance for imbalanced samples.

TABLE V: The subject-independent results of the baseline and state-of-the-art methods in generalization analysis

Method	Session 1		Session 2
Method	AUC/Std (%)	F1/Std (%)	AUC/Std (%)	F1/Std (%)
MLB	75.05/10.33	74.75/6.91	77.54/9.43	78.17/9.33
MKD	85.38/6.53	76.85/7.69	82.86/6.69	74.13/9.69
EMKD	86.56/6.65	77.08/7.49	83.61/7.24	74.78/9.04
CAMKD	86.57/6.82	75.64/6.97	84.02/6.97	73.03/10.38
DML	87.76/6.21	76.24/7.98	86.53/5.33	75.27/8.86
KDCL	88.13/6.44	77.75/7.32	86.91/5.29	74.98/7.92
EML	87.33/6.99	75.91/7.34	87.36/6.05	76.15/8.47
AMBOKD	88.52/5.57	80.34/7.70	87.76/5.53	78.98/7.88

V-E Generalization Analysis

To further analyze the generalization performance of the AMBOKD method, we conduct experiments at the algorithm level and the data level.

V-E1 Generalization Analysis on Algorithm Level

Comparative experiments are conducted by replacing the unimodal backbone networks used in the EEG and visual encoders. These experiments allow us to validate not only the generalization capability of the proposed AMB module in multimodal learning but also its ability to improve unimodality performance. We test the following combinations: EEGNet + EfficientNet, TSception + EfficientNet, MCGRAM + MobileNet, MCGRAM + ResNet50, EEGNet + MobileNet, and TSception + ResNet50.

As shown in Fig. 13, the unimodality models trained with the AMBOKD approach consistently achieve higher performance than the same models trained on their own in all combinations. In addition, some unimodal models, such as the EEGNet model in Fig. 13 (a) $\sim$ (b) and the ResNet50 model in Fig. 13 (f), show considerable performance improvements over their original counterparts after being dynamically trained with AMBOKD. For example, EEGNet trained under AMBOKD outperforms the original results in AUC by 7.7%, and ResNet50 trained under AMBOKD shows an improvement of 9.3% in AUC (see Fig. 13 (a), (c)). These findings indicate the superiority and generalization capabilities of AMBOKD in multimodal learning tasks.

V-E2 Generalization Analysis on Data Level

We verify the generalization of the AMBOKD method by conducting transfer experiments under few-sample conditions. In this experiment, we use the best saved model trained for armored vehicle recognition, as detailed in Section V-B1. This model is further fine-tuned using 30 real-world scene samples and is deployed to detect vehicle targets within these real scenes. The comparative results, presented in Table V, demonstrate that our method achieves superior performance in both session 1 and session 2. This case verifies the effectiveness and generalization ability of our method.

V-F Physical Verification

To further verify the practicality and effectiveness of our built system and proposed method, we carry out physical experiments in the application scenario of UAV ground control station operators, as illustrated in Fig. 14. During the system experiment, subjects are positioned in front of a computer screen with proper body alignment, focusing their gaze directly on the frontal display while performing the ESSVP paradigm. They are tasked with controlling the system and issuing appropriate commands to the UAV upon detecting abnormal displays on the UAV monitoring interface located to the participant’s left. During this interaction, the ESSVP paradigm temporarily halts the playback of visual stimuli.

In the system verification experiment, the models refined in Section V-E are employed directly to process the multimodal data and detect the target. As shown in Fig. 15, the AMBOKD method achieves the highest performance in all metrics compared with the baseline and state-of-the-art methods. These experimental findings demonstrate the effectiveness of the brain-eye-computer based object detection system in real-world settings, indicating robustness and high accuracy in practical tasks.

TABLE VI: Comparison between human, computer, and the brain-eye-computer based system

Method	AP@50:95 (%)	AR@50:95 (%)	F1@50:95 (%)	Total Time (s)
Human	78.28/18.20	67.33/20.26	72.00/18.71	1026.00/362.00
Computer	23.44/5.49	30.12/6.89	26.36/6.10	3.00/0
Brain-eye-computer	89.26/6.76	90.32/4.01	87.98/6.37	245.37/1.38

Furthermore, we compare our brain-eye-computer based system with traditional manual and computer vision approaches, as shown in Table VI. In the human-based experiment, five subjects were asked to detect and annotate vehicle targets in 60 aerial images. The results show variability in detection time and accuracy among subjects, underscoring the inefficiency and unreliability of purely manual methods in complex scenarios. In the computer-based experiment, we adopt the same training setup as for the brain-eye-computer based system: a Faster R-CNN model with a ResNet50 backbone trained on the public dataset was fine-tuned with 30 samples and then tested on 60 images. Although this method achieves rapid detection, its accuracy was notably compromised by the limited training samples. In contrast, our integrated brain-eye-computer system achieved superior performance, with a total processing time of 245.37 s (240 s for the ESSVP paradigm and 5.37 s for processing). This result demonstrates not only the efficiency of the system but also its robust generalization capabilities across different test conditions.
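The timing figures above can be cross-checked with simple arithmetic. The snippet below uses only the mean times reported in Table VI and in the text; the variable names are introduced here for illustration, and the derived ratios are not values reported by the authors.

```python
# Mean total times (s) from Table VI.
human_total = 1026.00
system_total = 245.37

# Breakdown of the system time stated in the text.
essvp_time = 240.0
processing_time = 5.37

# The two parts should add up to the reported total.
breakdown_total = round(essvp_time + processing_time, 2)
print(breakdown_total)  # 245.37

# Share of the system time spent on the ESSVP presentation, in percent.
essvp_share = round(100 * essvp_time / system_total, 2)
print(essvp_share)  # 97.81

# How many times faster the system is than the mean human annotation time.
speedup_vs_human = round(human_total / system_total, 2)
print(speedup_vs_human)  # 4.18
```

Almost all of the system time is taken by the stimulus presentation itself, so the computational processing adds little overhead.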

VI Conclusion

This study builds a brain-eye-computer based system for object detection in aerial images under few-shot conditions. This system detects suspicious targets in aerial images using a region proposal network. After obtaining the images with the region proposals, it elicits subjects’ ERP signals during target search with the ESSVP paradigm and constructs EEG-image data pairs incorporating eye movement data. These pairs are then recognized using the proposed AMBOKD method. AMBOKD extracts and fuses crucial information from EEG and visual modality features, facilitates end-to-end mutual learning, and improves adaptive multimodal interaction capability through the AMB module. Experimental results demonstrate the effectiveness and superiority of our method, as well as its ability to improve the performance of unimodality models. Lastly, the feasibility and transferability of the AMBOKD method and the brain-eye-computer based system are verified through experiments with practical scenario images. This study opens new possibilities for robust and efficient dim object detection in aerial applications.

In the future, we will further develop real-time online systems across a wider range of scenarios. We aim to design more efficient experimental paradigms and methods that take full advantage of multimodal characteristics, enhancing the effectiveness and adaptability of multimodal fusion methods.

References

[1] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun, “Deep learning for 3d point clouds: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 12, pp. 4338–4364, 2021.
[2] X. Xie, C. Lang, S. Miao, G. Cheng, K. Li, and J. Han, “Mutual-assistance learning for object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15171–15184, 2023.
[3] G. Cheng, X. Yuan, X. Yao, K. Yan, Q. Zeng, X. Xie, and J. Han, “Towards large-scale small object detection: Survey and benchmarks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 13467–13488, 2023.
[4] Z. Jia, X. Xu, J. Hu, and Y. Shi, “Low-power object-detection challenge on unmanned aerial vehicles,” Nature Machine Intelligence, vol. 4, no. 12, pp. 1265–1266, 2022.
[5] S. Gao, Y. Wang, X. Gao, and B. Hong, “Visual and auditory brain–computer interfaces,” IEEE Transactions on Biomedical Engineering, vol. 61, no. 5, pp. 1436–1447, 2014.
[6] S. Lees, N. Dayan, H. Cecotti, P. Mccullagh, L. P. Maguire, F. Lotte, and D. H. Coyle, “A review of rapid serial visual presentation-based brain–computer interfaces,” Journal of Neural Engineering, vol. 15, 2018.
[7] Z. Lan, C. Yan, Z. Li, D. Tang, and X. Xiang, “MACRO: multi-attention convolutional recurrent model for subject-independent ERP detection,” IEEE Signal Processing Letters, vol. 28, pp. 1505–1509, 2021.
[8] Z. Li, C. Yan, Z. Lan, D. Tang, and X. Xiang, “MCGRAM: Linking multi-scale cnn with a graph-based recurrent attention model for subject-independent ERP detection,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 12, pp. 5199–5203, 2022.
[9] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2018.
[10] D. Liu, W. Dai, H. Zhang, X. Jin, J. Cao, and W. Kong, “Brain-machine coupled learning method for facial emotion recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10703–10717, 2023.
[11] W. Wang, D. Tran, and M. Feiszli, “What makes training multi-modal classification networks hard?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 12 695–12 705.
[12] W. Chango, J. A. Lara, R. Cerezo, and C. Romero, “A review on data fusion in multimodal learning analytics and educational data mining,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 12, 2022.
[13] Z. Wei, H. Pan, L. Qiao, X. Niu, P. Dong, and D. Li, “Cross-modal knowledge distillation in multi-modal fake news detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 4733–4737.
[14] J. Guo, J. Zhang, S. Li, X. Zhang, and M. Ma, “Mtfd: Multi-teacher fusion distillation for compressed video action recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[15] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015.
[16] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4320–4328.
[17] W. Lin, Y. Li, Y. Ding, and H. Zheng, “Tree-structured auxiliary online knowledge distillation,” pp. 1–8, 2022.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[19] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520.
[20] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2015.
[21] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.
[22] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning, 2019, pp. 6105–6114.
[23] Y. Duan, Z. Li, X. Tao, Q. Li, S. Hu, and J. Lu, “EEG-based maritime object detection for IoT-driven surveillance systems in smart ocean,” IEEE Internet of Things Journal, vol. 7, no. 10, pp. 9678–9687, 2020.
[24] X. Zheng and W. Chen, “An attention-based bi-LSTM method for visual object classification via EEG,” Biomedical Signal Processing and Control, vol. 63, p. 102174, 2021.
[25] C.-C. Tsai and W. Liang, “Event-related components are structurally represented by intrinsic event-related potentials,” Scientific Reports, vol. 11, no. 1, p. 5670, 2021.
[26] L. Fan, H. Shen, F. Xie, J. Su, Y. Yu, and D. Hu, “Dc-tcnn: A deep model for EEG-based detection of dim targets,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 30, pp. 1727–1736, 2022.
[27] N. Bigdely-Shamlo, A. Vankov, R. R. Ramirez, and S. Makeig, “Brain activity-based image classification from rapid serial visual presentation,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 16, no. 5, pp. 432–441, 2008.
[28] S. Zhang, Y. Wang, L. Zhang, and X. Gao, “A benchmark dataset for RSVP-based brain–computer interfaces,” Frontiers in Neuroscience, vol. 14, p. 568000, 2020.
[29] C. Barngrover, A. Althoff, P. DeGuzman, and R. Kastner, “A brain–computer interface (BCI) for the detection of mine-like objects in sidescan sonar imagery,” IEEE Journal of Oceanic Engineering, vol. 41, no. 1, pp. 123–138, 2015.
[30] L. Huang, Y. Zhao, Y. Zeng, and Z. Lin, “BHCR: RSVP target retrieval BCI framework coupling with CNN by a Bayesian method,” Neurocomputing, vol. 238, pp. 255–268, 2017.
[31] R. Manor, L. Mishali, and A. B. Geva, “Multimodal neural network for rapid serial visual presentation brain computer interface,” Frontiers in Computational Neuroscience, vol. 10, p. 130, 2016.
[32] C. Du, K. Fu, J. Li, and H. He, “Decoding visual neural representations by multimodal learning of brain-visual-linguistic features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10 760–10 777, 2023.
[33] J. Zhu, C. Yang, X. Xie, S. Wei, Y. Li, X. Li, and B. Hu, “Mutual information based fusion model (MIBFM): mild depression recognition using EEG and pupil area signals,” IEEE Transactions on Affective Computing, 2022.
[34] W. Zhou, S. Dong, J. Lei, and L. Yu, “MTANet: Multitask-aware network with hierarchical multimodal fusion for RGB-T urban scene understanding,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 48–58, 2022.
[35] P. Xu, X. Zhu, and D. A. Clifton, “Multimodal learning with transformers: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12 113–12 132, 2023.
[36] Y. Li, Y. Wang, and Z. Cui, “Decoupled multimodal distilling for emotion recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 6631–6640.
[37] S. Ren, Y. Du, J. Lv, G. Han, and S. He, “Learning from the master: Distilling cross-modal advanced knowledge for lip reading,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 13 325–13 333.
[38] H. Zhou, W. Zhou, W. Qi, J. Pu, and H. Li, “Improving sign language translation with monolingual data by sign back-translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 1316–1325.
[39] N. Ding, S.-w. Tian, and L. Yu, “A multimodal fusion method for sarcasm detection based on late fusion,” Multimedia Tools and Applications, vol. 81, no. 6, pp. 8597–8616, 2022.
[40] R. Yang, S. Wang, Y. Sun, H. Zhang, Y. Liao, Y. Gu, B. Hou, and L. Jiao, “Multimodal fusion remote sensing image–audio retrieval,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 6220–6235, 2022.
[41] A. Nagrani, S. Yang, A. Arnab, A. Jansen, C. Schmid, and C. Sun, “Attention bottlenecks for multimodal fusion,” Advances in Neural Information Processing Systems, vol. 34, pp. 14 200–14 213, 2021.
[42] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang, “Structured knowledge distillation for semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2604–2613.
[43] X. Chen, Q. Cao, Y. Zhong, J. Zhang, S. Gao, and D. Tao, “Dearkd: data-efficient early knowledge distillation for vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12 052–12 062.
[44] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, 2019, pp. 4171–4186.
[45] M. Ryu, G. Lee, and K. Lee, “Knowledge distillation for bert unsupervised domain adaptation,” Knowledge and Information Systems, vol. 64, no. 11, pp. 3113–3128, 2022.
[46] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, pp. 1789–1819, 2021.
[47] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4320–4328.
[48] S. Zhang, C. Tang, and C. Guan, “Visual-to-EEG cross-modal knowledge distillation for continuous emotion recognition,” Pattern Recognition, vol. 130, p. 108833, 2022.
[49] S. You, C. Xu, C. Xu, and D. Tao, “Learning from multiple teacher networks,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1285–1294.
[50] H. Zhang, D. Chen, and C. Wang, “Confidence-aware multi-teacher knowledge distillation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 4498–4502.
[51] K. Kwon, H. Na, H. Lee, and N. S. Kim, “Adaptive knowledge distillation based on entropy,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7409–7413.
[52] Q. Guo, X. Wang, Y. Wu, Z. Yu, D. Liang, X. Hu, and P. Luo, “Online knowledge distillation via collaborative learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 020–11 029.
[53] C. Li, G. Li, H. Zhang, and D. Ji, “Embedded mutual learning: A novel online distillation method integrating diverse knowledge sources,” Applied Intelligence, vol. 53, no. 10, pp. 11 524–11 537, 2023.
[54] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
[55] G. E. Chatrian, E. Lettich, and P. L. Nelson, “Ten percent electrode system for topographic studies of spontaneous and evoked EEG activities,” American Journal of EEG Technology, vol. 25, no. 2, pp. 83–92, 1985.
[56] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
[57] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance, “EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces,” Journal of Neural Engineering, vol. 15, no. 5, p. 056013, 2018.
[58] Y. Ding, N. Robinson, S. Zhang, Q. Zeng, and C. Guan, “Tsception: Capturing temporal dynamics and spatial asymmetry from EEG for emotion recognition,” IEEE Transactions on Affective Computing, 2022.
[59] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6000–6010.
[60] X. Peng, Y. Wei, A. Deng, D. Wang, and D. Hu, “Balanced multimodal learning via on-the-fly gradient modulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 8238–8247.
[61] C. Yang, Z. An, H. Zhou, F. Zhuang, Y. Xu, and Q. Zhang, “Online knowledge distillation via mutual contrastive learning for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[62] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang, “Hadamard product for low-rank bilinear pooling,” in 5th International Conference on Learning Representations (ICLR), 2017.
[63] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” University of Toronto, 05 2012.
[64] J. Rao, X. Meng, L. Ding, S. Qi, X. Liu, M. Zhang, and D. Tao, “Parameter-efficient and student-friendly knowledge distillation,” IEEE Transactions on Multimedia, 2023.
[65] N. Passalis and A. Tefas, “Learning deep representations with probabilistic knowledge transfer,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 268–284.
[66] Y. Tian, D. Krishnan, and P. Isola, “Contrastive representation distillation,” arXiv preprint arXiv:1910.10699, 2019.
[67] W. Park, D. Kim, Y. Lu, and M. Cho, “Relational knowledge distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3967–3976.
