Adaptive Modality Balanced Online Knowledge Distillation for Brain-Eye-Computer based
Dim Object Detection

Zixing Li, Chao Yan, Zhen Lan, Dengqing Tang, Xiaojia Xiang, Han Zhou, Jun Lai

Zixing Li, Zhen Lan, Dengqing Tang, Xiaojia Xiang, Han Zhou, and Jun Lai are affiliated with the College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China (e-mail: {lizixing16, lanzhen19, xiangxiaojia, tangdengqing09, zhouhan, laijun}@nudt.edu.cn). Chao Yan is affiliated with the College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China (e-mail: [email protected]). Zixing Li and Chao Yan contributed equally to this work. (Corresponding author: Dengqing Tang.)

Abstract

Advanced cognition can be extracted from the human brain using brain-computer interfaces. Integrating these interfaces with computer vision techniques, which possess efficient feature extraction capabilities, can achieve more robust and accurate detection of dim targets in aerial images. However, existing target detection methods primarily concentrate on homogeneous data and lack efficient and versatile processing capabilities for heterogeneous multimodal data. In this paper, we first build a brain-eye-computer based object detection system for aerial images under few-shot conditions. This system detects suspicious targets using region proposal networks, evokes the event-related potential (ERP) signal in the electroencephalogram (EEG) through the eye-tracking-based slow serial visual presentation (ESSVP) paradigm, and constructs EEG-image data pairs with eye movement data. Then, an adaptive modality balanced online knowledge distillation (AMBOKD) method is proposed to recognize dim objects with the EEG-image data. AMBOKD fuses EEG and image features using a multi-head attention module, establishing a new modality with comprehensive features. To enhance the performance and robustness of the fusion modality, simultaneous training and mutual learning between modalities are enabled by end-to-end online knowledge distillation. During the learning process, an adaptive modality balancing module is proposed to ensure multimodal equilibrium by dynamically adjusting the importance weights and the training gradients of the modalities. The effectiveness and superiority of our method are demonstrated by comparing it with existing state-of-the-art methods. Additionally, experiments conducted on public datasets and system validations in real-world scenarios demonstrate the reliability and practicality of the proposed system and the designed method. The dataset and the source code can be found at: https://github.com/lizixing23/AMBOKD.

Index Terms:
object detection, EEG, multimodal learning, online knowledge distillation, adaptive modality balancing

I Introduction

In recent years, considerable progress has been made in the field of computer vision, primarily due to rapid advancements in deep learning [1, 2]. Remarkable results in different visual tasks, such as object recognition, image generation, and visual localization, have been achieved by combining deep learning with high-performance computers, well-designed neural networks, and large datasets [3]. However, accurately and robustly detecting dim objects in aerial images remains a difficult task because of factors such as cluttered backgrounds, varying observation angles, and small object scales [4]. Moreover, the performance of dim object detectors is further restricted by the limited number of training samples of sensitive objects in aerial images.

Figure 1: Data extraction and processing procedure based on adaptive modality balanced online knowledge distillation (AMBOKD) method.

A promising avenue to address the above limitations is offered by brain-computer interfaces (BCIs). This technique is able to decode brain activity in response to events, providing insights into human cognitive processes [5]. By leveraging human prior knowledge and advanced cognitive abilities, BCIs can compensate for the shortcomings of computer vision systems and achieve more robust and accurate recognition. Existing studies have demonstrated the potential of event-related potential (ERP) detection for target recognition tasks [6, 7, 8]. By detecting and interpreting ERPs elicited in electroencephalogram (EEG) signals, these works have shown effectiveness in target recognition tasks across different scenarios [6]. However, due to the low signal-to-noise ratio of EEG signals, the accuracy of EEG-based target recognition approaches still needs to be improved.

To overcome the inherent shortcomings of computer vision-based approaches and BCIs, a straightforward idea is to combine the two. Recently, multimodal learning has emerged as a powerful tool to combine the advantages of the computer vision and EEG modalities by integrating diverse data and essential information [9]. Researchers have investigated multimodal learning through diverse strategies and structures, such as early fusion [10] at the feature level and late fusion [11] employing logistic regression. However, most current studies focus on feature fusion or loss function design without thoroughly considering the relationships between modalities, which impedes mutual learning and synergistic effects among them [12]. Thus, extracting critical information from multimodal data (i.e., visual and EEG data) and improving recognition performance remains an ongoing challenge.

Knowledge distillation (KD), a widely used technique for knowledge transfer, offers a solution to the aforementioned challenge. KD provides a paradigm in which the student model learns additional information from the teacher model rather than relying solely on the ground-truth labels. This paradigm resembles human learning and has given rise to various algorithms in multimodal learning, such as cross-modal [13] and multi-teacher [14] KD, which improve performance and generalization while reducing the number of model parameters [15]. To simplify the distillation procedure and circumvent the training cost and pipeline complexity of conventional KD, online knowledge distillation (OKD) has been proposed to facilitate mutual learning between models in an end-to-end training process [16, 17]. Nevertheless, most current OKD approaches depend on a single modality and are restricted to homogeneous networks. These approaches fail to address the imbalance arising from modal differences during online learning and lack the ability to extract and integrate multimodal information.

In this study, a brain-eye-computer based object detection system is built for detecting dim objects in aerial images under few-shot conditions. The system leverages the strengths of humans in rapid cognition and of computer vision in data processing. It uses region proposal networks to detect suspicious object regions, evokes the ERP signal with the eye-tracking-based slow serial visual presentation (ESSVP) paradigm, constructs EEG-image data pairs with eye movement data, and recognizes them. In particular, an adaptive modality balanced online knowledge distillation (AMBOKD) method is proposed to process the EEG-image data and recognize the target image. As shown in Fig. 1, the experimental procedure begins with the acquisition of EEG and image data, which are input into their respective encoders. The outputs of these encoders are fused by the fusion module, whose output is treated as a third modality. Subsequently, the visual, EEG, and fusion modalities exchange roles between teacher and student models. For example, when the visual modality acts as the student model, the EEG and fusion modalities act as the teacher models. This setup facilitates mutual learning and collective optimization using the OKD method. During mutual learning, the adaptive modality balancing (AMB) module facilitates parameter optimization for all modalities by dynamically balancing the influence weights and the training optimization level of each modality. Consequently, the proposed approach is well suited for multimodal heterogeneous tasks, enabling the fusion of 2D EEG signals and 3D visual image data and leading to robust and efficient performance in dim object recognition.

To the best of our knowledge, this study is the pioneering effort in addressing the challenges of dim object detection in aerial images using heterogeneous multimodal data. The main contributions of this study can be summarized as follows:

  • A brain-eye-computer based object detection system is established to obtain the subject’s attention region image and EEG data. This system is able to fully utilize the advantages of multiple modalities and detect dim targets in aerial images under few-shot conditions.

  • An AMBOKD method is proposed to fuse the multimodal data and enable simultaneous end-to-end mutual learning for target recognition. Under this method, an AMB module is introduced to adaptively balance the influence weights and training gradients of each modality, so as to ensure comprehensive parameter optimization.

  • A multimodal ESSVP dataset is built with 224×224 size RGB images and 59-channel EEG data. This dataset contains more than 13,000 paired samples of EEG and images, presenting promising prospects for the further development of multimodal learning methods in target detection tasks.

  • The superiority, robustness, and generalizability of our method are demonstrated by comparing it with state-of-the-art methods in a series of experiments. Our method not only improves the performance of the fusion model but also enhances the capabilities of both EEG and visual models, highlighting its potential in the field of multimodal learning.

This paper is structured as follows: Section II discusses the work related to this study. Section III presents the design of the built system and the ESSVP dataset in detail. Section IV illustrates the proposed target recognition method. Section V presents the experimental results and analyzes the performance. Section VI draws conclusions and outlines possible directions for future work.

II Related Work

II-A Brain-Computer Based Object Recognition

Computer-based target recognition approaches have been fully developed because of the rapid development of deep learning. The well-known neural network algorithms commonly employed in this domain include ResNet [18], MobileNetV2 [19], VGG [20], DenseNet [21], and EfficientNet [22]. Notably, ResNet [18], which learns residual mappings, enables the network to effectively capture and represent complex features, leading to high-accuracy classification performance. EfficientNet [22] focuses on achieving high performance with computational efficiency through a meticulously designed network architecture and scaling method. However, the existing algorithms still suffer from false alarms and missed detections. These algorithms are highly dependent on the dataset due to the influence of factors such as the target environment, training data, and noise.

In recent years, EEG-based object recognition approaches have attracted considerable attention because of their ability to extract human cognition [23, 24]. These approaches depend on the analysis of ERP signals generated when subjects identify a target, providing enhanced robustness in intricate environments and minimizing the need for large amounts of image data [25, 7, 26]. The rapid serial visual presentation (RSVP) paradigm evokes ERP signals employed for target recognition by presenting visual image sequences as stimuli. For instance, Bigdely-Shamlo et al. [27] conducted an RSVP experiment where subjects were instructed to identify images containing airplanes at a frame rate of 12 Hz. They achieved high accuracies on 128-channel EEG data using independent component analysis (ICA), and reported promising results for single-trial classification. Lan et al. [7] proposed a multi-attention convolutional recurrent model for ERP detection, facilitating the identification of target images within RSVP sequences [28]. In addition, Fan et al. [26] introduced an asynchronous visual evoked paradigm (AVEP) based on RSVP, and proposed a deep learning method for detecting dim objects in satellite images. However, EEG signals vary across individual subjects and are prone to noise interference during signal acquisition, so EEG-only recognition still requires further improvement.

To harness the benefits of EEG and computer vision approaches, the fusion of EEG and image features has been investigated to accomplish efficient and robust object recognition. Barngrover et al. [29] developed a specific brain-computer interface (BCI) system that addresses the challenge of mine target recognition in side-scan sonar images by fusing image and EEG signal features. Huang et al. [30] proposed a Bayesian HV-CV retrieval framework (BHCR) that combines human and computer vision using Bayesian approaches. The RSVP experimental paradigm was used to recognize targets, leading to effective retrieval of image databases. Minor et al. [31] introduced a multimodal neural network that combines EEG and image features. By fusing these modalities, they achieved high classification results in object recognition tasks. These studies indicate the potential of integrating EEG and visual modalities to enhance the efficiency and robustness of recognition systems.

In line with this, our study aims to achieve dim object detection in aerial images by constructing the brain-eye-computer based object detection system with the ESSVP paradigm, which is built upon the RSVP and AVEP paradigms. This system detects suspicious object regions with region proposal networks and employs eye-tracking technology to simultaneously extract the subject's attention region from the image and the corresponding EEG signal fragment. These two modal data sources then serve as inputs to the multimodal fusion approach proposed in this study.

II-B Multimodal Learning

Modeling and analyzing data from various sensory modalities, including images, speech, text, and EEG signals, are key aspects of multimodal learning [32, 33, 34]. Significant attention has been garnered by multimodal learning in the fields of artificial intelligence and machine learning in recent years, resulting in remarkable progress [9, 35]. Currently, different multimodal applications, such as human-computer interaction [33, 36], natural language translation [37, 38], and computer vision [39, 40, 34], are being applied in our lives.

Due to the development of deep learning, researchers propose novel loss functions or training approaches to optimize multimodal models. For instance, Nagrani et al. [41] introduced a transformer-based architecture that employs fusion bottlenecks for modality fusion at multiple layers. This method enables each modality to capture crucial information while efficiently sharing necessary information, leading to enhanced fusion performance and reduced computational cost. Du et al. [32] modeled the relationships between brain, visual, and linguistic features using multimodal deep generative models. They maximized both intra-modality and inter-modality mutual information regularization terms. Their approach addresses limitations such as the under-exploitation of multimodal knowledge and the scarcity of training data. Yang et al. [40] proposed a multimodal fusion approach for remote sensing image-audio retrieval tasks. They converted audio inputs to text and fused them with the text information to obtain a fusion representation. They optimized the common retrieval space using triplet loss, semantic loss, and consistency loss. Experimental findings on multiple datasets indicate the effectiveness of their approach.

Most of the current methods in this field primarily focus on feature fusion and loss function design when developing multimodal fusion algorithms, thereby overlooking the potential benefits of knowledge transfer and mutual learning between different modalities. In contrast, we propose a dynamic hybrid fusion approach that integrates multimodal features and dynamically trains all branches in the model using an OKD method. This mutual learning approach for multimodal fusion aims to fully leverage the benefits of the visual and cognitive domains and extract crucial information effectively.

II-C Knowledge Distillation

Figure 2: The data acquisition process of the brain-eye-computer based object detection system. First, the captured image data is pre-processed by the feature encoder and region proposal network to obtain the image with region proposal. Then, the ESSVP paradigm is constructed to elicit the ERPs in EEG signals. After that, the EEG cap collects the EEG signals and sends them to the computer through the signal amplifier. At last, the EEG and image data are combined as the paired data with the help of the eye tracking data and time-stamp synchronization signal from the eye tracker and the synchronization box.

KD has attracted considerable attention because of its capacity to compress models. This technique involves training a small student model by imitating the output distribution of a large teacher model [15]. KD has been extensively employed in different fields, including computer vision [42, 43], natural language processing [44, 45], and emotion recognition [36]. These approaches can be classified into offline distillation and online distillation according to the distillation scheme [46]. Offline distillation typically follows a two-stage training process where a trained teacher model guides the training of a student model. However, this approach is expensive and knowledge transfer is unidirectional. In contrast, online approaches enable end-to-end training, allowing for simultaneous learning of teacher and student models while reducing training cost [47].

In the field of multimodal learning, attention has been drawn to offline knowledge distillation due to its knowledge transfer capabilities. For instance, Zhang et al. [48] proposed a visual-to-EEG cross-modal KD method that enhances continuous EEG prediction using dark knowledge from visual modality. Multi-teacher KD methods, including AvgMKD [49], CA-MKD [50], and EMKD [51], have been proposed to combine knowledge from various modalities and train the student model.

Current online knowledge distillation approaches have achieved significant progress with multimodal homogeneous data. For instance, Zhang et al. [47] introduced deep mutual learning (DML), demonstrating that student models can learn from each other through their predictions. Guo et al. [52] presented an OKD approach using collaborative learning with a weighted ensemble logit distribution. Li et al. [53] proposed an innovative embedded OKD approach that surpasses existing OKD approaches in image classification tasks. This approach leverages ensemble information, overall feature representations from peer networks, and logits to fully exploit the potential of networks.

However, there remains a research gap in online knowledge distillation methods for multimodal heterogeneous data, due to the challenges in extracting and fusing heterogeneous data features. Thus, we propose the AMBOKD method, which extracts features from both 2D EEG data and 3D image data and treats the features output by the fusion model as a separate modality for mutual learning. This method adaptively adjusts the influence weights and the optimization level of each modality, leading to full optimization of the parameters and superior performance of the fusion model, thereby opening up new application scenarios for multimodal OKD.

III System Design and Data Collection

III-A System Design

To address the problem of dim object detection in aerial images under few-shot conditions, we propose a brain-eye-computer based object detection system. As shown in Fig. 2, the system is designed according to EEG, vision, and eye movement properties. First, the computer processes the image data through the feature extractor and the region proposal network (RPN) to obtain the image with pre-detection boxes (see Section III-B). Subsequently, this image is presented on the display screen through the eye-tracking-based slow serial visual presentation (ESSVP) paradigm (see Section III-C), which in turn induces the corresponding EEG signals in the subject. Using the real-time eye movement data recorded by the eye tracker and the time-stamp synchronization signal of the trigger box, the computer simultaneously extracts the EEG data recorded by the EEG cap and the image of the subject’s attention area at the corresponding time. After that, the paired data are preprocessed and recognized by the proposed AMBOKD method (see Section IV). Finally, according to the eye movement data, the position of the target image in the original image is found for target localization, so as to realize the target detection task.

The hardware of the system primarily serves the roles of sensor data acquisition and algorithm execution, comprising a 30-inch display with a 2K resolution, a computer equipped with an NVIDIA GeForce GTX 2070s GPU, a 64-channel EEG cap, signal amplifier, trigger box, router, eye tracker, and various data cables. The display and computer are key components, used for the playback of the ESSVP paradigm and the real-time storage and processing of EEG, eye movement, and image data. The display is positioned 50 to 70 cm directly in front of the subjects to ensure maximal elicitation of ERP signals and more accurate collection of eye movement data. Positioned directly below the display at a 15-degree angle towards the user’s eyes, the eye tracker captures gaze positions and transmits them to the computer at a frequency of 50 Hz. EEG data is collected via 64 wet electrodes in the EEG cap, with the signals amplified by a connected amplifier before being transmitted to the computer. The trigger box synchronizes various types of event data (e.g., auditory, visual, and program outputs) with neurophysiological data with high event precision (<1 ms), serving as a cornerstone for subsequent data analysis by synchronizing EEG data with event-triggered label signals.

Figure 3: The images with region proposals.

III-B Suspicious Region Detection

The RPN is an integral component in deep learning and computer vision, particularly in object detection tasks [54]. Operating on a feature map, the RPN slides a window across locations and, at each location, places anchor boxes of various scales and aspect ratios. The network evaluates the likelihood of each anchor enclosing an object and adjusts its position and dimensions accordingly, generating precise region proposals.

As illustrated in Fig. 2, the designed system uses a pretrained ResNet50 feature encoder to obtain feature maps of image data, and employs RPN to generate region proposals, which are denoted with black boxes. The anchors used have scales of 16, 32, 48, 64, and 80, with aspect ratios of 0.5, 0.75, 1.0, 1.5, and 2.0. We employ UAVs to capture images of toy models in open outdoor settings, forming the training set. Meanwhile, the test set comprises images taken in complex real-world environments. The effectiveness of the employed feature extraction method and the RPN network is validated by the results, as depicted in Fig. 3. It is evident that, despite variations in scene environments and targets, the RPN network is capable of precisely delineating suspicious objects, thereby effectively supporting the continuation of subsequent experiments.
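The anchor configuration above can be sketched in a few lines. This is only an illustration, not the system's actual implementation: the function name `make_anchors` is hypothetical, and it assumes that each scale is the side length of a square reference box and that each aspect ratio is height divided by width with the area held constant, which is one common convention.

```python
import math

def make_anchors(cx, cy,
                 scales=(16, 32, 48, 64, 80),
                 ratios=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred on (cx, cy).

    Assumptions (not stated in the paper): a scale s is the side of a
    square reference box, a ratio r is height / width, and the area of
    every anchor at scale s stays s * s.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)  # wider box when r < 1
            h = s * math.sqrt(r)  # taller box when r > 1
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(112.0, 112.0)
print(len(anchors))  # 5 scales x 5 ratios = 25 anchors per location
```

With five scales and five ratios, every sliding-window location contributes 25 candidate boxes that the RPN then scores and refines.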

III-C ESSVP Paradigm

To elicit ERPs more efficiently and obtain the EEG-image data, we propose the ESSVP paradigm based on the RSVP [28] and AVEP [26] experimental paradigms. In this paradigm, visual stimuli are presented at a slow and controlled pace, giving the subjects sufficient time to search for specific targets. By integrating eye-tracking technology, we can accurately track the observer’s gaze and determine the specific areas of interest during the dim target recognition task.

Figure 4: Examples of stimulus materials.

The ESSVP paradigm first presents the experiment guidance for 1 min, and then the target example is shown to the subject for 3 to 4 s, as illustrated in Fig. 2. In the formal experiment, participants view 16 sequences of stimuli, each containing 50 images. These sequences are evenly divided into two sessions according to the region marking method. In each session, the first six sequences display images of toy models such as armored vehicles and airplanes, taken in simple scenes, with each image shown for 3 s. The last two sequences in each session present real images of real models, taken in complex scenes, with each image displayed for 4 s. The target in the first six sequences is the armored model, whereas in the last two sequences, the target is the vehicle. The different types of images used in the ESSVP paradigm are illustrated in Fig. 4.

The probability of the target image being included in each sequence is 40%. One or two nontarget images are randomly inserted between consecutive target images to avoid the attentional blink caused by successive targets. During the stimulus presentation, we instruct subjects to actively search for the target among the candidate regions in the image, which are generated by the region proposal network in advance. Each candidate region must be fixated for 0.5 s. To prevent excessive visual fatigue due to prolonged task engagement, subjects are allowed to rest after each sequence, and the duration of the rest is determined by the subjects themselves.

The eye tracker continuously records the eye movement data of the subjects throughout the experiment. Specifically, the candidate box and image sequence data are recorded if the subject fixates on a candidate area for more than 0.3 s. In addition, the fixation event is sent to the EEG signal acquisition system so that the EEG signals of the subject during the corresponding time can be recorded. Finally, the EEG-image data for the dim object task are generated by collecting the EEG signals captured during the fixation periods and the image data from the subject’s attention area.
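The dwell rule described here can be illustrated with a minimal sketch. It assumes gaze samples arrive at the 50 Hz rate of the eye tracker and that candidate boxes are axis-aligned rectangles; the function name `detect_fixations`, the box format, and the choice to report a fixation once per uninterrupted dwell are illustrative assumptions, not details from the paper.

```python
def detect_fixations(gaze, boxes, fs=50, min_dwell_s=0.3):
    """Return (box_name, onset_index) for each dwell longer than min_dwell_s.

    gaze  : list of (x, y) gaze samples recorded at fs Hz
    boxes : dict mapping a box name to (x1, y1, x2, y2)
    A dwell is a run of consecutive samples inside the same box; it is
    reported once, as soon as its duration exceeds the threshold.
    """
    def hit(x, y):
        for name, (x1, y1, x2, y2) in boxes.items():
            if x1 <= x <= x2 and y1 <= y <= y2:
                return name
        return None

    events = []
    current, onset, reported = None, 0, False
    for i, (x, y) in enumerate(gaze):
        name = hit(x, y)
        if name != current:  # gaze moved to another box or left all boxes
            current, onset, reported = name, i, False
        dwell_s = (i - onset + 1) / fs
        if current is not None and not reported and dwell_s > min_dwell_s:
            events.append((current, onset))
            reported = True
    return events

boxes = {"A": (0, 0, 10, 10), "B": (20, 20, 30, 30)}
# 15 samples in A (0.30 s, not longer than the threshold), then 20 in B.
gaze = [(5, 5)] * 15 + [(50, 50)] * 3 + [(25, 25)] * 20
print(detect_fixations(gaze, boxes))
```

In a setup like the one described, each reported onset would be the moment at which a fixation event is forwarded to the EEG acquisition system so that the matching EEG segment can be cut later.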

Figure 5: The experimental configuration of ESSVP.

The experimental environment for ESSVP is illustrated in Fig. 5. Specifically, the EEG signals are sampled at a rate of 1000 Hz and collected through 64 wet electrodes placed on the EEG cap according to the 10-20 standard [55]. The impedance of these electrodes is maintained below 10 kΩ. Throughout the experiment, subjects are seated comfortably in a quiet environment, positioned approximately 60 cm away from the display screen, facing the displayed image. To ensure signal quality, we instruct participants to remain still and minimize body and head movements while observing the visual presentation paradigm. This directive aims to minimize potential noise interference in the EEG signals.

III-D Data Collection and Preprocessing

The ESSVP dataset used in this study includes EEG data from 10 subjects, all of whom are college students aged between 22 and 26, with normal vision and no history of mental illness. All subjects were provided with a clear understanding of the experimental procedure, task requirements, and brief information about the target characteristics before the experiment. In addition, the subjects signed an informed consent form. Ethical approval for the experiment was obtained from the relevant committee.

This dataset consists of 13,405 samples, including 3,880 positive samples with the target image and 9,525 negative samples without the target image. Each sample includes EEG data from 59 electrode channels, as 5 of the 64 electrodes are redundant and therefore excluded. The EEG data are filtered using a 2-30 Hz bandpass filter and downsampled from the original 1000 Hz to 250 Hz. The data are then segmented into 1.2 s samples (from -500 ms to 700 ms) based on the trigger. Baseline correction is applied by using the data from the first 200 ms interval as a reference and subtracting its mean activity level. Each image is centered on the fixated candidate box and has a minimum size of 150×150 pixels. All images are resized to 224×224 pixels to standardize the input for the image network. The images in the validation set are preprocessed by introducing noise to simulate real-world conditions and evaluate the algorithm’s robustness, as shown in Fig. 6. Specifically, one-third of the images are augmented with 0.2 Gaussian noise and another third of the images are augmented with 0.2 salt and pepper noise.

Figure 6: The noise effect on image data of validation set.
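The validation-set corruption can be sketched as follows. The text does not say what "0.2" measures, so this sketch assumes images scaled to [0, 1], a Gaussian standard deviation of 0.2, and a salt and pepper corruption fraction of 0.2 of the pixels; all function names are hypothetical and the random assignment of images to the three groups is likewise an assumption.

```python
import numpy as np

def add_gaussian_noise(img, sigma=0.2, rng=None):
    """Add zero-mean Gaussian noise (assumed std = sigma) and clip to [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_salt_pepper_noise(img, amount=0.2, rng=None):
    """Set an assumed fraction `amount` of pixels to 0 (pepper) or 1 (salt).

    img has shape (channels, height, width); all channels of a corrupted
    pixel are set together.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    mask = rng.random(img.shape[-2:])    # one draw per pixel
    out[:, mask < amount / 2] = 0.0      # pepper
    out[:, mask > 1 - amount / 2] = 1.0  # salt
    return out

def corrupt_validation_set(images, rng=None):
    """One third Gaussian, one third salt and pepper, the rest left clean."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(images))
    third = len(images) // 3
    out = images.copy()
    for i in order[:third]:
        out[i] = add_gaussian_noise(images[i], rng=rng)
    for i in order[third:2 * third]:
        out[i] = add_salt_pepper_noise(images[i], rng=rng)
    return out

images = np.full((9, 3, 224, 224), 0.5)
noisy = corrupt_validation_set(images, np.random.default_rng(0))
print(noisy.shape)
```

Passing an explicit `np.random.Generator` keeps the corrupted validation set reproducible across runs.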

The EEG data are represented as a set of labeled samples $\left\{\left(M_{e}^{i}\in\mathbb{R}^{59\times 300}\right)\mid i=1,2,\ldots,N\right\}$ after the preprocessing steps. Each sample comprises a matrix with dimensions 59 (number of electrode channels) by 300 (number of time points), where the time points correspond to the segmented 1.2 s intervals that have undergone the bandpass filtering and baseline correction steps.
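To make the $59\times 300$ shape concrete, the following sketch cuts one epoch from a continuous recording. It assumes the recording has already been band-pass filtered to 2-30 Hz (the filter itself is omitted), uses plain decimation for the 1000 Hz to 250 Hz step, and takes the trigger as a sample index at the original rate; the function name `extract_epoch` and these conventions are illustrative assumptions rather than the authors' pipeline.

```python
import numpy as np

def extract_epoch(eeg, trigger, fs_in=1000, fs_out=250,
                  t_min=-0.5, t_max=0.7, baseline_s=0.2):
    """Cut one baseline-corrected epoch around a trigger.

    eeg     : (channels, samples) array at fs_in Hz, assumed to be
              band-pass filtered already (2-30 Hz in the paper)
    trigger : sample index of the fixation event at fs_in Hz
    """
    step = fs_in // fs_out
    eeg_ds = eeg[:, ::step]  # naive decimation, acceptable only after low-pass filtering
    trig_ds = trigger // step
    start = trig_ds + int(round(t_min * fs_out))
    stop = trig_ds + int(round(t_max * fs_out))
    if start < 0 or stop > eeg_ds.shape[1]:
        raise ValueError("epoch window falls outside the recording")
    epoch = eeg_ds[:, start:stop]
    n_base = int(round(baseline_s * fs_out))  # first 200 ms of the epoch
    baseline = epoch[:, :n_base].mean(axis=1, keepdims=True)
    return epoch - baseline

rng = np.random.default_rng(0)
recording = rng.standard_normal((59, 5000))  # 5 s of synthetic 59-channel data
epoch = extract_epoch(recording, trigger=2000)
print(epoch.shape)  # (59, 300): 1.2 s at 250 Hz
```

The window from -500 ms to 700 ms spans 1.2 s, which at 250 Hz gives the 300 time points stated in the text.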

The image data are represented as a set of samples $\left\{\left(M_{v}^{i}\in\mathbb{R}^{3\times 224\times 224}\right)\mid i=1,2,\ldots,N\right\}$. Each sample comprises a tensor with dimensions 3 (number of color channels) by 224 (height) by 224 (width), corresponding to the size of the input images in the dataset.

The labeled EEG and image modality samples were collected as paired samples, forming the sample pairs $X=\left\{\left(M_{v}^{i},M_{e}^{i},y_{i}\right)\mid i=1,2,\ldots,N\right\}$ in the constructed ESSVP dataset, where $N$ is the total number of paired samples and $y_{i}$ is the corresponding ground truth label. These sample pairs serve as the foundation for the analysis and assessment of the proposed approach for dim target recognition in aerial images.

IV Methodology

Figure 7: Overview of the AMBOKD method for dim object recognition and multimodal data processing. First, the visual encoder and EEG encoder are used to extract preliminary representations from the two domains. Following this, the fusion domain incorporates the multi-head self-attention module to process the features from both domains. Subsequently, logits are calculated from three domains, facilitating the computation of the total loss and dynamic learning progress ratios for each modality. Finally, the gradients of each modality are dynamically modulated by the other two modalities and the truth label.

The framework of the adaptive modality balanced online knowledge distillation (AMBOKD) method is illustrated in Fig. 7. Initially, AMBOKD receives the paired data $X$ and extracts visual and EEG features through the visual encoder and the EEG encoder, respectively. These features are subsequently passed through the multi-head self-attention module to generate the fusion features, and the features from each domain are sent to their respective classifiers. The classifier outputs of the visual, EEG, and fusion domains are used to compute the losses of the three modalities with the cross-entropy (CE) and KD loss functions. Subsequently, the three modalities take turns as students and teachers to compute the total loss, the weight of the teacher influence, and the dynamic student gradient for parameter updating. Feature extraction, feature fusion, online knowledge distillation, and adaptive modality balancing are detailed as follows.

IV-A Extraction of The Modality Representation

The EfficientNet architecture [22] is employed as the visual encoder and MCGRAM [8] as the EEG encoder to extract the specific information of the visual images and EEG signals in dim target recognition. The focus of this initial extraction step is to capture the pertinent information from each modality effectively. The proposed approach maximizes the extraction of valuable information from visual images and EEG signals by designing dedicated networks for each modality. In addition, the logits obtained from the EEG and visual domains denote the independent recognition capability of the uni-encoder model.

IV-A1 Visual Encoder

EfficientNet [22], a well-designed convolutional neural network, has demonstrated outstanding advancements in accuracy and computational efficiency. This achievement is attributed to the effective balance of network depth, width, and resolution, resulting in enhanced recognition performance. In this study, EfficientNet-B0 was used because of its high efficiency, achieving comparable accuracy to ResNet-50 while requiring fewer parameters and floating point operations for the same input size.

EfficientNet-B0 consists of three main components: convolutional layers, batch normalization, and the mobile inverted bottleneck convolution (MBConv), which is inspired by MobileNet [19]. MBConv comprises three types of layers: convolutional layers, depth-wise convolutional layers, and squeeze-and-excitation networks (SENet) [56]. The convolutional layers expand and compress channels, the depth-wise convolutional layers reduce the parameter count, and SENet models channel relationships by assigning importance weights to the individual channels.

The parameters of EfficientNet in AMBOKD were initialized with weights pre-trained on the ImageNet dataset, so that the knowledge gained from ImageNet can be leveraged for our specific task.

With EfficientNet as the visual encoder, an effective representation of the visual images can be extracted directly. The representation of the visual domain is computed as follows:

$$F_v = \operatorname{EfficientNet}\left(M_v\right). \tag{1}$$

IV-A2 EEG Encoder

MCGRAM [8] is a compact convolutional neural network specifically designed for EEG signals, which comprises three primary components: the frequency encoder, spatial encoder, and temporal encoder.

The frequency encoder can effectively learn frequency-related features and capture the spectral characteristics of EEG signals by employing specific kernels in the multi-scale convolution module. Spatial representations are learned by the spatial encoder using a graph convolution module, which exploits the inherent spatial relationships between the electrode channels and thus enables MCGRAM to capture the spatial patterns present in the EEG data. To extract temporal features and obtain the final global representation of the EEG modality, the temporal encoder uses a two-layer long short-term memory network and a self-attention module. This combination allows the model to capture temporal dependencies and crucial temporal patterns in the EEG signals.

In summary, MCGRAM can capture and encode relevant information from EEG signals by cascading frequency, spatial, and temporal blocks. This results in a comprehensive and informative representation for further analysis and classification tasks. The representation of the EEG modality is computed as:

$$F_e = \operatorname{MCGRAM}\left(M_e\right). \tag{2}$$

It should be noted that alternative algorithms can be employed as substitutes for the visual and EEG encoders. For example, algorithms such as ResNet-50 [18] and MobileNetV2 [19] can be used in the visual encoder, whereas EEGNet [57] and TSception [58] are viable options in the EEG encoder. The superiority and generalization of the AMBOKD approach proposed in this study can be effectively demonstrated by incorporating these alternative algorithms and conducting subsequent experimental validations (see Section V-E).

IV-B Fusion of Modality Representation

Figure 8: Structure of the multi-head self-attention module for EEG and visual features fusion.

We develop a fusion model to integrate information from both the EEG and visual modalities. This model is treated as a new modality that collaborates with the original modalities in the mutual learning process. To focus on the most important information across various representation subspaces [59], a multi-head self-attention module is employed within the fusion model. This module facilitates the creation of a global representation that captures the combined knowledge.

The multi-head self-attention module is designed to extract and align crucial information from EEG and visual modality features, as illustrated in Fig. 8. Dynamically learned weights are used to weigh and fuse the features of both modalities to obtain global fusion features.

Specifically, a fully connected (FC) layer is used to align the EEG feature $F_e \in \mathbb{R}^{f^e}$ and the visual feature $F_v \in \mathbb{R}^{f^v}$ of each data pair. The aligned features are then concatenated to obtain the preliminary fused feature $F_c \in \mathbb{R}^{f^c}$. Three sets of FC layers independently transform $F_c$ into the queries $Q$, keys $K$, and values $V$ for the attention layer of the $j$-th head. The product of $Q$ and $K^{\top}$ is scaled and then normalized with the softmax function to obtain the attention scores $A \in \mathbb{R}^{f^c \times f^c}$ between the two modalities:

$$A = \operatorname{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_l}} \right), \tag{3}$$

where $\sqrt{d_l}$ is employed to scale the matrix multiplication result. The output of a single attention head is obtained by multiplying the attention scores $A$ with the values $V$:

$$H_j = A_j V_j. \tag{4}$$

Finally, to obtain the final fusion features $F_f$, the outputs of all heads are concatenated and passed through an FC layer with a softmax function:

$$F_f = \operatorname{softmax}\left( \operatorname{FC}\left( {\|}_{j}^{J} \left( H_1, H_2, \dots, H_J \right) \right) \right), \tag{5}$$

where $\|$ denotes the concatenation operation and $J$ represents the number of heads.
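To make Eqs. (3) to (5) concrete, the following is a minimal sketch of one attention head and the head concatenation, written with the Python standard library only. It is not the authors' implementation: the helper names (`attention_head`, `concat_heads`), the toy matrices, and the choice to treat each row of $Q$, $K$, and $V$ as one position are assumptions made for illustration, and the FC projections and the final FC-plus-softmax step of Eq. (5) are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def attention_head(q, k, v):
    """One head: A = softmax(Q K^T / sqrt(d_l)) row-wise, H = A V."""
    d_l = len(q[0])  # per-head feature width used for scaling
    scores = matmul(q, transpose(k))
    a = [softmax([s / math.sqrt(d_l) for s in row]) for row in scores]
    return a, matmul(a, v)

def concat_heads(heads):
    """Concatenate head outputs along the feature axis (the || operator)."""
    return [sum((head[i] for head in heads), []) for i in range(len(heads[0]))]

# Toy inputs (hypothetical values): identical keys give uniform attention.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[2.0, 4.0], [6.0, 8.0]]
a, h = attention_head(q, k, v)
fused = concat_heads([h, h])  # two heads reused here only to show the shape
```

Because both keys are identical in this toy input, every row of `a` is uniform and each row of `h` is the column-wise mean of `v`, which is an easy sanity check for the softmax normalization.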

The features obtained from the three domains are classified separately using their respective classifiers. For example, considering the EEG domain, the classification process can be described as follows:

$$G_e = W_e F_e + b_e, \tag{6}$$

where $W_e$ denotes the weights and $b_e$ the biases of the EEG classifier.

IV-C Mutual Learning with Online Knowledge Distillation

Figure 9: Strategy of mutual learning with online knowledge distillation (OKD).

The OKD approach enables the teacher model and the student model to influence and adapt to each other during the training process. This dynamic interaction allows the teacher model to adjust the knowledge it transfers, thereby improving the student model’s performance. By employing the OKD approach to facilitate mutual learning between the modalities, different modalities can learn from each other and progress together throughout the training process, fully leveraging the strengths and improving the performance of each modality. Through iterative optimization of each modality, the optimal fusion model for the entire framework is obtained by incorporating the knowledge from all three modalities.

The training process starts by inputting the logits of the three domains, as illustrated in Fig. 9. Each modality is then treated as the student model in turn. For each student modality, its CE loss $\mathcal{L}_{CE}^{S}$ and the KD losses $\mathcal{L}_{KD}^{T_a}$ and $\mathcal{L}_{KD}^{T_b}$ from the two teacher modalities are calculated. The CE loss $\mathcal{L}_{CE}^{S}$ quantifies the discrepancy between the predictions of the student model and the true labels, and is computed as follows:

$$\mathcal{L}_{CE}^{S} = -\frac{1}{N} \sum_{i} \left[ y_i \log\left(p_i\right) + \left(1 - y_i\right) \log\left(1 - p_i\right) \right], \tag{7}$$

where $N$ is the number of samples in a batch, and $p_i$ represents the probability assigned by the student model to sample $i$ belonging to the positive class. Although the CE loss provides valuable learning signals, it is insufficient for enabling the modalities to learn from each other. To address this limitation, the KD loss quantifies the dissimilarity between the teacher and student models, thereby facilitating the distillation of knowledge from the teacher to the student. Given a teacher model, the KD loss is computed using the Kullback-Leibler (KL) divergence:

$$\mathcal{L}_{KD}^{T} = \frac{1}{N} \sum_{i} q_i \log \frac{q_i}{p_i}, \tag{8}$$

where $q_i$ denotes the probability assigned by the teacher model to sample $i$ belonging to the positive class.

By combining the CE loss and the KD losses, the overall loss $\mathcal{L}_{total}$ for each modality is obtained as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{CE}^{S} + \alpha \tau^{2} \mathcal{L}_{KD}^{T_a} + \beta \tau^{2} \mathcal{L}_{KD}^{T_b}, \tag{9}$$

where $\tau$ is the distillation temperature. $\alpha$ and $\beta$ are the interaction weights that determine the contribution of distillation from the two teacher models and balance the magnitude difference between the CE loss and the KD losses.
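The loss in Eq. (9) and the rotation of student and teacher roles can be sketched as follows. This is an illustrative example under stated assumptions, not the released implementation: it assumes two-class logits, sums the KL divergence over both classes of temperature-softened distributions (a common KD convention, whereas Eq. (8) is written in terms of the positive-class probability), treats the teacher outputs as constants, and uses hypothetical logits with $\alpha = \beta = 1$ and $\tau = 2$.

```python
import math

def softmax(logits, tau=1.0):
    """Softmax of logits divided by temperature tau."""
    m = max(z / tau for z in logits)
    exps = [math.exp(z / tau - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ce_loss(batch_logits, labels):
    """Mean cross-entropy; with two classes it equals the binary form of Eq. (7)."""
    total = 0.0
    for logits, y in zip(batch_logits, labels):
        total -= math.log(softmax(logits)[y])
    return total / len(labels)

def kd_loss(student_logits, teacher_logits, tau):
    """Mean KL(teacher || student) on temperature-softened distributions."""
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        p, q = softmax(s, tau), softmax(t, tau)
        total += sum(qk * math.log(qk / pk) for qk, pk in zip(q, p))
    return total / len(student_logits)

def total_loss(student, teacher_a, teacher_b, labels, alpha=1.0, beta=1.0, tau=2.0):
    """CE plus the two KD terms, each scaled by its weight and tau squared."""
    return (ce_loss(student, labels)
            + alpha * tau ** 2 * kd_loss(student, teacher_a, tau)
            + beta * tau ** 2 * kd_loss(student, teacher_b, tau))

# Hypothetical logits for a batch of two samples from each modality.
logits = {
    "visual": [[0.2, 1.1], [1.3, -0.4]],
    "eeg": [[-0.1, 0.3], [0.2, 0.1]],
    "fusion": [[0.0, 1.5], [1.0, -0.8]],
}
labels = [1, 0]

# Each modality takes a turn as the student; the other two act as teachers.
losses = {}
for student in logits:
    teacher_a, teacher_b = [m for m in logits if m != student]
    losses[student] = total_loss(logits[student], logits[teacher_a],
                                 logits[teacher_b], labels)
```

In an autograd framework the teacher logits would typically be detached when a modality acts as a teacher, so that each total loss only updates its own student; whether the authors do so is not stated in this section.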

IV-D Mutual Learning with Adaptive Modality Balancing

To address imbalanced optimization [60] in the learning process of each modality, an adaptive modality balancing (AMB) module for mutual learning with OKD is proposed. The module contains two blocks: dynamic weights for the KD losses and dynamic ratios for the backward gradients, detailed as follows.

IV-D1 Dynamic Weights for KD Losses

In the online knowledge distillation training process, the KD loss from the teacher model encapsulates the distilled knowledge. Most existing methods [49, 61, 53] compute the total training loss by adding the cross-entropy loss to the knowledge distillation loss, typically setting $\alpha$ and $\beta$ to 1 in Eq. (9). However, due to modality differences, such equal weighting fails to effectively capture the superior knowledge from the teacher model. Furthermore, manually setting a fixed distillation loss weight often entails extensive experimentation and significant time costs.

To address the challenges mentioned above, we introduce a weight modulation block that dynamically calculates the weights of the KD losses from different teacher models, so that knowledge is extracted efficiently during multimodal mutual learning. The entire workflow is described in detail in Algorithm 1. For $a < b$, the saturation function $\text{Sat}(\cdot)$ is defined as $\text{Sat}(x, a, b) = a$ if $x \leq a$, $\text{Sat}(x, a, b) = x$ if $a \leq x \leq b$, and $\text{Sat}(x, a, b) = b$ if $x \geq b$. The parameters $\alpha_{\min}$ and $\alpha_{\max}$ denote the lower and upper bounds of $\alpha$, and $\beta_{\min}$ and $\beta_{\max}$ are the corresponding bounds of $\beta$.

Algorithm 1 Dynamic weights modulation process
Require: Training dataset $X$, iteration number $T$, batch size $B_m$, knowledge distillation temperature $\tau$
  for $t = 0, 1, \ldots, T$ do
     Sample minibatch $B_t$ from $X$ with batch size $B_m$
     Feed-forward the batched data $B_t$ to the model
     Compute the CE losses $\mathcal{L}_{CE_{(t)}}^{S}$, $\mathcal{L}_{CE_{(t)}}^{T_a}$ and $\mathcal{L}_{CE_{(t)}}^{T_b}$
     Compute the KD losses $\mathcal{L}_{KD_{(t)}}^{T_a}$ and $\mathcal{L}_{KD_{(t)}}^{T_b}$
     Compute $\alpha = \text{Sat}\left( \frac{\mathcal{L}_{CE_{(t)}}^{S}}{\mathcal{L}_{CE_{(t)}}^{T_a}}, \alpha_{\min}, \alpha_{\max} \right)$
     Compute $\beta = \text{Sat}\left( \frac{\mathcal{L}_{CE_{(t)}}^{S}}{\mathcal{L}_{CE_{(t)}}^{T_b}}, \beta_{\min}, \beta_{\max} \right)$
     Compute the total loss $\mathcal{L}_{total_{(t)}}^{S} = \mathcal{L}_{CE_{(t)}}^{S} + \alpha \tau^{2} \mathcal{L}_{KD_{(t)}}^{T_a} + \beta \tau^{2} \mathcal{L}_{KD_{(t)}}^{T_b}$
  end for
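The weight computation in Algorithm 1 can be sketched in a few lines of Python. The sketch reads the saturation function as a clamp to the interval between its lower and upper bound, which is what the use of $\alpha_{\min}$ and $\alpha_{\max}$ in Algorithm 1 implies. The function names and the bound values (0.5 and 2.0) are hypothetical choices for illustration, since this section does not state the bounds used in the experiments, and a teacher CE loss of exactly zero is not handled.

```python
def sat(x, a, b):
    """Clamp x to the interval [a, b], assuming a < b."""
    return max(a, min(x, b))

def dynamic_kd_weights(ce_student, ce_teacher_a, ce_teacher_b,
                       alpha_bounds=(0.5, 2.0), beta_bounds=(0.5, 2.0)):
    """Return (alpha, beta) as saturated ratios of student CE to teacher CE.

    The bounds are illustrative placeholders, not values from the paper.
    """
    alpha = sat(ce_student / ce_teacher_a, *alpha_bounds)
    beta = sat(ce_student / ce_teacher_b, *beta_bounds)
    return alpha, beta

# Teacher a is much better than the student (ratio 4.0, clamped to 2.0);
# teacher b is much worse than the student (ratio 0.25, clamped to 0.5).
alpha, beta = dynamic_kd_weights(ce_student=1.0, ce_teacher_a=0.25, ce_teacher_b=4.0)

# A ratio inside the bounds passes through unchanged.
alpha_mid, _ = dynamic_kd_weights(ce_student=1.0, ce_teacher_a=0.8, ce_teacher_b=1.0)
```

A teacher whose CE loss is lower than the student's yields a ratio above 1 and therefore a larger KD weight, so the student leans more on the teacher that currently performs better, while the bounds keep either weight from dominating or vanishing.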

IV-D2 Dynamic Ratios for Backward Gradients

Current multimodal methods give little consideration to imbalanced modality optimization. Thus, we propose a dynamic gradient modulation block that balances the optimization level of each modality during multimodal mutual learning, ensuring that all modalities are fully optimized.

We further analyze the imbalanced optimization phenomenon during mutual learning by calculating the backward gradient of each modality. In the backward phase of the training process, when the fusion modality acts as the student, the gradient of the CE loss, $\frac{\partial \mathcal{L}_{CE}^{f}}{\partial G_f^i}$, can be calculated using the following formulas. The CE loss of the fusion modality, denoted by $\mathcal{L}_{CE}^{f}$, is given by

$$\mathcal{L}_{CE}^{f} = -\frac{1}{N} \sum_{i}^{N} \log \frac{e^{G_{f\left(y_i = c\right)}^{i}}}{\sum_{k=1}^{M} e^{G_{f\left(y_i = k\right)}^{i}}}, \tag{10}$$

where $G_{f\left(y_i = c\right)}^{i}$ is the predicted value when the true label $y_i$ equals class $c$, and $M$ is the total number of categories. $G_f^i$ is the $i$-th logit output of the fusion model, calculated as

$$G_f^i = W_f \left( W_f^v \phi_v(\theta_v, F_v^i) + W_f^e \phi_e(\theta_e, F_e^i) \right) + b_f, \tag{11}$$

where $\phi_v(\theta_v, \cdot)$ and $\phi_e(\theta_e, \cdot)$ represent the visual and EEG modality encoders, respectively, with $\theta_v$ and $\theta_e$ as their parameters. $W_f^v$ and $W_f^e$ are the weight matrices that determine the relative significance of the features extracted from the visual and EEG modalities, respectively. $W_f$ and $b_f$ are the parameters of the fusion process.

Combining Eq. (10) and Eq. (11), the gradient can be expressed as

$$\frac{\partial \mathcal{L}_{CE}^{f}}{\partial G_{f\left(y_i = c\right)}^{i}} = \frac{e^{\left( W_f W_f^v \phi_v^i + W_f W_f^e \phi_e^i + b_f \right)_{y_i = c}}}{\sum_{k=1}^{M} e^{\left( W_f W_f^v \phi_v^i + W_f W_f^e \phi_e^i + b_f \right)_{y_i = k}}} - 1_{y_i = c}, \tag{12}$$

$\phi_v(\theta_v, F_v^i)$ and $\phi_e(\theta_e, F_e^i)$ are abbreviated as $\phi_v^i$ and $\phi_e^i$ for convenience. When a particular modality, such as the visual modality, exhibits better classification performance, it contributes more to the gradient $\frac{\partial \mathcal{L}_{CE}^{f}}{\partial G_f^i}$ through the term $W_f W_f^v \phi_v^i$, thereby leading to a lower global loss.
Therefore, the EEG modality, characterized by lower confidence in accurate predictions, will obtain limited optimization during parameter updates via backpropagation. This phenomenon shows that the unimodal model might be overtrained or underoptimized when the training of the multimodal model is about to converge.
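The effect described by Eq. (12) can be illustrated numerically. The following is a minimal sketch, not the authors' code: the logit contributions `visual_part` and `eeg_part` are made-up values standing in for $W_f W_f^v \phi_v^i$ and $W_f W_f^e \phi_e^i$, and the helper names are hypothetical. It shows that once a confident visual term dominates the fused logits, the gradient shared by both branches becomes small, so the weaker EEG branch receives little training signal.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad_wrt_logits(logits, label):
    """Gradient of the cross-entropy loss w.r.t. the logits: softmax minus one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]

# Hypothetical per-modality contributions for one sample whose true class is 0.
visual_part = [3.0, -1.0]  # stands in for W_f W_f^v phi_v^i (confident and correct)
eeg_part = [0.2, 0.1]      # stands in for W_f W_f^e phi_e^i (weak evidence)
bias = [0.0, 0.0]          # stands in for b_f

fused_logits = [v + e + b for v, e, b in zip(visual_part, eeg_part, bias)]

# Gradient that is backpropagated to both branches in the fusion model.
grad_fused = ce_grad_wrt_logits(fused_logits, label=0)
# Gradient the EEG evidence would produce if it were classified on its own.
grad_eeg_alone = ce_grad_wrt_logits(eeg_part, label=0)

print("fused gradient:", grad_fused)
print("EEG-only gradient:", grad_eeg_alone)
```

In this toy setting the fused gradient on the true class is far smaller in magnitude than the EEG-only gradient, which is the under-optimization of the weaker modality that the text describes.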

In this imbalanced training phenomenon, the disparate learning efficiencies of the unimodal models hinder the fusion model, producing under-optimized or overfitted representations that constrain the overall model performance. Thus, our proposed dynamic modulation block aims to dynamically balance the learning levels of the modalities until they reach their optimal performance.

The updating process is presented in Algorithm 2. Specifically, the ratios $R^{S}_{(t)}$, $R^{T_a}_{(t)}$, and $R^{T_b}_{(t)}$ are computed to indicate the current training optimization level of each modality. To keep the training progress consistent and to maximize the completeness and effectiveness of the optimization in multimodal mutual learning, the dynamic learning progress ratio at the $t$-th iteration step is computed as:

$$
R_{(t)}^{DG} = \begin{cases} 1 & \text{if } E_n = 1 \\ \text{Sat}\left(\left(\frac{R_{(t)}^{T_a} + R_{(t)}^{T_b}}{2 \times R_{(t)}^{S}}\right)^{\gamma}, R_{\min}, R_{\max}\right) & \text{if } E_n > 1 \end{cases} \tag{13}
$$

where $R_{\min}$ and $R_{\max}$ represent the lower and upper bounds of the saturation function $\text{Sat}(\cdot)$, and $\gamma$ is a hyper-parameter that controls the degree of modulation.
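As a rough illustration of Eq. (13), the ratio can be computed as below. This is a sketch under stated assumptions: `Sat` is read as a plain clamp to $[R_{\min}, R_{\max}]$, the function names are hypothetical, and the defaults ($\gamma = 3$, $R_{\min} = 0.1$, $R_{\max} = 10$) are the values reported in the experimental settings. The sketch also assumes positive progress ratios and does not handle $R^{S}_{(t)} = 0$.

```python
def sat(x, lower, upper):
    """Saturation: clamp x to the closed interval [lower, upper]."""
    return max(lower, min(upper, x))

def dynamic_ratio(r_s, r_ta, r_tb, epoch, gamma=3.0, r_min=0.1, r_max=10.0):
    """Dynamic learning progress ratio R^DG of Eq. (13).

    r_s, r_ta, r_tb are the progress ratios of S, T_a and T_b at the
    current iteration; epoch is the 1-based epoch index E_n.
    """
    if epoch == 1:
        # No baseline losses exist yet, so no modulation is applied.
        return 1.0
    relative_progress = (r_ta + r_tb) / (2.0 * r_s)
    return sat(relative_progress ** gamma, r_min, r_max)

# S lags behind T_a and T_b: its update is amplified (capped at r_max).
print(dynamic_ratio(r_s=0.1, r_ta=0.9, r_tb=0.9, epoch=2))
# S is ahead of T_a and T_b: its update is damped (floored at r_min).
print(dynamic_ratio(r_s=0.9, r_ta=0.1, r_tb=0.1, epoch=2))
```

Because $\gamma$ is applied as an exponent, values above 1 sharpen the response: small differences in progress are pushed quickly towards the bounds, which is why the clamp is needed.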

The extensively used adaptive moment estimation (Adam) optimization approach is employed in the backpropagation process. The primary idea underlying the Adam algorithm is to maintain a running average of the first-order moment (mean) and the second-order moment (variance) of the gradients. This facilitates the estimation of adaptive learning rates for various parameters during optimization. Using the dynamic modulation ratio $R_{(t)}^{DG}$, the parameters $\theta^{S}$ in the student model are optimized adaptively based on the modality learning progress. The updating process is expressed as

$$
\theta^{S}_{(t+1)} \leftarrow \theta^{S}_{(t)} - R_{(t)}^{DG} \frac{\eta}{\sqrt{D^{\rm var}_{(t)}} + \epsilon} \cdot D^{\rm mean}_{(t)} \tag{14}
$$

where $\eta$ is the learning rate and $\epsilon$ is a small constant added to the denominator for numerical stability. $D^{\rm mean}_{(t)}$ and $D^{\rm var}_{(t)}$ are the exponentially decaying averages of the past gradients and squared gradients, i.e., the first-moment (mean) and second-moment (variance) estimates at step $t$.
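A single update of Eq. (14) can be sketched in plain Python as follows. This is an illustrative sketch rather than the released implementation: the function name is hypothetical, the decay rates `beta1 = 0.9` and `beta2 = 0.999` and `eps = 1e-8` are the common Adam defaults (assumed here, not stated in the text), and bias correction is omitted because Eq. (14) does not include it.

```python
import math

def modulated_adam_step(theta, grad, d_mean, d_var, r_dg,
                        lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update whose step size is scaled by the ratio r_dg.

    theta, grad, d_mean and d_var are lists of floats of equal length.
    Returns the updated parameters and moment estimates.
    """
    # Exponentially decaying average of gradients (first moment).
    new_mean = [beta1 * m + (1.0 - beta1) * g for m, g in zip(d_mean, grad)]
    # Exponentially decaying average of squared gradients (second moment).
    new_var = [beta2 * v + (1.0 - beta2) * g * g for v, g in zip(d_var, grad)]
    # Parameter update of Eq. (14): the usual Adam step multiplied by r_dg.
    new_theta = [p - r_dg * lr / (math.sqrt(v) + eps) * m
                 for p, m, v in zip(theta, new_mean, new_var)]
    return new_theta, new_mean, new_var

theta0, grad0 = [1.0], [0.5]
plain, _, _ = modulated_adam_step(theta0, grad0, [0.0], [0.0], r_dg=1.0)
boosted, _, _ = modulated_adam_step(theta0, grad0, [0.0], [0.0], r_dg=2.0)
print(theta0[0] - plain[0], theta0[0] - boosted[0])
```

With identical gradients and moment estimates, doubling $R_{(t)}^{DG}$ doubles the parameter displacement, so the ratio acts as a per-iteration multiplier on the effective learning rate.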

Incorporating the AMB module into the loss computing process and the optimization process addresses the challenge of modality imbalance in the multimodal mutual learning process. This mechanism ensures that all modalities learn effective knowledge and are optimized at a comparable pace during mutual learning. Consequently, the fusion model achieves enhanced completeness, integration, and efficiency.

Algorithm 2 Dynamic gradients modulation process
Require: Training dataset $X$, iteration number $T$, batch size $B_m$, model parameters $\theta^{S}$, learning rate $\eta$, hyper-parameter $\gamma$, current epoch $E_n$
  for $t = 0, 1, \ldots, T$ do
     Sample minibatch $B_t$ from $X$ with batch size $B_m$
     Feed-forward the batched data $B_t$ to the model
     Compute the losses $\mathcal{L}_{CE_{(t)}}^{S}$, $\mathcal{L}_{CE_{(t)}}^{T_a}$, $\mathcal{L}_{CE_{(t)}}^{T_b}$ and $\ell_{total}\left(x; \theta_{(t)}^{S}\right)$
     if $E_n == 1$ then
        Set $R_{(t)}^{DG} \leftarrow 1$
     else
        Update $R^{S}_{(t)} \leftarrow \frac{\mathcal{L}_{\rm base}^{S} - \mathcal{L}_{CE_{(t)}}^{S}}{\mathcal{L}_{\rm base}^{S}}$, $R_{(t)}^{T_a} \leftarrow \frac{\mathcal{L}_{\rm base}^{T_a} - \mathcal{L}_{CE_{(t)}}^{T_a}}{\mathcal{L}_{\rm base}^{T_a}}$, $R_{(t)}^{T_b} \leftarrow \frac{\mathcal{L}_{\rm base}^{T_b} - \mathcal{L}_{CE_{(t)}}^{T_b}}{\mathcal{L}_{\rm base}^{T_b}}$
        Compute $R_{(t)}^{DG} = \text{Sat}\left(\left(\frac{R_{(t)}^{T_a} + R_{(t)}^{T_b}}{2 \times R_{(t)}^{S}}\right)^{\gamma}, R_{\min}, R_{\max}\right)$
     end if
     Compute $\tilde{g}\left(\theta_{(t)}^{S}\right) = \frac{1}{B_m} \sum_{x \in B_{(t)}} \nabla_{\theta^{S}} \ell_{total}\left(x; \theta_{(t)}^{S}\right)$
     Compute $D^{\rm mean}_{(t)} = \beta_1 \cdot D^{\rm mean}_{(t-1)} + \left(1 - \beta_1\right) \cdot \tilde{g}\left(\theta_{(t)}^{S}\right)$
     Compute $D^{\rm var}_{(t)} = \beta_2 \cdot D^{\rm var}_{(t-1)} + \left(1 - \beta_2\right) \cdot \tilde{g}^{2}\left(\theta_{(t)}^{S}\right)$
     Update $\theta^{S}_{(t+1)} \leftarrow \theta^{S}_{(t)} - R_{(t)}^{DG} \frac{\eta}{\sqrt{D^{\rm var}_{(t)}} + \epsilon} \cdot D^{\rm mean}_{(t)}$
  end for
  if $E_n == 1$ then
     Calculate average CE losses $\mathcal{L}_{\rm base}^{S}, \mathcal{L}_{\rm base}^{T_a}, \mathcal{L}_{\rm base}^{T_b}$
  end if
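The bookkeeping in Algorithm 2 (baseline losses averaged over the first epoch, then progress ratios in later epochs) can be sketched as follows. All loss values below are made up for illustration, and the dictionary keys `"S"`, `"Ta"` and `"Tb"` are hypothetical labels for the three branches.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)

def progress_ratio(base_loss, current_loss):
    """Relative loss reduction with respect to the first-epoch average loss."""
    return (base_loss - current_loss) / base_loss

# Hypothetical CE losses logged at each iteration of epoch 1 (E_n = 1).
first_epoch_losses = {
    "S": [0.70, 0.66, 0.62],
    "Ta": [0.69, 0.60, 0.51],
    "Tb": [0.70, 0.68, 0.66],
}
# End of epoch 1: average CE losses become the baselines L_base.
base = {name: mean(losses) for name, losses in first_epoch_losses.items()}

# Hypothetical CE losses at some iteration t of a later epoch (E_n > 1).
current = {"S": 0.50, "Ta": 0.30, "Tb": 0.60}
ratios = {name: progress_ratio(base[name], current[name]) for name in base}

for name in ("S", "Ta", "Tb"):
    print(name, round(base[name], 4), round(ratios[name], 4))
```

A ratio close to 1 means the branch has removed most of its first-epoch loss, while a ratio near 0 means it has barely improved; these three numbers are the inputs of Eq. (13).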
TABLE I: The comparison studies of the baseline and state-of-the-art methods in subject-independent experiments (S1 = Session 1, S2 = Session 2)

| Method | S1 AUC/Std (%) | S1 F1/Std (%) | S1 ACC/Std (%) | S1 Precision/Std (%) | S1 p-value (AUC) | S2 AUC/Std (%) | S2 F1/Std (%) | S2 ACC/Std (%) | S2 Precision/Std (%) | S2 p-value (AUC) |
|---|---|---|---|---|---|---|---|---|---|---|
| EEGNet | 70.65/6.55 | 69.12/6.07 | 71.68/4.98 | 69.84/6.14 | $<10^{-3}$ | 68.86/5.46 | 67.44/5.76 | 69.51/5.71 | 68.00/5.20 | $<10^{-3}$ |
| MCGRAM | 82.47/7.11 | 75.18/8.47 | 76.69/6.55 | 76.04/9.48 | $<10^{-3}$ | 80.44/5.32 | 72.33/6.12 | 73.59/5.81 | 74.34/4.95 | $<10^{-3}$ |
| AMBOKD-E* | 83.56/6.61 | 77.66/8.21 | 78.59/7.19 | 78.25/8.31 | - | 81.85/5.71 | 75.95/5.75 | 76.71/5.09 | 77.05/5.07 | - |
| ResNet-50 | 84.55/7.54 | 75.45/11.26 | 77.37/9.93 | 79.84/5.21 | $<10^{-3}$ | 84.42/6.63 | 76.04/7.43 | 77.04/7.36 | 79.53/6.03 | $<10^{-3}$ |
| EfficientNet | 78.81/8.89 | 81.82/2.79 | 83.96/2.40 | 87.02/1.47 | $<10^{-3}$ | 84.13/3.92 | 80.68/4.63 | 82.81/4.15 | 86.53/2.34 | $<10^{-3}$ |
| AMBOKD-V* | 88.54/3.50 | 83.08/2.52 | 84.88/2.25 | 87.54/1.49 | - | 87.88/5.22 | 82.22/3.84 | 83.96/3.59 | 87.18/2.15 | - |
| MLB | 86.56/3.69 | 82.31/2.73 | 84.21/2.40 | 86.66/1.62 | $<10^{-3}$ | 84.74/4.42 | 80.95/7.85 | 82.74/6.13 | 85.20/3.64 | $<10^{-3}$ |
| MKD | 90.21/4.45 | 82.87/2.45 | 84.73/2.21 | 87.54/1.39 | $<10^{-3}$ | 89.35/3.54 | 81.80/3.97 | 83.65/3.68 | 87.04/2.12 | $<10^{-3}$ |
| EMKD | 90.51/4.22 | 82.65/2.42 | 84.57/2.18 | 87.42/1.34 | $<10^{-3}$ | 88.93/3.50 | 80.95/4.00 | 83.00/3.68 | 86.62/2.06 | $<10^{-3}$ |
| CA-MKD | 90.52/4.19 | 82.78/2.52 | 84.67/2.24 | 87.46/1.42 | $<10^{-3}$ | 89.10/3.39 | 81.45/4.03 | 83.39/3.70 | 86.83/2.20 | $<10^{-3}$ |
| DML* | 90.46/3.68 | 79.73/4.97 | 82.44/3.85 | 85.89/2.38 | $<10^{-3}$ | 88.87/3.21 | 77.66/5.50 | 80.60/4.52 | 84.84/2.87 | $<10^{-3}$ |
| KDCL* | 89.91/4.22 | 79.05/4.72 | 81.74/3.71 | 84.55/2.25 | $<10^{-3}$ | 89.12/3.11 | 78.70/4.63 | 81.34/4.02 | 85.42/2.30 | $<10^{-3}$ |
| EML* | 89.21/4.02 | 80.25/2.92 | 82.52/2.54 | 84.74/1.92 | $<10^{-3}$ | 90.64/3.06 | 80.89/4.06 | 82.93/3.74 | 86.40/2.16 | $<10^{-3}$ |
| AMBOKD (ours)* | 93.66/2.95 | 83.33/2.51 | 85.08/2.24 | 87.82/1.38 | - | 93.30/2.64 | 82.32/3.66 | 84.05/3.42 | 87.27/2.02 | - |

* Online knowledge distillation method

V Experimental Results

In this section, the experimental settings with the parameter configurations of our proposed approach are first presented. Subsequently, a comprehensive comparison and ablation study are conducted to evaluate the performance of the proposed AMBOKD approach in the dim object recognition task. Then, the effectiveness and generalisability of the OKD and AMB modules incorporated in AMBOKD are evaluated. In addition, the transfer experiments and system verifications by simulating real scenarios are designed to further verify the generalization performance of the proposed method and system under few-shot conditions.

V-A Experimental Settings

V-A1 Configurations

The visual encoder uses EfficientNet [22] as the basic approach for feature extraction from the input images. Furthermore, for ablation analysis, ResNet-50 [18] and MobileNetV2 [19] are used as alternative visual encoders. The EEG encoder is based on our previous study called MCGRAM [8], which is employed to extract features from the EEG data. In addition, TSception [58] and EEGNet [57] are used as alternative EEG encoders for the ablation analysis. The parameters of the visual and the EEG encoders are set to follow the configurations as described in the previous study, providing consistency and comparability with the existing studies. To align the features obtained from the visual and EEG encoders, FC layers are used to scale the lengths of both sets of features to a common length of 64. This ensures compatibility for subsequent processing and fusion of the multimodal features.

The number of heads in the multi-head self-attention module of the fusion model is set to 2. The weight matrices of the FC layers used for alignment have sizes of $1280 \times 64$ and $256 \times 64$ for the visual and EEG features, respectively. The FC layers used to compute the Q, K, and V matrices have a size of $64 \times 64$. The FC layers in the classifier of the fusion domain have a size of $128 \times 2$.

The temperature coefficient $\tau$ is set to 4 in the OKD approach to control the softness of the KD targets. The hyperparameters $\alpha$ and $\beta$ are dynamically adjusted by the AMB module, which contributes to the regularization component of the overall loss function. The hyperparameter $\gamma$ is set to 3 to fine-tune the sensitivity of the AMB module. In addition, we impose lower and upper limits, $R_{\min}$ and $R_{\max}$, at 0.1 and 10, respectively, to confine the range of the dynamic modulation ratio $R^{DG}_{(t)}$ during the training process.
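To make the role of the temperature concrete, the sketch below compares a standard softmax with a temperature-softened one at $\tau = 4$. It assumes the usual formulation in which logits are divided by $\tau$ before the softmax; the logits and the function name are illustrative and not taken from the paper.

```python
import math

def softened_softmax(logits, tau=1.0):
    """Softmax of logits / tau; a larger tau gives a flatter (softer) distribution."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 0.0]  # hypothetical two-class logits (target vs. non-target)
hard = softened_softmax(logits, tau=1.0)
soft = softened_softmax(logits, tau=4.0)

print("tau = 1:", [round(p, 3) for p in hard])
print("tau = 4:", [round(p, 3) for p in soft])
```

The softened distribution keeps the same ranking of classes but assigns more probability mass to the non-argmax class, which is the extra information a distillation target is meant to carry.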

The PyTorch framework is used to implement the AMBOKD model. The Adam optimizer with a learning rate of $10^{-4}$ is used to optimize all network parameters. The model is trained for 15 epochs with a batch size of 64. The optimization loss function comprises the CE loss and the KD loss. A leave-one-subject-out cross-validation approach is used to assess the performance of the AMBOKD model: the data of each subject are chosen in turn as the validation set, whereas the data of the remaining nine subjects are employed as the training set. We repeat this process for each fold with five different seeds, and we obtain the final experimental result by averaging the results of all the folds and seeds.
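The evaluation protocol can be sketched as a leave-one-subject-out loop repeated over seeds. In this sketch the seed values are hypothetical (the text only says five seeds are used), and `evaluate` is a caller-supplied placeholder for training on the nine training subjects and scoring on the held-out one.

```python
def leave_one_subject_out(subjects):
    """Yield (training_subjects, validation_subject) pairs, one fold per subject."""
    for held_out in subjects:
        yield [s for s in subjects if s != held_out], held_out

def cross_validate(evaluate, subjects, seeds):
    """Average evaluate(seed, train, val) over all folds and seeds.

    Returns the mean score and the number of runs that were averaged.
    """
    scores = [evaluate(seed, train, val)
              for seed in seeds
              for train, val in leave_one_subject_out(subjects)]
    return sum(scores) / len(scores), len(scores)

subjects = list(range(1, 11))  # ten subjects
seeds = [0, 1, 2, 3, 4]        # hypothetical seed values

# Dummy evaluator that only reports the training-set size, to show the loop shape.
mean_score, n_runs = cross_validate(lambda seed, train, val: float(len(train)),
                                    subjects, seeds)
print(mean_score, n_runs)
```

With ten subjects and five seeds this produces 50 runs, each trained on nine subjects and validated on the remaining one.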

V-A2 Performance Metrics

A comprehensive set of metrics, including the area under the receiver operating characteristic curve (AUC), accuracy (ACC), F1 score (F1), and precision, is used to evaluate the performance of the classification models. Each metric is reported as a mean and a standard deviation, offering insight into the reliability of the model and its ability to generalize across various datasets or conditions. Specifically, the AUC is emphasized as the main metric because of its effectiveness in measuring the discriminative ability of a model, especially in scenarios with unbalanced class distributions. To evaluate the object detection performance of the brain-eye-computer based system, AP@50:95 and AR@50:95 are used to measure the average precision and recall over Intersection over Union (IoU) thresholds ranging from 0.50 to 0.95, while AP@50 and AR@50 are used for an IoU of 0.50. In addition, F1@50:95 and F1@50 provide a balanced view of accuracy by considering both precision and recall at these thresholds.
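Since the AUC is the main metric, a small reference computation may help. The sketch below uses the pairwise (Mann-Whitney) formulation, in which the AUC is the fraction of positive-negative pairs ranked correctly, with ties counted as one half. The labels and scores are made up, and a library routine would normally be used in practice.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

labels = [1, 0, 1, 0, 0]            # hypothetical target / non-target labels
scores = [0.9, 0.8, 0.7, 0.3, 0.2]  # hypothetical classifier scores
auc = roc_auc(labels, scores)
print(auc)
```

Because the value depends only on the ranking of scores and not on a decision threshold or on the class ratio, it remains informative when targets are much rarer than non-targets.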

TABLE II: The subject-dependent results of the baseline and state-of-the-art methods in comparison studies. All entries are AUC/Std (%).

| Session | Method | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5 | Subject 6 | Subject 7 | Subject 8 | Subject 9 | Subject 10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | EEGNet | 63.57/6.44 | 64.12/6.68 | 61.60/7.72 | 54.55/7.16 | 60.14/8.58 | 61.49/7.17 | 54.22/5.27 | 61.66/6.21 | 69.59/7.25 | 56.30/6.83 | 60.72/6.93 |
| 1 | MCGRAM | 81.67/7.49 | 87.10/5.54 | 86.36/6.09 | 78.20/9.01 | 80.22/6.75 | 82.31/5.68 | 77.47/6.90 | 79.59/6.15 | 91.79/4.71 | 78.29/5.65 | 82.30/6.40 |
| 1 | AMBOKD-E* | 86.98/6.35 | 90.57/4.70 | 88.78/6.01 | 81.64/7.71 | 87.30/4.85 | 88.33/4.83 | 81.41/5.78 | 86.18/4.50 | 93.66/3.80 | 82.56/4.63 | 86.74/5.32 |
| 1 | ResNet-50 | 83.22/6.03 | 81.73/6.00 | 81.67/5.53 | 79.56/6.87 | 80.75/6.48 | 77.86/7.12 | 83.37/6.43 | 79.29/6.69 | 81.68/5.75 | 80.85/6.42 | 81.00/6.33 |
| 1 | EfficientNet | 76.28/9.97 | 74.91/10.87 | 79.74/8.73 | 79.27/8.30 | 76.99/8.40 | 73.59/10.40 | 75.54/8.19 | 76.70/8.96 | 78.75/8.27 | 73.09/9.43 | 76.49/9.15 |
| 1 | AMBOKD-V* | 78.28/10.18 | 82.63/8.81 | 86.31/6.56 | 82.39/7.88 | 83.74/5.84 | 81.64/7.09 | 80.86/8.42 | 82.18/6.91 | 79.97/7.04 | 78.67/8.89 | 81.67/7.76 |
| 1 | MLB | 89.82/4.73 | 87.44/5.40 | 92.07/5.05 | 84.22/8.33 | 86.28/6.54 | 75.79/9.03 | 83.75/6.79 | 83.18/7.33 | 92.33/4.46 | 83.86/7.21 | 85.87/6.49 |
| 1 | MKD | 89.31/5.74 | 89.51/5.12 | 92.87/4.38 | 85.81/7.00 | 88.46/5.32 | 82.62/6.12 | 87.41/6.28 | 87.59/5.12 | 93.46/3.45 | 85.32/6.81 | 88.24/5.53 |
| 1 | EMKD | 90.38/4.99 | 89.56/5.15 | 92.93/4.30 | 85.73/6.72 | 88.70/5.15 | 82.20/6.06 | 86.62/6.62 | 86.38/5.89 | 94.06/3.24 | 85.28/6.92 | 88.18/5.50 |
| 1 | CA-MKD | 90.23/4.97 | 89.62/5.12 | 92.45/4.43 | 85.93/7.29 | 87.52/5.81 | 81.72/7.05 | 86.17/6.43 | 87.72/4.90 | 93.11/3.98 | 85.62/6.31 | 88.01/5.63 |
| 1 | DML* | 91.02/4.73 | 89.68/4.61 | 92.44/4.35 | 86.63/7.18 | 88.35/5.49 | 89.14/4.49 | 87.64/5.70 | 89.45/4.22 | 94.07/3.28 | 87.73/5.43 | 89.62/4.95 |
| 1 | KDCL* | 91.04/5.00 | 89.45/4.97 | 91.51/5.11 | 86.35/7.20 | 87.30/6.23 | 87.82/5.71 | 86.91/6.12 | 87.43/4.60 | 93.83/3.82 | 85.23/7.05 | 88.69/5.58 |
| 1 | EML* | 88.50/5.88 | 83.49/7.64 | 87.76/5.82 | 83.82/7.27 | 84.06/7.61 | 85.95/6.25 | 81.47/8.63 | 84.48/5.73 | 90.63/4.44 | 80.23/7.66 | 85.04/6.69 |
| 1 | AMBOKD (ours)* | 94.29/3.89 | 94.29/3.63 | 95.37/3.30 | 89.56/5.53 | 93.99/3.49 | 94.50/3.30 | 92.34/5.25 | 93.80/3.52 | 96.69/2.64 | 91.78/3.56 | 93.63/3.81 |
| 2 | EEGNet | 58.40/6.78 | 66.72/6.18 | 55.52/5.99 | 59.68/7.43 | 60.74/8.02 | 57.46/6.14 | 58.15/8.59 | 67.20/6.64 | 62.72/7.64 | 57.44/6.60 | 60.40/7.00 |
| 2 | MCGRAM | 69.29/8.21 | 86.73/5.49 | 72.86/6.83 | 79.39/7.15 | 79.54/7.29 | 78.40/7.25 | 74.39/7.99 | 86.32/4.77 | 90.01/4.01 | 77.03/5.56 | 79.40/6.46 |
| 2 | AMBOKD-E* | 74.87/7.51 | 90.37/4.73 | 78.00/6.94 | 87.15/4.67 | 85.99/6.60 | 82.78/7.19 | 79.37/6.62 | 90.21/3.56 | 92.50/3.17 | 83.13/5.11 | 88.39/5.61 |
| 2 | ResNet-50 | 83.49/6.72 | 80.59/6.89 | 82.76/6.37 | 79.72/5.42 | 84.70/5.41 | 79.40/6.48 | 80.23/6.51 | 82.58/6.40 | 78.43/5.51 | 80.55/6.37 | 81.25/6.21 |
| 2 | EfficientNet | 73.77/9.28 | 80.58/8.90 | 73.05/10.96 | 73.54/11.58 | 74.76/10.13 | 75.96/10.85 | 74.12/8.66 | 74.42/8.60 | 75.33/10.74 | 72.65/10.40 | 74.82/10.01 |
| 2 | AMBOKD-V* | 85.02/6.04 | 86.06/6.74 | 86.33/72.74 | 83.79/6.95 | 82.37/7.33 | 83.66/7.08 | 78.47/72.58 | 74.87/7.78 | 79.45/7.52 | 77.37/8.52 | 81.74/7.25 |
| 2 | MLB | 78.24/8.62 | 90.62/6.12 | 80.23/6.53 | 85.27/5.69 | 82.99/7.49 | 77.27/8.72 | 85.00/5.92 | 83.19/5.93 | 89.99/5.39 | 85.18/5.83 | 83.80/6.62 |
| 2 | MKD | 81.98/7.78 | 92.82/3.91 | 79.38/7.58 | 84.04/6.68 | 85.08/6.22 | 81.48/7.88 | 85.15/6.48 | 87.82/5.19 | 91.60/4.86 | 85.29/6.52 | 85.46/6.31 |
| 2 | EMKD | 82.38/8.14 | 93.69/3.66 | 80.53/6.96 | 84.86/6.43 | 85.53/6.13 | 78.71/8.58 | 85.12/6.50 | 87.12/5.97 | 91.88/4.50 | 84.75/6.69 | 85.46/6.36 |
| 2 | CA-MKD | 86.01/8.41 | 94.19/3.82 | 85.09/5.96 | 84.16/6.64 | 84.05/6.99 | 79.97/8.15 | 85.35/6.22 | 87.40/5.16 | 93.72/3.69 | 84.40/7.17 | 86.43/6.22 |
| 2 | DML* | 81.09/8.21 | 89.87/4.85 | 81.62/6.94 | 85.18/5.94 | 88.33/5.25 | 87.97/4.57 | 85.89/6.29 | 89.82/3.82 | 92.58/4.01 | 83.76/6.28 | 86.61/5.62 |
| 2 | KDCL* | 79.15/8.96 | 86.05/6.04 | 78.99/6.65 | 83.08/7.13 | 86.93/5.31 | 87.12/4.45 | 84.96/6.85 | 88.41/3.94 | 92.74/4.12 | 85.20/6.32 | 85.26/5.98 |
| 2 | EML* | 81.09/8.18 | 83.99/7.24 | 79.19/6.87 | 81.25/7.49 | 87.51/4.86 | 85.89/6.38 | 83.90/6.09 | 87.89/4.76 | 89.21/5.30 | 83.53/6.52 | 84.34/6.37 |
| 2 | AMBOKD (ours)* | 89.56/5.20 | 96.97/2.40 | 90.37/4.76 | 93.94/3.61 | 94.09/4.00 | 93.08/3.99 | 90.54/4.64 | 95.33/2.44 | 96.58/2.99 | 91.83/3.96 | 93.23/3.80 |

* Online knowledge distillation method

V-B Comparison Results

V-B1 Comparison results on ESSVP dataset

To verify the effectiveness of the proposed AMBOKD approach, we conduct a series of comparison experiments with baseline and state-of-the-art models, including the unimodal approaches EEGNet [57], MCGRAM [8], ResNet-50 [18], and EfficientNet [22]; the multimodal fusion approach MLB [62]; the multimodal knowledge distillation approaches MKD [49], EMKD [51], and CA-MKD [50]; and the multimodal online knowledge distillation approaches DML [47], KDCL [52], and EML [53]. In addition, the EEG and visual unimodal models trained within AMBOKD are named AMBOKD-E and AMBOKD-V, respectively.

As indicated in Table I, our proposed AMBOKD achieves the best result on every reported metric in the dim object recognition task of the subject-independent experiments (e.g., an AUC of 93.66% in Session 1 and 93.30% in Session 2), surpassing the baseline and state-of-the-art approaches. Meanwhile, the mutual learning mode with the AMB module enhances the performance of the EEG and visual modality models: all four reported metrics of AMBOKD-E and AMBOKD-V surpass those of the original unimodal models MCGRAM and EfficientNet. The AUC-based ANOVA statistical test findings demonstrate that our proposed AMBOKD, AMBOKD-E, and AMBOKD-V significantly outperform all comparison approaches in subject-independent experiments ($p < 0.05$).

Table II shows the comparison results obtained from subject-dependent experiments. Notably, optimal results are achieved across all subjects using our approach, indicating significant advantages. This finding serves as a compelling demonstration of the effectiveness and adaptability of our method when applied to subject-specific experiments within practical scenarios.

These results lead to the conclusion that the AMBOKD method effectively combines the cognitive and visual domains, extracting fused global representations. Furthermore, it enables dynamic equilibrium mutual learning between unimodality models and the fusion modality during the training process, leading to improved performance in the fusion, EEG, and visual models.

V-B2 Comparison results on CIFAR-100 dataset

We conduct experiments on the CIFAR-100 dataset [63], following the protocol of previous work [64]. The baseline methods include offline knowledge distillation methods (Vanilla [15], PKT [65]), representation knowledge distillation methods (CRD [66], RKD [67]), and online knowledge distillation methods (DML [47], KDCL [52], FT-KD [64], PESF-KD [64]). The teacher-student network pairs used in the experiment are ResNet56-ResNet20, ResNet110-ResNet32, ResNet56-VGG8, and VGG13-VGG8.

We use the same training setup as described in [64], employing an SGD optimizer with a momentum of 0.9, a batch size of 64, and a weight decay of $5\times 10^{-4}$. The initial learning rate is set to 0.05 and is divided by 10 every 30 epochs starting from epoch 150. For the AMB module, we specify a learning rate of $1\times 10^{-4}$. The comparison results over three different random seeds, presented as top-1 accuracy (mean/standard deviation), are shown in Table III. Our AMBOKD achieves the best performance among all compared methods, demonstrating its robustness and practicability.
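The step schedule described above can be sketched as a small function. This is a minimal illustration, assuming the learning rate is divided by 10 at epochs 150, 180, 210, and so on; the function name `learning_rate` and its parameters are hypothetical and are not taken from the released code.

```python
def learning_rate(epoch, base_lr=0.05, first_decay=150, interval=30, factor=10.0):
    """Step schedule: divide base_lr by `factor` at `first_decay`, then every `interval` epochs."""
    if epoch < first_decay:
        return base_lr
    # Number of decay steps applied so far (1 at epoch 150, 2 at epoch 180, ...).
    n_decays = (epoch - first_decay) // interval + 1
    return base_lr / (factor ** n_decays)


# Learning rate at a few representative epochs.
schedule = {e: learning_rate(e) for e in (0, 149, 150, 179, 180, 210)}
```

Under these assumptions the rate stays at 0.05 until epoch 149 and drops by one order of magnitude at each subsequent milestone.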

TABLE III: Comparison results on CIFAR-100 Test set
Method ACC/Std (%)
ResNet56-20 ResNet110-32 VGG13-8 ResNet56-VGG8
Vanilla [15] 70.95/0.51 73.08/0.42 73.36/0.24 73.98/0.33
PKT [65] 71.27/- 73.67/- 73.40/- 74.10/-
CRD [66] 71.44/- 73.62/- 73.31/- 74.06/-
RKD [67] 71.47/- 73.53/- 74.15/- 73.35/-
KDCL [52] 70.11/- 72.87/- 73.99/- 73.16/-
DML [47] 71.40/- 72.21/- 74.18/- 73.86/-
FT-KD [64] 71.65/0.11 73.90/0.22 73.52/0.14 74.40/0.20
PESF-KD [64] 71.84/0.27 74.23/0.26 74.74/0.39 74.67/0.28
AMBOKD (ours) 72.26/0.16 74.35/0.27 74.95/0.26 74.84/0.27

V-C Ablation Analysis

Figure 10: Visualization of latent representations by the t-SNE algorithm in the ablation studies.

We design ablation experiments and analyze the results both quantitatively and through visualization.

V-C1 Data Analysis

The impact of incorporating OKD with mutual learning and AMB in multimodal learning is investigated through a detailed ablation analysis. Specifically, the role of each component is examined by progressively simplifying the proposed AMBOKD approach. MMOKD-DG and MMOKD-DK are variants of our method that keep only the dynamic gradients block and the dynamic weights block of the AMB module, respectively. MMOKD omits the entire AMB component, serving as a baseline to evaluate the underlying OKD and mutual learning mechanism. MKD further eliminates the online training method, representing a multi-teacher KD framework. The V-KD and E-KD approaches are further simplifications that use single-teacher KD from the visual and EEG modalities, respectively. Lastly, AMM is the most basic form of our model, which depends solely on a multi-head self-attention mechanism for modality fusion, without any KD. In addition, our analysis incorporates the unimodal encoders MCGRAM for EEG and EfficientNet for the visual domain.

The ablation results are presented in Table IV. The unimodal encoders MCGRAM and EfficientNet are outperformed by the AMM model, which validates the capacity of the multi-head self-attention mechanism to integrate multimodal data. The results of E-KD, V-KD, and MKD further verify that knowledge distillation improves the performance of the multimodal fusion model. MMOKD outperforms MKD by 1.37% and 1.90% in AUC in sessions 1 and 2, respectively, which demonstrates the effectiveness of mutual learning in achieving full modal interaction and improving multimodal model performance. MMOKD-DG and MMOKD-DK improve further on MMOKD, which indicates that adaptively balancing the gradients and the interaction between the modality models provides a more effective training scheme. Our full method, AMBOKD, combines the two dynamic balancing schemes and achieves the best results, 2.08% and 2.05% higher than MMOKD in sessions 1 and 2, demonstrating the significant potential of modality balancing in multimodal mutual learning tasks.
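The AUC gains quoted above follow directly from the values in Table IV. A short check, using only the table entries (the helper name `gain` is hypothetical):

```python
# AUC (%) per session (session 1, session 2), copied from Table IV.
auc = {
    "MKD": (90.21, 89.35),
    "MMOKD": (91.58, 91.25),
    "AMBOKD": (93.66, 93.30),
}


def gain(method, reference):
    """AUC gain of `method` over `reference` in percentage points, per session."""
    return tuple(round(m - r, 2) for m, r in zip(auc[method], auc[reference]))


okd_gain = gain("MMOKD", "MKD")      # contribution of online mutual learning
amb_gain = gain("AMBOKD", "MMOKD")   # contribution of the full AMB module
```

Here `okd_gain` and `amb_gain` hold the session-wise differences, which are absolute differences in AUC (percentage points) rather than relative improvements.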

TABLE IV: The ablation studies of different modules in subject-independent experiments
Method Session 1 Session 2
AUC/Std (%) p-value (AUC) AUC/Std (%) p-value (AUC)
MCGRAM 82.47/7.11 $<10^{-3}$ 80.44/5.32 $<10^{-3}$
EfficientNet 78.81/8.89 $<10^{-3}$ 84.13/3.92 $<10^{-3}$
AMM 87.16/3.62 $<10^{-3}$ 87.75/3.64 $<10^{-3}$
E-KD 88.48/4.28 $<10^{-3}$ 88.68/3.73 $<10^{-3}$
V-KD 88.08/4.20 $<10^{-3}$ 87.53/4.76 $<10^{-3}$
MKD 90.21/4.45 $<10^{-3}$ 89.35/3.54 $<10^{-3}$
MMOKD 91.58/3.89 $<10^{-3}$ 91.25/2.47 $<10^{-3}$
MMOKD-DK 91.90/3.48 $<10^{-3}$ 91.61/2.92 $<10^{-3}$
MMOKD-DG 92.31/3.54 $<10^{-3}$ 92.59/2.21 $<10^{-3}$
AMBOKD (ours) 93.66/2.95 - 93.30/2.64 -

Figure 11: Changes in the dynamic learning progress ratio of MMOKD, MMOKD-DG, and AMBOKD during training.

V-C2 Visualization Analysis

The latent representations of the models in the ablation analysis are visualized using t-distributed stochastic neighbor embedding (t-SNE) to further verify the feature extraction and fusion capability of the proposed approach. As depicted in Fig. 10, the feature distributions of all subjects are displayed as colored dots, with red dots representing features of the target class and dots in other colors representing features of the nontarget class.

Specifically, the unimodal models EfficientNet and MCGRAM exhibit distinct visual and EEG modality characteristics in their feature representations, as illustrated in Fig. 10. The vision-based EfficientNet model yields a more dispersed feature representation with a large intra-class distance, indicating superior feature extraction but poor classifier generalization. Conversely, the feature representation of the EEG-based MCGRAM model is more clustered, with a small intra-class distance, indicating weaker feature extraction but stronger classifier generalization. As a multimodal fusion model, AMM combines the benefits of the vision and EEG modalities, reducing the intra-class distance while maintaining the feature representation capability. By integrating single-teacher guidance into the multimodal fusion model, the V-KD and E-KD models exhibit enhanced classification performance and generalization, as reflected in larger inter-class and smaller intra-class distances in the feature representation. In addition, with the multi-teacher mechanism, the OKD approach, and the multimodal AMB approach, the feature representations extracted by the MKD, MMOKD, and AMBOKD models become more discriminative, and the distances within the same class become smaller. This indicates that these mechanisms enhance the capability of multimodal fusion and improve the classifier’s accuracy and generalization.
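The intra-class and inter-class distances discussed above can be made concrete with a small computation. The sketch below is only an illustration on made-up 2-D points; the function names and the choice of average pairwise distance (intra-class) and centroid distance (inter-class) are assumptions, since the text does not state how these quantities would be measured.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Mean point of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def intra_class_distance(points):
    """Average pairwise Euclidean distance between points of one class."""
    pairs = list(combinations(points, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def inter_class_distance(class_a, class_b):
    """Euclidean distance between the centroids of two classes."""
    return dist(centroid(class_a), centroid(class_b))


# Toy 2-D embeddings (made up): a compact target cluster and a nontarget cluster.
target = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
nontarget = [(10.0, 0.0), (10.0, 2.0), (12.0, 0.0), (12.0, 2.0)]

intra = intra_class_distance(target)
inter = inter_class_distance(target, nontarget)
```

A well-separated representation has a small `intra` relative to `inter`. Because t-SNE does not preserve global distances, such measurements are more meaningful on the original latent features than on the 2-D embedding.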

In conclusion, the ablation study verifies the effectiveness of the AMBOKD framework and its modules. The proposed AMBOKD approach outperforms the other methods in the multimodal fusion task, exhibiting robust discrimination between positive and negative samples, a larger inter-class distance, a smaller intra-class distance, and statistically significant improvements ($p<0.05$).

V-D Adaptive Modality Balancing Analysis

Figure 12: Sensitivity analysis on AMB module.

Figure 13: Performance of the original unimodal models and of the unimodal models trained with AMBOKD in the generalization analysis.

We then analyze the effectiveness of the AMB module and its sensitivity to the control hyperparameter.

V-D1 The effectiveness of the AMB

We use the dynamic learning progress ratio $R^{DG}_{(t)}$ to assess the effectiveness of the modality equilibrium. The training data of the first fold of the 10-fold cross-validation, with a random seed of 1, is used.

As illustrated in Fig. 11, we plot the changes of $R^{DG}_{(t)}$ for each modality within MMOKD, MMOKD-DG, and AMBOKD. In the standard MMOKD approach, despite the intermodal learning capabilities, optimization is impeded by the inherent characteristics of each modality. Notably, the EEG modality progresses more slowly than the visual and fusion modalities, which optimize faster. This discrepancy in training speeds results in a suboptimal mutual learning effect among the modalities. The dynamic gradient (DG) block addresses this issue by dynamically adjusting the training gradients according to each modality’s learning progress. With this adjustment, MMOKD-DG improves the optimization speed and the final optimization degree of each modality, especially the EEG modality. AMBOKD further employs a dynamic weight modulation module tailored to the KL loss, optimizing the faster-learning visual and fusion modalities initially while also supporting sustained development of the EEG modality. This strategy ensures that all three modalities are enhanced in a balanced manner throughout the training process, achieving superior overall performance.
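To make the idea of progress-dependent gradient adjustment concrete, the sketch below damps the gradient scale of modalities that are ahead of the slowest one. It is an illustration only and not the formulation used in this paper: the progress scores, the tanh-based damping rule (in the spirit of the on-the-fly gradient modulation of [60]), the parameter `alpha`, and all names are assumptions.

```python
from math import tanh


def gradient_scales(progress, alpha=1.0):
    """Map per-modality progress scores to gradient scaling coefficients in (0, 1].

    `progress` maps a modality name to a positive score where a larger value
    means "learning faster". The slowest modality keeps a scale of 1.0, and
    faster modalities are damped more strongly the further ahead they are.
    """
    slowest = min(progress.values())
    scales = {}
    for name, score in progress.items():
        lead = score / slowest - 1.0           # 0 for the slowest modality
        scales[name] = 1.0 - tanh(alpha * lead)
    return scales


# Hypothetical progress scores: EEG lags behind the visual and fusion branches.
scales = gradient_scales({"eeg": 1.0, "visual": 1.5, "fusion": 2.0}, alpha=1.0)
```

In a training loop, each modality's gradients would be multiplied by its coefficient before the optimizer step, so the lagging EEG branch is updated at full strength while the faster branches are slowed down.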

V-D2 The sensitivity of the AMB

The sensitivity of the AMB module is controlled by the hyperparameter $\gamma$ in Eq. (14). As shown in Fig. 12, the AMBOKD method with $\gamma=3$ achieves the highest AUC in session 1 and the second-highest AUC in session 2, resulting in improved classification performance for imbalanced samples.

TABLE V: The subject-independent results of the baseline and state-of-the-art methods in generalization analysis
Method Session 1 Session 2
AUC/Std (%) F1/Std (%) AUC/Std (%) F1/Std (%)
MLB 75.05/10.33 74.75/6.91 77.54/9.43 78.17/9.33
MKD 85.38/6.53 76.85/7.69 82.86/6.69 74.13/9.69
EMKD 86.56/6.65 77.08/7.49 83.61/7.24 74.78/9.04
CA-MKD 86.57/6.82 75.64/6.97 84.02/6.97 73.03/10.38
DML 87.76/6.21 76.24/7.98 86.53/5.33 75.27/8.86
KDCL 88.13/6.44 77.75/7.32 86.91/5.29 74.98/7.92
EML 87.33/6.99 75.91/7.34 87.36/6.05 76.15/8.47
AMBOKD 88.52/5.57 80.34/7.70 87.76/5.53 78.98/7.88

V-E Generalization Analysis

To further analyze the generalization performance of the AMBOKD method, we conduct experiments at the algorithm level and the data level.

V-E1 Generalization Analysis on Algorithm Level

Comparative experiments are conducted by replacing the unimodal backbone networks used in the EEG and visual encoders. These experiments validate not only the generalization capability of the proposed AMB module in multimodal learning but also its ability to improve unimodal performance. We test the following combinations: EEGNet + EfficientNet, TSception + EfficientNet, MCGRAM + MobileNet, MCGRAM + ResNet50, EEGNet + MobileNet, and TSception + ResNet50.

As shown in Fig. 13, the unimodal models trained with the AMBOKD approach consistently achieve higher performance than the same models trained on their own in all combinations. In addition, some unimodal models, such as the EEGNet model in Fig. 13 (a)-(b) and the ResNet50 model in Fig. 13 (f), show considerable performance improvements over their original counterparts after being dynamically trained with AMBOKD. For example, EEGNet trained under AMBOKD outperforms the original model in AUC by 7.7%, and ResNet50 trained under AMBOKD shows an improvement of 9.3% in AUC (see Fig. 13 (a), (c)). These findings indicate the superiority and generalization capability of AMBOKD in multimodal learning tasks.

Figure 14: The brain-eye-computer system verification for the task of dim object recognition while monitoring unmanned aerial vehicles.

V-E2 Generalization Analysis on Data Level

We verify the generalization of the AMBOKD method by conducting transfer experiments under few-sample conditions. In this experiment, we take the best saved model trained for armored vehicle recognition, as detailed in Section V-B1. This model is fine-tuned using 30 real-world scene samples and is deployed to detect vehicle targets within these real scenes. The comparative results, presented in Table V, demonstrate that our method achieves superior performance in both session 1 and session 2, which verifies its effectiveness and generalization ability.

Figure 15: Results of the physical verification experiment.

V-F Physical Verification

To further verify the practicality and effectiveness of the built system and the proposed method, we carry out physical experiments in the application scenario of UAV ground control station operators, as illustrated in Fig. 14. During the system experiment, subjects are positioned in front of a computer screen with proper body alignment, focusing their gaze directly on the frontal display while performing the ESSVP paradigm. They are tasked with controlling the system and issuing appropriate commands to the UAV upon detecting abnormal displays on the UAV monitoring interface located to their left. Meanwhile, the ESSVP paradigm temporarily halts the playback of visual stimuli to accommodate this interaction.

In the system verification experiment, the models refined in Section V-E are employed directly to process the multimodal data and detect the target. As shown in Fig. 15, the AMBOKD method achieves the highest performance in all metrics compared with the baseline and state-of-the-art methods. These experimental findings demonstrate the effectiveness of the brain-eye-computer based object detection system in real-world settings, indicating robustness and high accuracy in practical tasks.

TABLE VI: Comparison between human, computer, and the brain-eye-computer based system
Method AP@50:95 (%) AR@50:95 (%) F1@50:95 (%) Total Time (s)
Human 78.28/18.20 67.33/20.26 72.00/18.71 1026.00/362.00
Computer 23.44/5.49 30.12/6.89 26.36/6.10 3.00/0
Brain-eye-computer 89.26/6.76 90.32/4.01 87.98/6.37 245.37/1.38

Furthermore, we compare our brain-eye-computer based system with traditional manual and computer vision approaches, as shown in Table VI. In the human-based experiment, five subjects were asked to detect and annotate vehicle targets in 60 aerial images. The results show large variability in detection time and accuracy among subjects, underscoring the inefficiency and unreliability of purely manual methods in complex scenarios. In the computer-based experiment, we adopt the same training setup as for the brain-eye-computer based system: a Faster R-CNN model with a ResNet50 backbone, trained on the public dataset, was fine-tuned with 30 samples and then tested on 60 images. Although this method achieves rapid detection, its accuracy is notably compromised by the limited training samples. In contrast, our integrated brain-eye-computer system achieves superior performance with a total processing time of 245.37 s (240 s for the ESSVP paradigm and 5.37 s for processing). This result demonstrates not only the efficiency of the system but also its robust generalization across different test conditions.
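The timing figures above can be checked with a few lines of arithmetic. The inputs are taken from the text and Table VI; the ratio to the mean human time is not reported in the paper and is derived here only for illustration, and the variable names are hypothetical.

```python
# Timing figures in seconds, as reported in the text and Table VI.
paradigm_time = 240.0     # ESSVP paradigm
processing_time = 5.37    # processing
human_time = 1026.00      # mean total time of the human-only experiment

# Total time of the brain-eye-computer system.
system_time = paradigm_time + processing_time
# Derived quantity: how many times faster the system is than manual annotation.
speedup_vs_human = human_time / system_time
```

Under these figures the system total matches the reported 245.37 s and is roughly four times faster than the mean manual annotation time.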

VI Conclusion

This study builds a brain-eye-computer based system for object detection in aerial images under few-shot conditions. The system detects suspicious targets in aerial images using a region proposal network. After obtaining the images with the region proposals, it elicits subjects’ ERP signals during target search with the ESSVP paradigm and constructs EEG-image data pairs incorporating eye movement data. These pairs are then recognized using the proposed AMBOKD method. AMBOKD extracts and fuses crucial information from EEG and visual modality features, facilitates end-to-end mutual learning, and improves adaptive multimodal interaction through the AMB module. Experimental results demonstrate the effectiveness and superiority of our method and its ability to improve the performance of unimodal models. Lastly, the feasibility and transferability of the AMBOKD method and the brain-eye-computer based system are verified through experiments with practical scenario images. This study opens new possibilities for robust and efficient dim object detection in aerial applications.

In the future, we will further develop real-time online systems across a wider range of scenarios. We aim to design more efficient experimental paradigms and methods that take full advantage of multimodal characteristics, enhancing the effectiveness and adaptability of multimodal fusion methods.

References

  • [1] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun, “Deep learning for 3d point clouds: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 12, pp. 4338–4364, 2021.
  • [2] X. Xie, C. Lang, S. Miao, G. Cheng, K. Li, and J. Han, “Mutual-assistance learning for object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15171–15184, 2023.
  • [3] G. Cheng, X. Yuan, X. Yao, K. Yan, Q. Zeng, X. Xie, and J. Han, “Towards large-scale small object detection: Survey and benchmarks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 13467–13488, 2023.
  • [4] Z. Jia, X. Xu, J. Hu, and Y. Shi, “Low-power object-detection challenge on unmanned aerial vehicles,” Nature Machine Intelligence, vol. 4, no. 12, pp. 1265–1266, 2022.
  • [5] S. Gao, Y. Wang, X. Gao, and B. Hong, “Visual and auditory brain–computer interfaces,” IEEE Transactions on Biomedical Engineering, vol. 61, no. 5, pp. 1436–1447, 2014.
  • [6] S. Lees, N. Dayan, H. Cecotti, P. Mccullagh, L. P. Maguire, F. Lotte, and D. H. Coyle, “A review of rapid serial visual presentation-based brain–computer interfaces,” Journal of Neural Engineering, vol. 15, 2018.
  • [7] Z. Lan, C. Yan, Z. Li, D. Tang, and X. Xiang, “MACRO: multi-attention convolutional recurrent model for subject-independent ERP detection,” IEEE Signal Processing Letters, vol. 28, pp. 1505–1509, 2021.
  • [8] Z. Li, C. Yan, Z. Lan, D. Tang, and X. Xiang, “MCGRAM: Linking multi-scale cnn with a graph-based recurrent attention model for subject-independent ERP detection,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 12, pp. 5199–5203, 2022.
  • [9] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2018.
  • [10] D. Liu, W. Dai, H. Zhang, X. **, J. Cao, and W. Kong, “Brain-machine coupled learning method for facial emotion recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10703–10717, 2023.
  • [11] W. Wang, D. Tran, and M. Feiszli, “What makes training multi-modal classification networks hard?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 12 695–12 705.
  • [12] W. Chango, J. A. Lara, R. Cerezo, and C. Romero, “A review on data fusion in multimodal learning analytics and educational data mining,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 12, 2022.
  • [13] Z. Wei, H. Pan, L. Qiao, X. Niu, P. Dong, and D. Li, “Cross-modal knowledge distillation in multi-modal fake news detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 4733–4737.
  • [14] J. Guo, J. Zhang, S. Li, X. Zhang, and M. Ma, “Mtfd: Multi-teacher fusion distillation for compressed video action recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
  • [15] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015.
  • [16] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4320–4328.
  • [17] W. Lin, Y. Li, Y. Ding, and H. Zheng, “Tree-structured auxiliary online knowledge distillation,” pp. 1–8, 2022.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [19] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520.
  • [20] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2015.
  • [21] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.
  • [22] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning, 2019, pp. 6105–6114.
  • [23] Y. Duan, Z. Li, X. Tao, Q. Li, S. Hu, and J. Lu, “EEG-based maritime object detection for IoT-driven surveillance systems in smart ocean,” IEEE Internet of Things Journal, vol. 7, no. 10, pp. 9678–9687, 2020.
  • [24] X. Zheng and W. Chen, “An attention-based bi-LSTM method for visual object classification via EEG,” Biomedical Signal Processing and Control, vol. 63, p. 102174, 2021.
  • [25] C.-C. Tsai and W. Liang, “Event-related components are structurally represented by intrinsic event-related potentials,” Scientific Reports, vol. 11, no. 1, p. 5670, 2021.
  • [26] L. Fan, H. Shen, F. Xie, J. Su, Y. Yu, and D. Hu, “Dc-tcnn: A deep model for EEG-based detection of dim targets,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 30, pp. 1727–1736, 2022.
  • [27] N. Bigdely-Shamlo, A. Vankov, R. R. Ramirez, and S. Makeig, “Brain activity-based image classification from rapid serial visual presentation,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 16, no. 5, pp. 432–441, 2008.
  • [28] S. Zhang, Y. Wang, L. Zhang, and X. Gao, “A benchmark dataset for RSVP-based brain–computer interfaces,” Frontiers in Neuroscience, vol. 14, p. 568000, 2020.
  • [29] C. Barngrover, A. Althoff, P. DeGuzman, and R. Kastner, “A brain–computer interface (BCI) for the detection of mine-like objects in sidescan sonar imagery,” IEEE Journal of Oceanic Engineering, vol. 41, no. 1, pp. 123–138, 2015.
  • [30] L. Huang, Y. Zhao, Y. Zeng, and Z. Lin, “BHCR: RSVP target retrieval BCI framework coupling with CNN by a Bayesian method,” Neurocomputing, vol. 238, pp. 255–268, 2017.
  • [31] R. Manor, L. Mishali, and A. B. Geva, “Multimodal neural network for rapid serial visual presentation brain computer interface,” Frontiers in Computational Neuroscience, vol. 10, p. 130, 2016.
  • [32] C. Du, K. Fu, J. Li, and H. He, “Decoding visual neural representations by multimodal learning of brain-visual-linguistic features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10 760–10 777, 2023.
  • [33] J. Zhu, C. Yang, X. Xie, S. Wei, Y. Li, X. Li, and B. Hu, “Mutual information based fusion model (MIBFM): mild depression recognition using EEG and pupil area signals,” IEEE Transactions on Affective Computing, 2022.
  • [34] W. Zhou, S. Dong, J. Lei, and L. Yu, “MTANet: Multitask-aware network with hierarchical multimodal fusion for RGB-T urban scene understanding,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 48–58, 2022.
  • [35] P. Xu, X. Zhu, and D. A. Clifton, “Multimodal learning with transformers: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12 113–12 132, 2023.
  • [36] Y. Li, Y. Wang, and Z. Cui, “Decoupled multimodal distilling for emotion recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 6631–6640.
  • [37] S. Ren, Y. Du, J. Lv, G. Han, and S. He, “Learning from the master: Distilling cross-modal advanced knowledge for lip reading,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 13 325–13 333.
  • [38] H. Zhou, W. Zhou, W. Qi, J. Pu, and H. Li, “Improving sign language translation with monolingual data by sign back-translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 1316–1325.
  • [39] N. Ding, S.-w. Tian, and L. Yu, “A multimodal fusion method for sarcasm detection based on late fusion,” Multimedia Tools and Applications, vol. 81, no. 6, pp. 8597–8616, 2022.
  • [40] R. Yang, S. Wang, Y. Sun, H. Zhang, Y. Liao, Y. Gu, B. Hou, and L. Jiao, “Multimodal fusion remote sensing image–audio retrieval,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 6220–6235, 2022.
  • [41] A. Nagrani, S. Yang, A. Arnab, A. Jansen, C. Schmid, and C. Sun, “Attention bottlenecks for multimodal fusion,” Advances in Neural Information Processing Systems, vol. 34, pp. 14 200–14 213, 2021.
  • [42] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang, “Structured knowledge distillation for semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2604–2613.
  • [43] X. Chen, Q. Cao, Y. Zhong, J. Zhang, S. Gao, and D. Tao, “Dearkd: data-efficient early knowledge distillation for vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12 052–12 062.
  • [44] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, 2019, pp. 4171–4186.
  • [45] M. Ryu, G. Lee, and K. Lee, “Knowledge distillation for bert unsupervised domain adaptation,” Knowledge and Information Systems, vol. 64, no. 11, pp. 3113–3128, 2022.
  • [46] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, pp. 1789–1819, 2021.
  • [47] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4320–4328.
  • [48] S. Zhang, C. Tang, and C. Guan, “Visual-to-EEG cross-modal knowledge distillation for continuous emotion recognition,” Pattern Recognition, vol. 130, p. 108833, 2022.
  • [49] S. You, C. Xu, C. Xu, and D. Tao, “Learning from multiple teacher networks,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1285–1294.
  • [50] H. Zhang, D. Chen, and C. Wang, “Confidence-aware multi-teacher knowledge distillation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 4498–4502.
  • [51] K. Kwon, H. Na, H. Lee, and N. S. Kim, “Adaptive knowledge distillation based on entropy,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7409–7413.
  • [52] Q. Guo, X. Wang, Y. Wu, Z. Yu, D. Liang, X. Hu, and P. Luo, “Online knowledge distillation via collaborative learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 020–11 029.
  • [53] C. Li, G. Li, H. Zhang, and D. Ji, “Embedded mutual learning: A novel online distillation method integrating diverse knowledge sources,” Applied Intelligence, vol. 53, no. 10, pp. 11 524–11 537, 2023.
  • [54] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
  • [55] G. E. Chatrian, E. Lettich, and P. L. Nelson, “Ten percent electrode system for topographic studies of spontaneous and evoked EEG activities,” American Journal of EEG Technology, vol. 25, no. 2, pp. 83–92, 1985.
  • [56] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
  • [57] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance, “EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces,” Journal of Neural Engineering, vol. 15, no. 5, p. 056013, 2018.
  • [58] Y. Ding, N. Robinson, S. Zhang, Q. Zeng, and C. Guan, “Tsception: Capturing temporal dynamics and spatial asymmetry from EEG for emotion recognition,” IEEE Transactions on Affective Computing, 2022.
  • [59] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6000–6010.
  • [60] X. Peng, Y. Wei, A. Deng, D. Wang, and D. Hu, “Balanced multimodal learning via on-the-fly gradient modulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 8238–8247.
  • [61] C. Yang, Z. An, H. Zhou, F. Zhuang, Y. Xu, and Q. Zhang, “Online knowledge distillation via mutual contrastive learning for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [62] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang, “Hadamard product for low-rank bilinear pooling,” in 5th International Conference on Learning Representations (ICLR), 2017.
  • [63] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” University of Toronto, 05 2012.
  • [64] J. Rao, X. Meng, L. Ding, S. Qi, X. Liu, M. Zhang, and D. Tao, “Parameter-efficient and student-friendly knowledge distillation,” IEEE Transactions on Multimedia, 2023.
  • [65] N. Passalis and A. Tefas, “Learning deep representations with probabilistic knowledge transfer,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 268–284.
  • [66] Y. Tian, D. Krishnan, and P. Isola, “Contrastive representation distillation,” arXiv preprint arXiv:1910.10699, 2019.
  • [67] W. Park, D. Kim, Y. Lu, and M. Cho, “Relational knowledge distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3967–3976.
Zixing Li received the B.E. degree in electronics engineering and the M.S. degree in electronic information from the National University of Defense Technology, Changsha, China, in 2020 and 2022, respectively. He is currently pursuing the Ph.D. degree in control science and engineering with the College of Intelligence Science and Technology, National University of Defense Technology. His research interests include brain-computer interfaces, EEG signal processing, and multi-modal learning.
Chao Yan received the B.E. degree in electrical engineering and automation from China University of Mining and Technology, Xuzhou, China, in 2017, and the M.S. and Ph.D. degrees in control science and engineering from the National University of Defense Technology, Changsha, China, in 2019 and 2023, respectively. He was a visiting Ph.D. student with the School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore, from 2021 to 2022. He is currently an Associate Professor with the College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China. His research interests include deep reinforcement learning and coordination control of UAV swarms.
Zhen Lan received the B.E. degree in automation from the School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, China, in 2019. She is currently pursuing the Ph.D. degree in control science and engineering with the College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China. Her research interests include brain-computer interface, multi-modal learning, and UAV control.
Dengqing Tang received the B.Eng., M.S., and Ph.D. degrees in control science and engineering from the National University of Defense Technology, Changsha, China, in 2013, 2016, and 2019, respectively. He is currently an Associate Professor with the College of Intelligence Science and Technology, National University of Defense Technology. His research interests include visual object detection, visual object pose estimation, and deep learning.
Xiaojia Xiang received the B.E., M.S., and Ph.D. degrees in control science and engineering from the National University of Defense Technology (NUDT), Changsha, China, in 2003, 2007, and 2016, respectively. He is currently a Professor with the College of Intelligence Science and Technology, NUDT. His research interests include mission planning and the autonomous and cooperative control of unmanned systems.
Han Zhou received the Ph.D. degree in control science and engineering from the National University of Defense Technology (NUDT), Changsha, China, in 2015. She is currently an Associate Professor with the College of Intelligence Science and Technology, NUDT. Her research interests include biomimetic robotics, collective intelligence, and learning control.
Jun Lai received the B.E. and Ph.D. degrees in instrument science and technology from the National University of Defense Technology (NUDT), Changsha, China, in 2013 and 2019, respectively. He is currently an Assistant Professor with the College of Intelligence Science and Technology, NUDT. His research interests include cooperative localization and cooperative mission planning of unmanned systems.