NeuroMoCo: A Neuromorphic Momentum Contrast Learning Method for Spiking Neural Networks

Yuqi Ma Huamin Wang Hangchi Shen Xuemei Chen Shukai Duan Shi** Wen
Abstract

Recently, brain-inspired spiking neural networks (SNNs) have attracted great research attention owing to their inherent bio-interpretability, event-triggered properties and powerful perception of spatiotemporal information, which is beneficial to handling event-based neuromorphic datasets. In contrast to conventional static image datasets, event-based neuromorphic datasets present heightened complexity in feature extraction due to their distinctive time series and sparsity characteristics, which influences their classification accuracy. To overcome this challenge, this paper introduces Neuromorphic Momentum Contrast Learning (NeuroMoCo), a novel approach that extends the benefits of self-supervised pre-training to SNNs to effectively stimulate their potential. This is the first time that self-supervised learning (SSL) based on momentum contrastive learning is realized in SNNs. In addition, we devise a novel loss function named MixInfoNCE, tailored to the temporal characteristics of neuromorphic data, to further increase classification accuracy on neuromorphic datasets; its benefit is verified through ablation experiments. Finally, experiments on DVS-CIFAR10, DVS128Gesture and N-Caltech101 show that NeuroMoCo establishes new state-of-the-art (SOTA) benchmarks: 83.6% (Spikformer-2-256), 98.62% (Spikformer-2-256), and 84.4% (SEW-ResNet-18), respectively.

Keywords: Spiking neural networks, Contrastive learning, Self-supervised pre-training, Neuromorphic datasets, Image classification

Affiliations:

[1] Southwest University, Chongqing 400715, China

[2] Chongqing Key Laboratory of Brain Inspired Computing and Intelligent Chips, Chongqing 400715, China

[3] Australian Institute of Artificial Intelligence, University of Technology Sydney, Sydney 2007, Australia

1 Introduction

Spiking neural networks (SNNs) have attracted a lot of research interest in recent years due to their bio-interpretability[40] and event-triggered properties. When SNNs are executed on neuromorphic chips, the computation skips the weight calculations corresponding to "0" spikes and only accumulates the weights corresponding to "1" spikes[27, 28]. This significantly reduces power consumption compared to artificial neural networks (ANNs). Therefore, SNNs hold more potential to replicate the efficiency advantages of the human brain. In particular, the unique spatio-temporal perceptual properties of SNNs[32] give them intrinsic potential in complex pattern recognition tasks oriented to neuromorphic datasets. However, training SNNs to reach comparable performance on neuromorphic datasets remains challenging due to the gap between the expressive ability of 0/1 spiking signals and that of the floating-point signals in ANNs. To this end, recent works have approached the problem from different perspectives, including designing residual connection structures for SNNs[11, 15, 31], introducing attention mechanisms into SNNs[39, 1], and designing spiking neurons with richer neuronal dynamics[12, 9]. Nevertheless, these efforts predominantly concentrate on architectural refinements and overlook potential solutions in the training method of SNNs.

Unsupervised representation learning has demonstrated notable advancements in ANNs, particularly in natural language processing (NLP), exemplified by GPT[2] and BERT[16]. This is because, compared with supervised learning, unsupervised learning can train on large-scale unlabeled data and learn the latent structure and patterns in the data, thus improving the expressive ability and generalization ability of the model. However, in computer vision (CV), supervised learning remains predominant due to the less discrete nature of the visual signal space compared to language tasks. Self-supervised learning (SSL), a variant of unsupervised learning, leverages inherent data properties as a form of supervisory signal for training. To effectively apply SSL to CV tasks, researchers have undertaken numerous investigations into image self-supervised representation learning, yielding remarkable outcomes. For instance, in some detection and segmentation tasks, MoCo[14] surpassed its supervised counterparts; SimCLR[5] further narrowed the gap between unsupervised and supervised pre-training; and DINO[3] introduced contrastive learning and knowledge distillation into SSL, resulting in significant performance enhancements.

In SNNs, except for SpikeGPT[44] in NLP tasks and MAE[13] as used in Spikformer V2[42], the vast majority of works adopt supervised training methods, which means that there is a gap in SSL methods for CV tasks in SNNs. Hence, it is a promising opportunity to leverage the advantages demonstrated by SSL in ANNs to enhance the potential of SNNs in addressing challenges associated with complex neuromorphic datasets (i.e. datasets collected by event-based dynamic vision sensors). The development of DVS dynamic image (neuromorphic data collected by dynamic vision sensor cameras) classification not only provides new insights and technical support for the advancement of intelligent perception systems[38] but also holds extensive application prospects in fields such as autonomous driving[4] and drone navigation[29].

In this paper, based on the MoCo paradigm[14, 6, 7], an SNN-oriented neuromorphic momentum contrast learning method (NeuroMoCo) is proposed to enhance the accuracy of DVS dynamic image classification. It can be used as a self-supervised pre-training framework for both spiking convolution and spiking Transformer structures. The data augmentation techniques in NDA[23] are improved to effectively enhance the diversity of positive and negative samples. At the same time, a new loss function named MixInfoNCE is designed around the temporal characteristics of neuromorphic data to increase classification accuracy on neuromorphic datasets. To validate NeuroMoCo, extensive experiments are conducted on the mainstream neuromorphic datasets DVS-CIFAR10, DVS128Gesture, and N-Caltech101. The experimental results show that: 1) on DVS-CIFAR10, DVS128Gesture, and N-Caltech101, when integrated with our NeuroMoCo framework, SEW-ResNet-18 attains 81.50%, 97.92%, and 84.35% classification accuracy and Spikformer-2-256 attains 83.60%, 98.62%, and 81.62%, respectively, surpassing the results achieved through training from random initialization; 2) the ablation experiments on the loss function demonstrate the viability and efficacy of the MixInfoNCE loss function; 3) compared with current leading methodologies, our NeuroMoCo approach establishes state-of-the-art (SOTA) benchmarks on DVS-CIFAR10, DVS128Gesture, and N-Caltech101. The contributions can be summarized as follows:

  1. We construct a dynamic dictionary, which integrates an automatically updating queue mechanism and an encoder based on momentum sliding average optimization. Based on this, we propose an SNN-oriented neuromorphic momentum contrast learning method (NeuroMoCo) to pretrain SNN models.

  2. We present a pre-processing method for neuromorphic datasets, and improve the data augmentation technique in NDA[23] to effectively enhance the diversity of positive and negative samples.

  3. According to the temporal characteristics of neuromorphic datasets, we design a new loss function named MixInfoNCE, which is verified through ablation experiments.

  4. We conduct experiments on DVS-CIFAR10, DVS128Gesture, and N-Caltech101 to test the advances of NeuroMoCo. Compared with current leading methodologies, our NeuroMoCo approach establishes state-of-the-art (SOTA) benchmarks on each dataset: 83.6% (Spikformer-2-256), 98.62% (Spikformer-2-256) and 84.4% (SEW-ResNet-18), respectively.

The remainder of this article is structured as follows: Section 2 provides a concise overview of relevant prior work. In Section 3, we present the proposed method with details. Section 4 delineates the experimental methodologies employed to assess the efficacy of our proposed method and the associated loss function, and presents the ensuing experimental results. Finally, in Section 5, we draw conclusions grounded in our research findings.

2 Related Work

2.1 Spiking Neural Networks (SNNs)

SNNs are widely regarded as the third generation of neural networks, succeeding the McCulloch-Pitts perceptron and ANNs, primarily owing to their inherent biological plausibility[34]. Unlike traditional ANNs, which rely on continuous floating-point numbers to encode information, neurons in SNNs (LIF neurons[35], PLIF neurons[12], etc.) encode continuous input signals into binary spike sequences (0/1) through neuronal firing, so that SNNs use discrete and sparse spike sequences to represent information. Furthermore, SNNs introduce a temporal dimension, in accordance with the spatiotemporal characteristics of biological organisms. The prevailing direct training approach[36] is adopted to construct SNNs in this study, as our task necessitates direct pre-training and fine-tuning of the SNN model. While numerous effective models and techniques have been successfully used in SNNs[33, 24, 18], the domain of SSL within SNNs remains relatively underexplored. Our objective is to investigate the viability and efficacy of transferring self-supervised pre-training methodologies to CV tasks in SNNs, in order to enhance the representational capabilities of SNNs on intricate neuromorphic datasets.

2.2 Self-Supervised Learning (SSL)

SSL has demonstrated its effectiveness in CV and NLP tasks in ANNs. This learning paradigm encompasses two primary branches: masked autoencoder[13] and contrastive learning[21]. The former reconstructs masked input to learn feature representations, while the latter enhances generalization and performance by comparing similarities and differences among sample pairs. Recent advancements include MoCo series[14, 6, 7], which introduce contrastive learning to computer vision using a momentum update strategy, and DINO[3], which leverages a Transformer architecture and self-supervised self-representation learning to maximize mutual information in image feature representations, enriching feature representation. Here, we propose to explore self-supervised momentum contrastive learning in SNNs for the first time.

3 Method

For neuromorphic datasets, a self-supervised pre-training framework named NeuroMoCo is specifically designed for SNNs. It accommodates both spiking convolution and spiking Transformer architectures, thereby enhancing the expressive ability and generalization ability of SNN models. The details of NeuroMoCo are described as follows.

3.1 Overall Architecture

We present the details of NeuroMoCo following the "from pre-training to fine-tuning" paradigm. In the pre-training phase (Figure 1, left), the input $x_q$ of the master encoder (M-Encoder) and the input $x_k$ of the subordinate encoder (S-Encoder) are obtained by applying two randomly chosen, different data augmentation methods to one sample drawn from the neuromorphic dataset. It is important to note that the neuromorphic dataset inherently possesses a temporal dimension, which we represent as T. The neuromorphic dataset used for sampling is obtained by preprocessing the original neuromorphic data, and the specific preprocessing method is described in Section 3.2.

Figure 1: Overview of NeuroMoCo and subsequent fine-tuning. NeuroMoCo includes an automatically updating queue and the S-Encoder based on momentum sliding average optimization. $x_q$ and $x_k$ are obtained by applying two randomly chosen, different data augmentation methods to one sample drawn from the neuromorphic dataset. T represents the time dimension of DVS data.

Given the input $x_q$, the M-Encoder encodes it into a contextualized vector representation q, while the S-Encoder, with the same structure and initialization as the M-Encoder, encodes the input $x_k$ into a contextualized vector representation k. Since k originates from the same sample as q, it serves as the positive sample of q. Contrastive learning also requires assessing the similarity between positive and negative samples, and acquiring good features relies on abundant negative samples. To this end, we construct a pool of negative samples for q using a dynamic queue, denoted as {k0, k1, k2,…}. We first initialize an empty queue of fixed size. At the beginning of training, the queue lacks sufficient negative samples for comparison with the positive samples; hence, we commence with random negative pairs.

Subsequently, after each sample is processed, its feature representation is appended to the queue, while the earliest feature representation is evicted from the queue, thereby maintaining a constant queue size. This dynamic update is motivated by the fact that the earliest feature representation in the queue is the most outdated and the least consistent with the feature representations produced by the latest encoder. Consequently, as training proceeds, the feature representations in the queue are continually refreshed, constituting a feature pool with a fixed number of negative samples. In implementation, the process operates in batches, i.e., one batch of samples is processed at a time. After the negative sample pool is obtained, the positive sample k and the dynamic queue of negative samples are concatenated to form the positive and negative sample space of q, denoted as K={k, k0, k1, k2,…}. It is important to emphasize that q and the samples within K remain temporal samples, with T still representing their time dimension.
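The queue mechanism described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the class name `FeatureQueue`, its methods, and the string placeholders standing in for feature vectors are all hypothetical, and the queue here starts empty rather than with random negatives.

```python
from collections import deque


class FeatureQueue:
    """Fixed-size FIFO pool of negative-sample features (hypothetical helper)."""

    def __init__(self, max_size):
        # A deque with maxlen drops its oldest entry automatically on append.
        self.features = deque(maxlen=max_size)

    def enqueue_batch(self, batch_keys):
        # Append the newest key features; the oldest fall out once the queue is full.
        for key in batch_keys:
            self.features.append(key)

    def negatives(self):
        # Oldest-to-newest list of the features currently stored.
        return list(self.features)


queue = FeatureQueue(max_size=4)
queue.enqueue_batch(["k0", "k1", "k2"])
queue.enqueue_batch(["k3", "k4"])          # the oldest entry "k0" is evicted
sample_space = ["k"] + queue.negatives()   # positive first, then the negatives
```

Placing the positive sample first in `sample_space` mirrors the layout K={k, k0, k1, k2,…}, so the positive always sits at index 0.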

On this basis, the contrastive loss function is used to represent the difference between q and its positive and negative samples; its value is small when q is similar to its positive sample and dissimilar to the negative samples (see Section 3.4 for the contrastive loss function). The contrastive loss is then gradually reduced through gradient descent and backpropagation, which updates the parameters of the M-Encoder. For the S-Encoder, we adopt a momentum update strategy, because the negative samples processed by the S-Encoder come from several previous mini-batches. The strategy can be formulated as follows:

$$\theta_k = m\,\theta_k + (1-m)\,\theta_q \tag{1}$$

where $\theta_q$ and $\theta_k$ denote the parameters of the M-Encoder and S-Encoder, respectively, and $m \in [0,1)$ signifies the momentum update coefficient. The momentum update enables the S-Encoder to maintain the direction consistency of parameter updating, thereby leveraging historical information more effectively and acquiring a more discriminative feature representation.
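As a minimal sketch of Eq. (1), the update can be written over flat lists of parameters. The function name `momentum_update` and the list-of-floats representation are assumptions made for illustration; a real encoder would apply the same rule to every parameter tensor.

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """Return updated S-Encoder parameters: m * theta_k + (1 - m) * theta_q."""
    if not 0.0 <= m < 1.0:
        raise ValueError("m must lie in [0, 1)")
    return [m * pk + (1.0 - m) * pq for pk, pq in zip(theta_k, theta_q)]


theta_q = [1.0, -2.0]   # M-Encoder parameters (updated by backpropagation)
theta_k = [0.0, 0.0]    # S-Encoder parameters (updated only by momentum)
theta_k = momentum_update(theta_k, theta_q, m=0.9)
```

With m close to 1, as in the value 0.999 listed in Table 1, each step moves the S-Encoder only slightly toward the M-Encoder, which keeps the keys stored in the queue comparable across mini-batches.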

After pre-training with our NeuroMoCo method, the S-Encoder replaces the Head network behind Backbone network with a classification Head network in the fine-tuning stage (figure 1, right), so as to conduct subsequent training and testing of classification tasks on specific neuromorphic datasets. It should be noted that the treatment of T when calculating the loss in this stage follows the MAC paradigm consistent with the contrastive loss function, as detailed in 3.4.

3.2 Neuromorphic Data Preprocess

Figure 2: Collection and preprocessing of neuromorphic data. The DVS camera collects sparse event data, which are integrated into time frames and stored in a large multi-dimensional tensor according to time series.

In this paper, the neuromorphic datasets we use are all event-based datasets collected by Dynamic Vision Sensor (DVS) cameras, which we uniformly refer to as DVS data. Taking a chair image in the N-Caltech101 dataset as an example, the collection process of DVS data is shown in Figure 2 (a): the static RGB image is moved along a certain trajectory, and the DVS camera captures this process and outputs the event stream data. When a DVS camera is used for data collection, each pixel operates independently and asynchronously. When a pixel detects a change, it enters a positive active state; otherwise, it remains in a negative silent state. Consequently, the resulting event stream data (as shown in Figure 2 (b)) is time-sequential and sparse. Because the event stream is so sparse, direct feature extraction becomes exceedingly challenging, necessitating preprocessing of the neuromorphic data.

As mentioned above, the event stream data output by the DVS camera is a time series. Therefore, we first divide the sparse event stream data into time windows according to the time sequence (in Figure 2 (b), it is divided into four time windows T1, T2, T3, T4). The events are then grouped by time window: for each time window, we find all the events that fall within it and group them into a list. The list of events in each time window is then integrated into the corresponding time frame (Figure 2 (c)), which is implemented by counting the number, spatial distribution and polarity of the events in the time window. Events of different polarity are stored in different channels of the time frame. Finally, we store the integrated time frames of all time windows in a large multi-dimensional tensor according to the time sequence to obtain the preprocessed neuromorphic data (Figure 2 (d)).
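The preprocessing steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline used in the experiments: events are assumed to be `(t, x, y, polarity)` tuples with polarity in {0, 1}, the time windows are assumed to have equal duration, and the function name `events_to_frames` and its `duration` parameter are hypothetical.

```python
def events_to_frames(events, num_windows, height, width, duration):
    """Integrate (t, x, y, polarity) events into a [T][2][H][W] count tensor."""
    frames = [[[[0] * width for _ in range(height)] for _ in range(2)]
              for _ in range(num_windows)]
    for t, x, y, p in events:
        # Index of the time window this event falls into; an event at
        # t == duration is assigned to the last window.
        w = min(int(t * num_windows / duration), num_windows - 1)
        # One channel per polarity; each cell counts events at that pixel.
        frames[w][p][y][x] += 1
    return frames


events = [(0.0, 0, 0, 1), (0.1, 0, 0, 1), (0.6, 1, 1, 0), (1.0, 1, 0, 1)]
frames = events_to_frames(events, num_windows=4, height=2, width=2, duration=1.0)
```

Here `frames[w][p][y][x]` holds the number of events of polarity `p` at pixel `(x, y)` in window `w`, so the first axis plays the role of the time dimension T.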

3.3 Backbone Network

To bolster the credibility of our final conclusion, we opt for a widely recognized convolutional architecture and a widely recognized Transformer architecture as backbones when pre-training with the proposed NeuroMoCo.

Figure 3: Overview of Spike-Element-Wise block. We use ADD as the element-wise function (g) and substitute the original PLIF neurons with more versatile LIF neurons.

Specifically, for the convolutional architecture, following SEW-ResNet[11], we construct the SEW-ResNet-18 model as the Backbone using Spike-Element-Wise blocks (Figure 3 (a)). As depicted in Figure 3 (b), within the Spike-Element-Wise blocks we directly employ ADD as the element-wise function (g). Additionally, to mitigate neuron-specific effects and promote universality, we substitute the original PLIF neurons with more versatile LIF neurons. The Backbone of the convolutional architecture is established on this foundation.
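To make the two ingredients of the block concrete, the sketch below simulates a single LIF neuron and applies ADD as the element-wise function g. It is a simplified illustration: the LIF update rule, the hard reset, and the values of `tau`, `v_threshold` and `v_reset` are common choices assumed here rather than taken from this paper, and the convolution and normalization layers of the residual path are replaced by a constant input current.

```python
def lif_spikes(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one LIF neuron over time and return its 0/1 spike train."""
    v = v_reset
    spikes = []
    for x in inputs:
        # Leaky integration: the membrane potential decays toward v_reset
        # while being charged by the input current x.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)
            v = v_reset  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


def sew_add(residual_spikes, identity_spikes):
    """Element-wise function g = ADD applied to two spike trains."""
    return [a + s for a, s in zip(residual_spikes, identity_spikes)]


identity = [1, 0, 1, 0]                       # spikes arriving on the shortcut
residual = lif_spikes([1.5, 1.5, 1.5, 1.5])   # spikes from the residual path
output = sew_add(residual, identity)
```

Note that with ADD the block output is no longer strictly binary: when both paths fire at the same step, the sum is 2.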

For the Backbone of the Transformer architecture, we construct the Spikformer-2-256 model, building upon Spikformer[43]. The name indicates that two Spikformer encoder blocks are included and that the feature embedding dimension is 256. Furthermore, the patch size is set to 16 × 16, and the number of heads in the Spiking Self-Attention (SSA) module is uniformly configured to 16. Note that, given the nature of neuromorphic data, the input comprises two channels representing the positive and negative polarities of the data.

3.4 Contrastive Loss

The contrastive loss function measures the similarity of pairs of samples in a representation space: its value should be smaller when the positive pairs exhibit higher similarity and the negative pairs display lower similarity. InfoNCE[26] is a mainstream form of contrastive loss function. It obtains the similarity matrix through a dot product operation, so as to measure the similarity between sample pairs. Based on InfoNCE and the unique time series characteristics of neuromorphic data, a contrastive loss function named MixInfoNCE is designed in this paper.

Figure 4: The principle diagram of MixInfoNCE. Following InfoNCE, we obtain the similarity matrix with time dimension T. For T, our MixInfoNCE adopts the strategy of MBC&MAC mixed paradigm.

As shown in Figure 4 (a), q(T,N,C) and k(T,N,C) are the vector representations output by M-Encoder and S-Encoder, respectively. k is the positive sample of q. queue(T,L,C) is a negative samples queue of q composed of k stored in history, where T represents the unique time dimension of neuromorphic data, N represents the number of samples processed in batches, C represents the number of vector-encoded channels, and L represents the fixed length of the queue. The overarching objective is to render positive sample pairs entirely similar and negative sample pairs entirely dissimilar. In constructing the labels, positions corresponding to the positive similarity matrix are set to 1, while positions corresponding to the negative similarity matrix are set to 0. Subsequently, the cross-entropy Loss (CE Loss) between the similarity matrix and the label is computed. This process can be formalized as follows:

$$\mathcal{L}_{\text{q-InfoNCE}} = l_{CE}(SimilarityMatrix, y_{gt}) = -\log\frac{\exp(q\cdot k/\tau)}{\exp(q\cdot k/\tau)+\sum_{i=0}^{K-1}\exp(q\cdot k_i/\tau)} \tag{2}$$

where $\tau$ denotes the temperature coefficient hyperparameter and $y_{gt}$ represents the true label.
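For a single query without a time dimension, Eq. (2) can be sketched as below. The function name `info_nce`, the list-based vectors, and the default `tau=0.07` are illustrative assumptions; the temperature used in the experiments is not stated in this section.

```python
import math


def info_nce(q, k_pos, negatives, tau=0.07):
    """InfoNCE loss for one query: positive logit at index 0, then negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(q, k_pos) / tau] + [dot(q, k_i) / tau for k_i in negatives]
    # Numerically stable log-sum-exp over all logits (the denominator of Eq. 2).
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
    # Equals -log(softmax(logits)[0]), i.e. cross-entropy against label 0.
    return log_denominator - logits[0]


q = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce(q, [1.0, 0.0], negatives)      # positive matches q
loss_misaligned = info_nce(q, [0.0, 1.0], negatives)   # positive orthogonal to q
```

The loss is close to zero when the positive key aligns with q and grows when it does not, which is the behavior the contrastive objective relies on.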

However, the conventional InfoNCE does not inherently incorporate the time dimension in its computation. Thus, it becomes necessary to handle the time dimension specific to neuromorphic data separately. In SNNs, the loss function typically follows the "mean before criterion" (MBC) paradigm, as illustrated in Figure 4 (b); that is, the mean operation is used to eliminate the time dimension before calculating the difference between the prediction result and the label. It can be expressed as follows:

$$\textit{loss}_{MBC} = l_{CE}(mean(f_{SNN}(x(t))), y_{gt}) \tag{3}$$

where $x(t)$ represents the time frame input of the SNN, $f_{SNN}(\cdot)$ signifies the SNN encoding operation, $mean(\cdot)$ denotes averaging across the time dimension, and $l_{CE}(\cdot)$ represents the computation of the cross-entropy loss. Moreover, drawing inspiration from [10], we introduce a novel paradigm for the loss function termed "mean after criterion" (MAC), as depicted in Figure 4 (c). In this paradigm, the time dimension is integrated into the calculation of the disparity between the predicted result and the label. It can be expressed as follows:

$$\textit{loss}_{MAC} = mean(l_{CE}(f_{SNN}(x(t)), y_{gt})) \tag{4}$$

Finally, we aim to achieve the interaction of local and global information from the time dimension for superior performance. To this end, we devised the MixInfoNCE, formulated as follows:

$$\begin{aligned}\mathcal{L}_{\text{MixInfoNCE}} &= \alpha\,\textit{loss}_{MBC} + \beta\,\textit{loss}_{MAC} \\ &= \alpha\, l_{CE}(mean(SimilarityMatrix), y_{gt}) + \beta\, mean(l_{CE}(SimilarityMatrix, y_{gt}))\end{aligned} \tag{5}$$

where $\alpha$ and $\beta$ are hyperparameters, with $\alpha + \beta = 1$. In the experimental section, we assess its effectiveness through ablation experiments.
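A minimal sketch of Eq. (5) is given below to make the difference between the two paradigms explicit. It assumes the similarity matrix is stored as nested lists `similarity[t][n][j]` that have already been divided by the temperature, that column 0 holds the positive pair (so the label is 0), and that losses are averaged over the batch; the function names are hypothetical.

```python
import math


def cross_entropy_pos0(logits_row):
    """Cross-entropy of one logit row against label 0 (the positive column)."""
    peak = max(logits_row)
    lse = peak + math.log(sum(math.exp(v - peak) for v in logits_row))
    return lse - logits_row[0]


def mix_info_nce(similarity, alpha=0.5, beta=0.5):
    """similarity[t][n][j]: T time steps, N queries, 1 + L candidates."""
    T, N, J = len(similarity), len(similarity[0]), len(similarity[0][0])
    # MBC: average the similarity over time first, then apply the criterion.
    mean_sim = [[sum(similarity[t][n][j] for t in range(T)) / T for j in range(J)]
                for n in range(N)]
    loss_mbc = sum(cross_entropy_pos0(row) for row in mean_sim) / N
    # MAC: apply the criterion at every time step, then average over time.
    loss_mac = sum(cross_entropy_pos0(similarity[t][n])
                   for t in range(T) for n in range(N)) / (T * N)
    return alpha * loss_mbc + beta * loss_mac, loss_mbc, loss_mac


# T = 2 time steps, N = 1 query, one positive and one negative candidate.
similarity = [[[2.0, 0.0]], [[0.0, 2.0]]]
total, loss_mbc, loss_mac = mix_info_nce(similarity)
```

In this toy input the two time steps disagree, so averaging first hides the disagreement (the MBC term equals log 2), while the MAC term penalizes the time step where the negative wins. Because cross-entropy is convex in the logits, the MAC term is never smaller than the MBC term.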

4 Experiments and Analysis

Experiments and analysis are given in this section to verify the above methodology. The section is divided into two parts: experimental details and experimental results.

4.1 Experimental Details

In this part, we will elaborate on the experimental methods and details, focusing on aspects such as datasets, experimental settings, and more.

4.1.1 Datasets

This paper targets event-based neuromorphic datasets, hence we select three commonly utilized DVS datasets: DVS-CIFAR10, DVS128Gesture, and N-Caltech101.

DVS-CIFAR10 is a neuromorphic dataset based on CIFAR10, wherein pulse events of image samples are captured utilizing a DVS camera. This dataset comprises 9,000 training samples and 1,000 testing samples.

DVS128Gesture is a gesture recognition dataset collected by a DVS128 dynamic vision sensor, covering 11 distinct gesture categories performed by 29 participants across 3 varying lighting conditions.

Akin to DVS-CIFAR10, the N-Caltech101 dataset is an extension of the Caltech101 static image dataset. Caltech101, renowned as a classic image classification dataset, encompasses image samples across 101 object categories, with each category containing approximately 50 to 800 image samples.

These datasets are captured by dynamic vision sensor cameras and exhibit characteristics such as time sequence and high sparsity.

4.1.2 DVS Data Augmentation

During pre-training with our NeuroMoCo method, data augmentation on the DVS data is necessary to construct positive-negative sample pairs. It has been demonstrated in NDA that pixel value-based augmentation is not suitable for DVS data. Consequently, our DVS data augmentation method relies on geometric-based augmentation.

In general, we build upon the DVS data augmentation method employed by Spikformer[43] and make specific enhancements to suit our requirements. In particular, the resolution of the data in N-Caltech101 differs from that of the other datasets, and a consistent resolution across all data is essential for comparing positive and negative sample pairs; therefore, we employ Resize to augment the data of N-Caltech101. Additionally, we introduce vertical shear transformation (ShearY) and random horizontal flip to further enrich the data augmentation strategy.
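The sketch below illustrates how two geometric views of one DVS sample could be generated with a horizontal flip and a vertical shear. It is an illustration only: the nearest-neighbour shear convention, the `max_shear` range, the flip probability of 0.5, and the choice to share one set of random parameters across all time frames and polarity channels of a sample are assumptions, and all function names are hypothetical.

```python
import random


def horizontal_flip(frame):
    """Mirror an H x W frame left to right."""
    return [row[::-1] for row in frame]


def shear_y(frame, shear):
    """Vertical shear: output (x, y) reads input (x, y + round(shear * x))."""
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src = y + round(shear * x)
            if 0 <= src < h:          # pixels shifted in from outside stay 0
                out[y][x] = frame[src][x]
    return out


def random_view(sample, rng, max_shear=0.3):
    """sample[t][c] is an H x W frame; one parameter draw is shared by all t, c."""
    flip = rng.random() < 0.5
    shear = rng.uniform(-max_shear, max_shear)
    view = []
    for channels in sample:
        new_channels = []
        for frame in channels:
            frame = horizontal_flip(frame) if flip else frame
            new_channels.append(shear_y(frame, shear))
        view.append(new_channels)
    return view


rng = random.Random(0)
sample = [[[[1, 0], [0, 0]], [[0, 0], [0, 1]]]]   # T = 1, C = 2, H = W = 2
x_q, x_k = random_view(sample, rng), random_view(sample, rng)
```

Both transforms only move event counts between pixels and never alter their values, which is in line with the restriction to geometric augmentation.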

Table 1: Parameters of Pre-Training and Fine-Tuning Phase
Phase | Time step | Momentum update (m) | Batch size | Epoch | Learning rate
Pre-Train | 16 | 0.999 | 32 | 200 | 0.03
Fine-Tune | 16 | - | 16 | 100 | 0.001

4.1.3 Pre-Training and Fine-Tuning Setup

We utilize NeuroMoCo to pretrain the Backbone networks on the synthetic neuromorphic dataset, which is composed of DVS-CIFAR10, N-Caltech101, and DVS128Gesture. It is important to note that our self-supervised pre-training does not rely on labels. Subsequently, we append a classification head to the Backbone and conduct supervised fine-tuning and testing on DVS-CIFAR10, N-Caltech101, and DVS128Gesture, respectively. During pre-training, owing to task requirements, we only retain two random views obtained from data augmentation, while in the fine-tuning phase, all augmented views are preserved. Throughout our experiments, we maintain a uniform resolution of 128×128. The hyperparameter settings during the experiment are detailed in Table 1.

For pre-training, we choose Stochastic Gradient Descent (SGD) with weight decay of 1e-4 and momentum of 0.9 for optimization, and adopt MultiStepLR strategy for learning rate scheduling. In the fine-tuning stage, we use AdamW with a weight decay of 0.06 for optimization, and the learning rate first warms up for 30 epochs and then decays following the CosineAnnealingLR strategy.
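The fine-tuning schedule can be sketched as a function of the epoch index. The base learning rate of 0.001 and the 100 epochs come from Table 1 and the 30 warm-up epochs from the text above; the linear shape of the warm-up, the decay to a minimum learning rate of zero, and the function name `fine_tune_lr` are assumptions made for illustration.

```python
import math


def fine_tune_lr(epoch, base_lr=0.001, warmup_epochs=30, total_epochs=100):
    """Learning rate at a given epoch: linear warm-up, then cosine annealing."""
    if epoch < warmup_epochs:
        # Assumed linear warm-up that reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing from base_lr toward 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


schedule = [fine_tune_lr(epoch) for epoch in range(100)]
```

The learning rate peaks at the end of the warm-up and then decreases monotonically for the remaining 70 epochs.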

4.2 Experimental Results

In this subsection, we commence by conducting ablation experiments to probe the effectiveness of the designed loss function and the proposed NeuroMoCo. Subsequently, the model performance is evaluated on DVS-CIFAR10, DVS128Gesture as well as N-Caltech101, and compared with some SOTA works in related fields to highlight the effect of our NeuroMoCo method.

4.2.1 Ablation Experiment

The key improvement of MixInfoNCE lies in modifying the paradigm used by the original InfoNCE when computing the cross-entropy loss. To ascertain its effectiveness and advantages, we carry out extensive ablation experiments on the loss function across three neuromorphic datasets using models of two architectures: spiking convolution and spiking Transformer. To maintain consistency with subsequent experiments, during the ablation experiment we directly use the SEW-ResNet-18 and Spikformer-2-256 models employed in the fine-tuning stage, and the time step is also uniformly set to 16. Firstly, we use the CE loss of the MBC paradigm as the loss function ($\textit{loss}_{MBC}$), and conduct direct training and testing on DVS-CIFAR10, DVS128Gesture and N-Caltech101, respectively, so as to obtain a set of baselines for the ablation experiments. Then, under the same experimental setup, we replace the loss function with the CE loss of the MBC and MAC mixed paradigm ($\alpha\,\textit{loss}_{MBC} + \beta\,\textit{loss}_{MAC}$; this paper only considers the case $\alpha = \beta$). The same experiments are performed again on each corresponding dataset.

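Table 2: Comparison of MBC loss with MBC&MAC mixed loss
Models            Datasets       Mixed Loss  Acc (%)
SEW-ResNet-18     DVS-CIFAR10    No          78.00
SEW-ResNet-18     DVS-CIFAR10    Yes         81.00
SEW-ResNet-18     DVS128Gesture  No          95.83
SEW-ResNet-18     DVS128Gesture  Yes         96.18
SEW-ResNet-18     N-Caltech101   No          80.52
SEW-ResNet-18     N-Caltech101   Yes         80.74
Spikformer-2-256  DVS-CIFAR10    No          80.90
Spikformer-2-256  DVS-CIFAR10    Yes         81.60
Spikformer-2-256  DVS128Gesture  No          96.87
Spikformer-2-256  DVS128Gesture  Yes         97.57
Spikformer-2-256  N-Caltech101   No          79.86
Spikformer-2-256  N-Caltech101   Yes         80.53
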
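Table 3: Ablation study results on NeuroMoCo. * denotes self-implementation results by [43].
Models            Datasets       NeuroMoCo  Acc (%)
SEW-ResNet-18     DVS-CIFAR10    No         78.00
SEW-ResNet-18     DVS-CIFAR10    Yes        81.50
SEW-ResNet-18     DVS128Gesture  No         95.83
SEW-ResNet-18     DVS128Gesture  Yes        97.92
SEW-ResNet-18     N-Caltech101   No         80.52
SEW-ResNet-18     N-Caltech101   Yes        84.35
Spikformer-2-256  DVS-CIFAR10    No         80.90
Spikformer-2-256  DVS-CIFAR10    Yes        83.60
Spikformer-2-256  DVS128Gesture  No         96.87*
Spikformer-2-256  DVS128Gesture  Yes        98.62
Spikformer-2-256  N-Caltech101   No         79.86
Spikformer-2-256  N-Caltech101   Yes        81.62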

The experimental results in Table 2 indicate that, with the CE loss of the mixed MBC and MAC paradigm, SEW-ResNet-18 and Spikformer-2-256 both perform better on DVS-CIFAR10, DVS128Gesture and N-Caltech101 than with the CE loss of the MBC paradigm alone. The gains range from 0.22 percentage points (SEW-ResNet-18 on N-Caltech101) to 3.00 percentage points (SEW-ResNet-18 on DVS-CIFAR10).

Likewise, in the ablation experiments on NeuroMoCo, we first establish a set of baselines by training SEW-ResNet-18 and Spikformer-2-256 without NeuroMoCo on all three datasets. We then compare each model with and without NeuroMoCo on the three neuromorphic datasets. The experimental results are summarized in Table 3.

Table 4: Comparison of performance between our NeuroMoCo and current state-of-the-art (SOTA) methods on neuromorphic datasets.
                                  DVS-CIFAR10     DVS128Gesture   N-Caltech101
Methods                   Spikes  T Step  Acc     T Step  Acc     T Step  Acc
LIAF-Net [37]                     10      70.4    60      97.6    -       -
TA-SNN [39]                       10      72.0    60      98.6    -       -
ECSNet [8]                        -       72.7    -       98.6    -       69.3
Rollout [20]                      48      66.8    240     97.2    -       -
DECOLLE [17]                      -       -       500     95.5    -       -
tdBN [41]                         10      67.8    40      96.9    -       -
PLIF [12]                         20      74.8    20      97.6    -       -
SEW-ResNet [11]                   16      74.4    16      97.9    -       -
Dspike [22]                       10      75.4    -       -       -       -
SALT [19]                         20      67.1    -       -       -       -
DSR [25]                          10      77.3    -       -       -       -
mMND [30]                         -       -       -       98.0    -       71.2
Spikformer [43]                   16      80.9    16      98.3    -       -
SEW-ResNet-18 (ours)              16      81.5    16      97.9    16      84.4
Spikformer-2-256 (ours)           16      83.6    16      98.62   16      81.6

The experimental results show that both SEW-ResNet-18 and Spikformer-2-256 perform better on DVS-CIFAR10, DVS128Gesture, and N-Caltech101 with NeuroMoCo than without it. This indicates that our proposed NeuroMoCo method is effective for pre-training SNNs and is compatible with both network structures.
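
As a quick check of the size of these improvements, the per-model gains can be computed directly from the accuracies in Table 3. The snippet below is only illustrative arithmetic; the variable names are hypothetical and the numbers are copied from the table.

```python
# Accuracies (%) copied from Table 3: (without NeuroMoCo, with NeuroMoCo).
table3 = {
    ("SEW-ResNet-18", "DVS-CIFAR10"): (78.00, 81.50),
    ("SEW-ResNet-18", "DVS128Gesture"): (95.83, 97.92),
    ("SEW-ResNet-18", "N-Caltech101"): (80.52, 84.35),
    ("Spikformer-2-256", "DVS-CIFAR10"): (80.90, 83.60),
    ("Spikformer-2-256", "DVS128Gesture"): (96.87, 98.62),
    ("Spikformer-2-256", "N-Caltech101"): (79.86, 81.62),
}

# Gain in percentage points for each (model, dataset) pair.
gains = {key: round(after - before, 2) for key, (before, after) in table3.items()}

for (model, dataset), gain in gains.items():
    print(f"{model:<18}{dataset:<15}+{gain:.2f}")
```

The gains range from 1.75 to 3.83 percentage points, with the largest obtained by SEW-ResNet-18 on N-Caltech101.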

4.2.2 Comparative Experiment

In this section, we undertake a series of comparative experiments with the aim of validating the performance of our approach and accentuating its advantages.

The performance of the proposed NeuroMoCo is evaluated on DVS-CIFAR10, DVS128Gesture, and N-Caltech101 using the SEW-ResNet-18 and Spikformer-2-256 models, and compared with a variety of SOTA methods. The classification performance of our NeuroMoCo and current SOTA methods is presented in Table 4, which shows that our NeuroMoCo obtains the best accuracy on all three datasets.

Specifically, on DVS-CIFAR10, our SEW-ResNet-18 and Spikformer-2-256 models achieve classification accuracies of 81.5% and 83.6%, respectively, using 16 time steps, surpassing the 78.0% and 80.9% achieved by the same models when randomly initialized (see Table 3). Furthermore, compared with the previous state-of-the-art Spikformer (80.9%), our Spikformer-2-256 improves accuracy by 2.7 percentage points while employing the same model and time step. It is worth noting that we also outperform TET (83.2%), which is omitted from Table 4 because it is a loss-based method rather than a network architecture. These results indicate that we achieve state-of-the-art (SOTA) performance on this dataset.

For DVS128Gesture, our SEW-ResNet-18 and Spikformer-2-256 models use 16 time steps to achieve classification accuracies of 97.9% and 98.62%, respectively, which is better than the 95.83% and 96.87% obtained with random initialization of the same models (see Table 3). In addition, compared with the previous state-of-the-art TA-SNN model (60 time steps, 98.6%), our Spikformer-2-256 uses fewer time steps (16) to achieve a slightly higher classification accuracy (98.62%), which is also the current best performance.

Finally, on N-Caltech101, again using 16 time steps, we achieve classification accuracies of 84.4% and 81.6% for SEW-ResNet-18 and Spikformer-2-256, respectively, outperforming the 80.52% and 79.86% achieved by the same models with random initialization (see Table 3). Notably, SEW-ResNet-18 with our NeuroMoCo method (84.4%) improves on the previously known advanced works ECSNet (69.3%) and mMND (71.2%) by 15.1 and 13.2 percentage points, respectively. To the best of our knowledge, this also represents the current SOTA.
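
The margins quoted in this subsection can be reproduced from Table 4 with a few lines of Python. This is illustrative arithmetic only: the dictionary layout and names are hypothetical, the accuracies are copied from Table 4, and TET is left out because it does not appear in the table.

```python
# Prior accuracies (%) copied from Table 4; methods without an entry are omitted.
prior = {
    "DVS-CIFAR10": {
        "LIAF-Net": 70.4, "TA-SNN": 72.0, "ECSNet": 72.7, "Rollout": 66.8,
        "tdBN": 67.8, "PLIF": 74.8, "SEW-ResNet": 74.4, "Dspike": 75.4,
        "SALT": 67.1, "DSR": 77.3, "Spikformer": 80.9,
    },
    "DVS128Gesture": {
        "LIAF-Net": 97.6, "TA-SNN": 98.6, "ECSNet": 98.6, "Rollout": 97.2,
        "DECOLLE": 95.5, "tdBN": 96.9, "PLIF": 97.6, "SEW-ResNet": 97.9,
        "mMND": 98.0, "Spikformer": 98.3,
    },
    "N-Caltech101": {"ECSNet": 69.3, "mMND": 71.2},
}

# Best NeuroMoCo result per dataset, also from Table 4.
ours = {"DVS-CIFAR10": 83.6, "DVS128Gesture": 98.62, "N-Caltech101": 84.4}

# Margin (percentage points) over the best prior entry for each dataset.
margins = {
    dataset: round(ours[dataset] - max(results.values()), 2)
    for dataset, results in prior.items()
}

for dataset, margin in margins.items():
    print(f"{dataset:<15}+{margin:.2f}")
```

The margin over the best prior entry is 2.7 points on DVS-CIFAR10 and 13.2 points on N-Caltech101, while on DVS128Gesture it is 0.02 points, obtained with 16 rather than 60 time steps.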

Our models consistently demonstrate performance advantages across all three neuromorphic datasets, underscoring the generality of our proposed NeuroMoCo method for neuromorphic data. This holds significant implications for numerous practical applications.

5 Conclusion

In this paper, an SNN-oriented learning method, NeuroMoCo, has been introduced to increase performance on complex neuromorphic datasets; it can be used as a self-supervised pre-training framework for spiking convolution and spiking Transformer structures. This is the first instance of applying SSL based on momentum contrastive learning to SNNs. To further improve the classification accuracy, we have designed a new loss function named MixInfoNCE based on the temporal characteristics of neuromorphic datasets. The effectiveness of MixInfoNCE and NeuroMoCo has been validated by extensive ablation experiments. After pre-training with NeuroMoCo, Spikformer-2-256 has achieved SOTA performance on DVS-CIFAR10 (83.6%) and DVS128Gesture (98.62%), and SEW-ResNet-18 has achieved SOTA performance on N-Caltech101 (84.4%), which suggests that SSL is, to some extent, an effective solution for complex tasks in the field of SNNs.

Acknowledgment

This work was supported by Natural Science Foundation of Chongqing (Grant No. cstc2021jcyj-msxmX0565), Fundamental Research Funds for the Central Universities (Grant No. SWU021002), Project of Science and Technology Research Program of Chongqing Education Commission (Grant No. KJZD-K202100203), and National Natural Science Foundation of China (Grant Nos. U1804158, U20A20227).

References

  • Bernert and Yvert [2019] Bernert, M., Yvert, B., 2019. An attention-based spiking neural network for unsupervised spike-sorting. International journal of neural systems 29, 1850059.
  • Brown et al. [2020] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al., 2020. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901.
  • Caron et al. [2021] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A., 2021. Emerging properties in self-supervised vision transformers, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 9650–9660.
  • Chen et al. [2020a] Chen, G., Cao, H., Conradt, J., Tang, H., Rohrbein, F., Knoll, A., 2020a. Event-based neuromorphic vision for autonomous driving: A paradigm shift for bio-inspired visual sensing and perception. IEEE Signal Processing Magazine 37, 34–49.
  • Chen et al. [2020b] Chen, T., Kornblith, S., Norouzi, M., Hinton, G., 2020b. A simple framework for contrastive learning of visual representations, in: International conference on machine learning, PMLR. pp. 1597–1607.
  • Chen et al. [2020c] Chen, X., Fan, H., Girshick, R., He, K., 2020c. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
  • Chen et al. [2021] Chen, X., Xie, S., He, K., 2021. An empirical study of training self-supervised vision transformers, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 9640–9649.
  • Chen et al. [2022] Chen, Z., Wu, J., Hou, J., Li, L., Dong, W., Shi, G., 2022. Ecsnet: Spatio-temporal feature learning for event camera. IEEE Transactions on Circuits and Systems for Video Technology 33, 701–712.
  • Cheng et al. [2023] Cheng, X., Zhang, T., Jia, S., Xu, B., 2023. Meta neurons improve spiking neural networks for efficient spatio-temporal learning. Neurocomputing 531, 217–225.
  • Deng et al. [2022] Deng, S., Li, Y., Zhang, S., Gu, S., 2022. Temporal efficient training of spiking neural network via gradient re-weighting, in: International Conference on Learning Representations.
  • Fang et al. [2021a] Fang, W., Yu, Z., Chen, Y., Huang, T., Masquelier, T., Tian, Y., 2021a. Deep residual learning in spiking neural networks. Advances in Neural Information Processing Systems 34, 21056–21069.
  • Fang et al. [2021b] Fang, W., Yu, Z., Chen, Y., Masquelier, T., Huang, T., Tian, Y., 2021b. Incorporating learnable membrane time constant to enhance learning of spiking neural networks, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 2661–2671.
  • He et al. [2022] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R., 2022. Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16000–16009.
  • He et al. [2020] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R., 2020. Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738.
  • Hu et al. [2024] Hu, Y., Deng, L., Wu, Y., Yao, M., Li, G., 2024. Advancing spiking neural networks toward deep residual learning. IEEE Transactions on Neural Networks and Learning Systems, 1–15. doi:10.1109/TNNLS.2024.3355393.
  • Devlin et al. [2019] Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, pp. 4171–4186.
  • Kaiser et al. [2020] Kaiser, J., Mostafa, H., Neftci, E., 2020. Synaptic plasticity dynamics for deep continuous local learning (decolle). Frontiers in Neuroscience 14, 515306.
  • Kim et al. [2022] Kim, Y., Chough, J., Panda, P., 2022. Beyond classification: Directly training spiking neural networks for semantic segmentation. Neuromorphic Computing and Engineering 2, 044015.
  • Kim and Panda [2021] Kim, Y., Panda, P., 2021. Optimizing deeper spiking neural networks for dynamic vision sensing. Neural Networks 144, 686–698.
  • Kugele et al. [2020] Kugele, A., Pfeil, T., Pfeiffer, M., Chicca, E., 2020. Efficient processing of spatio-temporal data streams with spiking neural networks. Frontiers in neuroscience 14, 512192.
  • Le-Khac et al. [2020] Le-Khac, P.H., Healy, G., Smeaton, A.F., 2020. Contrastive representation learning: A framework and review. IEEE Access 8, 193907–193934.
  • Li et al. [2021] Li, Y., Guo, Y., Zhang, S., Deng, S., Hai, Y., Gu, S., 2021. Differentiable spike: Rethinking gradient-descent for training spiking neural networks. Advances in Neural Information Processing Systems 34, 23426–23439.
  • Li et al. [2022] Li, Y., Kim, Y., Park, H., Geller, T., Panda, P., 2022. Neuromorphic data augmentation for training spiking neural networks, in: European Conference on Computer Vision, Springer. pp. 631–649.
  • Liao et al. [2024] Liao, Z., Liu, Y., Zheng, Q., Pan, G., 2024. Spiking nerf: Representing the real-world geometry by a discontinuous representation, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 13790–13798.
  • Meng et al. [2022] Meng, Q., Xiao, M., Yan, S., Wang, Y., Lin, Z., Luo, Z.Q., 2022. Training high-performance low-latency spiking neural networks by differentiation on spike representation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12444–12453.
  • Oord et al. [2018] Oord, A.v.d., Li, Y., Vinyals, O., 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  • Pei et al. [2019] Pei, J., Deng, L., Song, S., Zhao, M., Zhang, Y., Wu, S., Wang, G., Zou, Z., Wu, Z., He, W., et al., 2019. Towards artificial general intelligence with hybrid tianjic chip architecture. Nature 572, 106–111.
  • Roy et al. [2019] Roy, K., Jaiswal, A., Panda, P., 2019. Towards spike-based machine intelligence with neuromorphic computing. Nature 575, 607–617.
  • Salvatore et al. [2020] Salvatore, N., Mian, S., Abidi, C., George, A.D., 2020. A neuro-inspired approach to intelligent collision avoidance and navigation, in: 2020 AIAA/IEEE 39th Digital Avionics Systems Conference (DASC), IEEE. pp. 1–9.
  • She et al. [2021] She, X., Dash, S., Mukhopadhyay, S., 2021. Sequence approximation using feedforward spiking neural network for spatiotemporal learning: Theory and optimization methods, in: International Conference on Learning Representations.
  • Shen et al. [2024] Shen, H., Wang, H., Ma, Y., Li, L., Duan, S., Wen, S., 2024. Multi-lra: Multi logical residual architecture for spiking neural networks. Information Sciences 660, 120136.
  • Skatchkovsky et al. [2021] Skatchkovsky, N., Jang, H., Simeone, O., 2021. Spiking neural networks—part ii: Detecting spatio-temporal patterns. IEEE Communications Letters 25, 1741–1745.
  • Su et al. [2023] Su, Q., Chou, Y., Hu, Y., Li, J., Mei, S., Zhang, Z., Li, G., 2023. Deep directly-trained spiking neural networks for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6555–6565.
  • Taherkhani et al. [2020] Taherkhani, A., Belatreche, A., Li, Y., Cosma, G., Maguire, L.P., McGinnity, T.M., 2020. A review of learning in biologically plausible spiking neural networks. Neural Networks 122, 253–272.
  • Teeter et al. [2018] Teeter, C., Iyer, R., Menon, V., Gouwens, N., Feng, D., Berg, J., Szafer, A., Cain, N., Zeng, H., Hawrylycz, M., et al., 2018. Generalized leaky integrate-and-fire models classify multiple neuron types. Nature communications 9, 709.
  • Wu et al. [2019] Wu, Y., Deng, L., Li, G., Zhu, J., Xie, Y., Shi, L., 2019. Direct training for spiking neural networks: Faster, larger, better, in: Proceedings of the AAAI conference on artificial intelligence, pp. 1311–1318.
  • Wu et al. [2021] Wu, Z., Zhang, H., Lin, Y., Li, G., Wang, M., Tang, Y., 2021. Liaf-net: Leaky integrate and analog fire network for lightweight and efficient spatiotemporal information processing. IEEE Transactions on Neural Networks and Learning Systems 33, 6249–6262.
  • Yang et al. [2023] Yang, Y., Bartolozzi, C., Zhang, H.H., Nawrocki, R.A., 2023. Neuromorphic electronics for robotic perception, navigation and control: A survey. Engineering Applications of Artificial Intelligence 126, 106838.
  • Yao et al. [2021] Yao, M., Gao, H., Zhao, G., Wang, D., Lin, Y., Yang, Z., Li, G., 2021. Temporal-wise attention spiking neural networks for event streams classification, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10221–10230.
  • Zhao et al. [2022] Zhao, D., Li, Y., Zeng, Y., Wang, J., Zhang, Q., 2022. Spiking capsnet: A spiking neural network with a biologically plausible routing rule between capsules. Information Sciences 610, 1–13.
  • Zheng et al. [2021] Zheng, H., Wu, Y., Deng, L., Hu, Y., Li, G., 2021. Going deeper with directly-trained larger spiking neural networks, in: Proceedings of the AAAI conference on artificial intelligence, pp. 11062–11070.
  • Zhou et al. [2024] Zhou, Z., Che, K., Fang, W., Tian, K., Zhu, Y., Yan, S., Tian, Y., Yuan, L., 2024. Spikformer v2: Join the high accuracy club on imagenet with an snn ticket. arXiv preprint arXiv:2401.02020.
  • Zhou et al. [2022] Zhou, Z., Zhu, Y., He, C., Wang, Y., Yan, S., Tian, Y., Yuan, L., 2022. Spikformer: When spiking neural network meets transformer. arXiv preprint arXiv:2209.15425.
  • Zhu et al. [2023] Zhu, R.J., Zhao, Q., Li, G., Eshraghian, J.K., 2023. Spikegpt: Generative pre-trained language model with spiking neural networks. arXiv preprint arXiv:2302.13939.