\fnref

cor

\fnref

cor \fntext[cor]Corresponding authors: Malu Zhang, Liang-Jian Deng

Tensor Decomposition Based Attention Module for Spiking Neural Networks

Haoyu Deng [email protected] Ruijie Zhu [email protected] Xuerui Qiu [email protected] Yule Duan [email protected] Malu Zhang [email protected] Liang-Jian Deng [email protected] University of Electronic Science and Technology of China, 611731, China University of California, Santa Cruz, 95064, The United States

Abstract

[SUMMARY] The attention mechanism has been proven to be an effective way to improve the performance of spiking neural networks (SNNs). However, from the perspective of tensor decomposition to examine the existing attention modules, we find that the rank of the attention maps generated by previous methods is fixed at 1, lacking the flexibility to adjust for specific tasks. To tackle this problem, we propose an attention module, namely Projected-full Attention (PFA), where the rank of the generated attention maps can be determined based on the characteristics of different tasks. Additionally, the parameter count of PFA grows linearly with the data scale. PFA is composed of the linear projection of spike tensor (LPST) module and attention map composing (AMC) module. In LPST, we start by compressing the original spike tensor into three projected tensors with learnable parameters for each dimension. Then, in AMC, we exploit the inverse procedure of the tensor decomposition process to combine the three tensors into the attention map using a so-called connecting factor. To validate the effectiveness of the proposed PFA module, we integrate it into the widely used VGG and ResNet architectures for classification tasks. Our method achieves state-of-the-art performance on both static and dynamic benchmark datasets, surpassing the existing SNN models with Transformer-based and CNN-based backbones. Code for PFA is available at https://github.com/RisingEntropy/PFA .

keywords:

Spiking neural network \sepAttention mechanism \sepTensor decomposition \sepNeuromorphic computing

1 Introduction

Spiking neural networks (SNNs) are currently attracting the interest of academics due to their lower energy consumption and greater bio-interpretability compared to traditional artificial neural networks (ANNs) [19; 36]. The brain-inspired computation model gives it significant potential in processing temporal data. Though training a deep SNN is still a challenge, recent developments in SNNs introduce backpropagation [3; 55; 65; 50; 53; 41] to relieve this issue to some degree. Moreover, it makes the extending of ANN modules into SNNs possible, such as batch normalization and residual blocks.

As it is now feasible to incorporate ANN modules into SNNs while maintaining the inherent power efficiency, we can leverage these modules to augment the overall performance of SNNs. A good practice for this is the introduction of residual architecture to SNNs. The well-known residual block proposed by He et al. [14] makes it possible to train an ultra-deep ANN network. To introduce the residual block into the field of SNN, many variations [17; 11; 18] are presented from different perspectives, solving problems like gradient vanishing. Moreover, strategies in ANN, such as normalization [67] and architecture, could also effectively favor SNNs’ performance, thereby motivating us to explore more useful techniques in ANNs for better-conducting SNNs.

Refer to caption — Figure 1: Accuracy on CIFAR10 (left) and CIFAR100 (right). Compared with other methods, PFA significantly improves network performance.

Apart from improving network architectures, numerous other approaches can improve the performance of a network. As one of the most promising techniques in ANNs, the attention mechanism can mimic the human ability to focus on certain things while ignoring others. Three representative works [62; 70; 4] have recently proven that the attention mechanism can successfully work in SNNs. Especially, Yao et al. [62] switch channel attention to the temporal dimension, revealing that paying attention to the temporal dimension is effective for performance enhancement. Zhu et al. [70] apply two 1D convolutions to a generated 2D tensor to achieve temporal-channel attention. Inspired by experimental observations on predictive attentional remap**, Cai et al. [4] design the SCTFA module to assess the input tensor from temporal, channel, and spatial dimensions, achieving attention among these three dimensions.

Current software frameworks, i.e., PyTorch and Tensorflow, are boosted by GPU using parallel computation. This makes it hard to realize event-driven SNNs as events occur asynchronously. Based on these frameworks, as a trade-off, a popular approach is splitting the input event streams into slices and composing tensors for convenient later computation. This approach is adopted by all attention-mechanism-related works, showing a promising outcome. Since we are essentially dealing with tensor-formatted data flow, it is natural for us to consider involving tensor analysis theories to implement an attention module for SNNs, which is ignored by previous work. By utilizing mathematical tools such as tensor decomposition theory, we can examine attention modules from a more abstract perspective, thereby identifying the shortcomings and limitations of previous methods.

By expressing the composition of attention maps in previous methods through CANDECOMP/PARAFAC (CP) decomposition form, we found that they are essentially ‘rank-1’ methods. When applied to various tasks, they lack a certain degree of flexibility. In this paper, we introduce a module called Projected-Full Attention (PFA), which is capable of generating attention maps with a rank other than 1. Specifically, PFA consists of two components: Linear Projection of Spike Tensor (LPST) and Attention Map Composing (AMC). LPST is responsible for generating $R$ sets of projections from the input tensor using a small number of parameters. AMC is responsible for creating attention maps using these $R$ sets of projections, where $R$ is called The connecting factor representing the rank concept of the CP decomposition and is a hyper-parameter that can be adjusted based on the specific task.

The key contributions of this paper are outlined as follows:

•

We propose PFA, a module that can achieve temporal-channel-spatial attention. In contrast to previous approaches, the ranks of the attention maps generated by PFA are not fixed at 1; instead, the rank can be chosen based on the specific task. Furthermore, the parameter count of PFA grows linearly with the data scale, and the computational cost is equivalent to that of a single standard convolution layer.
•

We present a comprehensive theoretical analysis, focusing on two critical aspects: the rank of tensors and $R$ , the connecting factor. These analyses lead to a tailored selection criterion, providing effective insights into experimental outcomes and offering a valuable guidance for future applications.
•

We conduct thorough experiments on both static and dynamic datasets. The results demonstrate the effectiveness of PFA which achieves state-of-the-art (SOTA) accuracy on both dynamic and static datasets. Ablation studies further validate the structural reasonableness of PFA. Additionally, we visualize the attention map generated by PFA to intuitively illustrate attention distribution across temporal, channel, and spatial dimensions.

2 Related Works

Spiking Neural Network: Spiking Neural Networks (SNNs), the third generation of neural networks [19; 36], offer a closer emulation of the human brain’s efficiency by utilizing discrete spikes for information transmission [64]. This unique method of communication allows for enhanced energy-saving feature compared to traditional Artificial Neural Networks (ANNs), which rely on continuous signals. The temporal representation capabilities of SNNs have been a focal point in research [62; 8; 64], drawing from seminal neurodynamic models such as the Hodgkin-Huxley (H-H) model [15] and further developed in works like Izhikevich [19] and Leaky Integrate and Fire (LIF) [31]. These models underscore SNNs’ potential in capturing time-dependent patterns in data, a feature not inherently present in ANNs. The challenges in training SNNs, due to their non-differentiable spiking behavior, have also been addressed in recent literature, with gradient-based optimization methods [54; 56; 51] and ANN-to-SNN conversion techniques [45; 9; 49; 52] being key developments. In addition to research on training methods, there are many other studies dedicated to improving the performance of SNNs, such as attention mechanisms.

Attention Mechanism in SNNs: Since Google proposed recurrent models of visual attention in 2014 [38], the attention mechanism become a potent tool to increase the performance of a neural network, which gives distinct parts of the input data varied weights. For instance, the squeeze-and-excitation network (SENet) [16] gives different weights to different channels of input data and significantly improves the performance. From this fact, it is of vital significance to introduce the attention mechanism into SNNs. Qiu et al. apply the attention mechanism to the encoding layers of SNNs [42]. To achieve temporal-wise attention, inspired by SENet, TA-SNN [62] is proposed. Besides, TCJA [70] uses two 1D convolutions to pay attention to both channel dimension and temporal dimension to prolong the attention mechanism. Nevertheless, they fail to consider three crucial aspects of SNNs, i.e. temporal, channel, and spatial. By changing SENet and leaving affection in the temporal direction, Cai et al. [4] and Yao et al. [63] successfully apply temporal-channel-spatial attention to SNNs. While the methods of predecessors have been very successful, they do not take into account the fact that the input data is a high-order tensor. Examining data from a tensor perspective can provide a more mathematically abstract viewpoint, revealing additional properties. As described in Section 3.1, the methods employed by previous researchers result in ‘rank-1’ attention maps, lacking the flexibility and specificity for different tasks.

Tensor Decomposition: Tensor decomposition methods [20; 46; 25] have undergone years of development and achieved remarkable success. Many prior studies [28; 61; 39; 26; 2; 35; 48] have embraced these techniques to optimize neural network architectures or uncover versatile modules with multiple functionalities. Kossaifi et al., for instance, conducted a comprehensive review of convolution kernels from the perspective of tensor decomposition and introduced an innovative convolution module that can seamlessly extend to higher dimensions. Lau et al. propose a new decomposition method to enable the direct use of the depth-wise convolutional layer with large kernels in the attention module, without requiring any extra blocks [32]. Apart from using tensor decomposition methods to improve network structures, tensor decomposition theory can also impose certain constraints on the data itself [58; 60; 5]. Xu et al. leverage tensor decomposition to predict real-time traffic flow, demonstrating its effectiveness in handling complex patterns [60]. Chen et al. employ tensor CP decomposition to generate an attention map for segmentation, achieving significant improvements [5]. These methods either use tensor decomposition to reduce the parameter count in processing high-dimensional data or leverage low-rank properties to enhance performance. In this paper, drawing inspiration from these prior endeavors, we leverage tensor decomposition methods to enhance the field of SNNs. We design a lightweight attention module, designed to be both parameter-efficient and computationally efficient.

3 Motivation and Method

3.1 Motivation

Tensor CP decomposition [24], is a mathematical technique employed to decompose a high-order tensor into a sum of rank-one tensors. Mathematically, for a tensor denoted as $\mathcal{X}\in\mathbb{R}^{D_{1},D_{2},\dots,D_{n}}$ , the CP decomposition can be expressed as:

\mathcal{X}\approx\sum_{r=1}^{R}U_{1,r}^{(D_{1})}\circ U_{2,r}^{(D_{2})}\circ% \dots\circ U_{n,r}^{(D_{n})}.

(1)

In this equation, $U_{1,r}^{(D_{1})}\circ U_{2,r}^{(D_{2})}\circ\dots\circ U_{n,r}^{(D_{n})}$ represents the $r$ -th rank-one tensor, where the symbol $\circ$ signifies the outer product of vectors, and $R$ denotes the rank. Specifically, $U_{n,r}^{(D_{1})}$ represents a vector corresponding to the $n$ -th dimension in the $r$ -th rank-one tensor, with a dimensionality of $D_{1}$ .

Rethinking previous attention modules [62; 70; 4] from the perspective of CP decomposition, we can find that they are all rank-one methods. For the sake of clarity, we denote an all-one vector with a dimensionality of $D$ as $I^{(D)}$ . Specifically, for TA-SNN [62], the temporal attention module for SNNs yields the attention map $\mathcal{A}_{\text{TA-SNN}}$ for the input tensor $\mathcal{X}$ as follows:

\mathcal{A}_{\text{TA-SNN}}=I^{(HW)}_{S}\circ I^{(C)}\circ\mathcal{F}_{T}^{(T)% }(\mathcal{X}).

(2)

Here, $\mathcal{F}_{T}^{(T)}(\mathcal{X})$ represents a function that generates the temporal attention vector with a dimensionality of $T$ . The same mathematical form can also be used to reconfigure the TCJA [70] and SCTFA [4] method. For TCJA, the attention map $\mathcal{A}_{\text{TCJA}}$ can be written as:

\mathcal{A}_{\text{TCJA}}=I^{(HW)}_{S}\circ\mathcal{F}^{(C)}_{C}(\mathcal{X})% \circ\mathcal{F}^{(T)}_{T}(\mathcal{X}).

(3)

For SCTFA, the attention map $\mathcal{A}_{\text{SCTFA}}$ is:

\mathcal{A}_{\text{SCTFA}}=\mathcal{F}^{(HW)}_{S}(\mathcal{X})\circ\mathcal{F}% ^{(C)}_{C}(\mathcal{X})\circ\mathcal{F}^{(T)}_{T}(\mathcal{X}).

(4)

By comparing Equation 1 and Equation 2, 3, 4, we can easily see that the $R$ of Equation 2, 3, 4 is fixed at 1, making them special cases of Equation 1. This motivates us to design an attention module that the rank of the attention map is not fixed at 1, allowing us to control the attention effect by adjusting the rank of the attention map; namely, we wish our proposed module could be written in a form like:

\mathcal{A}_{PFA}=\sum_{r=1}^{R}U^{(HW)}_{s,r}\circ U^{(C)}_{c,r}\circ U^{(T)}% _{t,r}.

(5)

Through this way, we can use $R$ to control certain properties of the attention map, such as low rank and the degree of compression of the original tensor. From the perspective of tensor decomposition, previous methods can be seen as special cases of our method when $R=1$ . However, due to the absence of summation terms in previous methods, adjustments tailored to the specificity of the dataset cannot be made, lacking a certain degree of flexibility. The flexibility mentioned is crucial, as demonstrated by the experiments presented in Section 4.4. The results indicate that the optimal choice for $R$ varies across different datasets.

3.2 Projected Full-Attention (PFA)

In this section, we will present an overview of our proposed PFA module and detailed descriptions of two sub-modules of PFA. PFA is composed of two sub-modules, Linear Projection of Spike Tensor (LPST) and Attention Map Composing (AMC). LPST, as its name suggests, produces projections from input tensor $\mathcal{X}$ for AMC in the form of matrices. The AMC module splits the matrices from LPST to vectors to compose the final attention map.

The core of the attention mechanism is to give essential parts of the input data higher weights while ignoring less-related parts to some degree. This is particularly important for SNNs, especially for data collected by DVS cameras. Data from DVS camera often contains more noise, which greatly affects network performance. Thanks to the feature of CP-decomposition, PFA can effectively suppress noise. Also, as shown in Figure 4, the inexact representation of the input tensor can help reduce overfitting, which is very severe in SNNs. In our proposed PFA, we first design the LPST module to project the higher-order tensor (generally 3D) to three lower-order tensors, i.e., 2D matrices. Using learnable parameters, LPST can selectively reserve essential information while ignoring irrelevant ones. In AMC, we split the matrices into $R$ (called connecting factor here) vectors. Three vectors from three matrices of temporal, channel, and spatial dimensions respectively are used to conduct the outer product to compose a part of the final attention map. The final attention map is the sum of these $R$ parts. We fuse this attention map with the origin input to obtain the refined tensor.

The workflow of PFA is plotted in Figure 3. In what follows, we will give the details of LPST and AMC.

3.2.1 Linear Projection of Spike Tensor (LPST)

Tensor decomposition is a vital approach to represent a tensor and its latent features. Therefore, it is natural for us to utilize tensor-decomposition-related techniques to explore the data characteristics. A typical technique is the CP decomposition. Whatever way we adopt, it is essential to generate a ‘projection’ to extract significant components of the input tensor to yield an attention map later.

In this part, we need to construct three linear projections of the input data stream along spatial, channel, and temporal dimensions. In LPST, we apply two fully connected (FC) layers and a convolution layer to project the input tensor for simplicity. Note that here we simply use single-layer FCs and a convolution layer. What we do here is just a simple projection with learnable parameters rather than complex function fitting. In this way, we retain the required information and reduce the number of parameters.

For the temporal dimension, we first squeeze the input tensor ( $H\times W\times C\times T$ ) in the following way:

Y^{T}_{(c,t)}=\frac{1}{HW}\sum\limits_{i=1}^{H}\sum\limits_{j=1}^{W}\mathcal{X% }_{(c,t,i,j)},

(6)

where $Y^{T}_{(c,t)}$ denotes the output matrix, $H$ and $W$ denote the spatial sizes, $\mathcal{X}_{(i,j,c,t)}$ denotes the input tensor. After the squeeze operation, the tensor is converted to a matrix in the shape of $C\times T$ . Then, we apply an FC layer to reserve $R$ necessary information in each column. A sigmoid activation function is applied after the FC layer. Finally, we obtain a projection matrix in the shape of $r\times T$ .

For the channel dimension, the operation is similar to that of the temporal dimension. We squeeze the input tensor by:

Y^{C}_{(c,t)}=\frac{1}{HW}\sum\limits_{i=1}^{H}\sum\limits_{j=1}^{W}\mathcal{X% }_{(c,t,i,j)},

(7)

where $Y^{C}_{(c,t)}$ denotes the output matrix, $C$ denotes the number of channel. Then, we pass the $T\times C$ matrix through an FC layer to acquire the projected matrix in the shape of $r\times T$ with the sigmoid activation function applied.

For spatial dimension, the channel is squeezed by:

Y^{S}_{(t,x,y)}=\frac{1}{C}\sum\limits_{k=1}^{C}\mathcal{X}_{(k,t,x,y)}.

(8)

Afterward, we employ a convolution to the squeezed tensor, treating $T$ as the input channel and $R$ as the output channel. Then, we reshape the output tensor from the convolution into a matrix in the shape of $HW\times r$ . The sigmoid activation function is applied.

3.2.2 Attention Map Composing (AMC)

Once we obtain the three projections corresponding to three dimensions, a necessary step is to compose an attention map using the three projections. Here, we adopt the reverse process of the tensor CP decomposition. In the CP decomposition, a tensor $\mathcal{X}\in\mathcal{R}^{HW\times C\times T}$ can be represented as:

\mathcal{A}_{s,c,t}=\sum_{r=1}^{R}U^{(HW)}_{s,r}\circ U^{(C)}_{c,r}\circ U^{(T% )}_{t,r},

(9)

where $U_{si}$ , $U_{ci}$ and $U_{ti}$ denote the $i$ -th vectors in the spatial, channel, and temporal projection tensors respectively, $\mathcal{A}$ is the attention map, and $\circ$ is the outer product of vectors. In particular, $R$ , the connecting factor, can be viewed as the rank concept in the tensor CP decomposition (this process is plotted in Figure 3). It is worth noting that the tensor CP decomposition is just an approximate representation of tensors, which perfectly meets our demands. $R$ can control the accuracy of our representation of the input tensor. That is, the larger $R$ is, the more accurate the representation is, and vice versa, which perfectly realizes the attention mechanism we need. In the experiment part, we discuss the choice of connecting factor $R$ and its effect in detail.

3.3 Parameter and Computational Cost Analysis

This section presents an analysis of the parameter and computational cost of PFA. PFA has an advantage in terms of parameter quantity. For an input tensor with the shape of $H\times W\times C\times T$ , the overall parameter amount is:

\underbrace{C\times R}_{\text{MLP of }Y^{T}}+\underbrace{T\times R}_{\text{MLP% of }Y^{C}}+\underbrace{k^{2}\times T\times R}_{\text{Conv of }Y^{S}},

(10)

where $k$ is the convolution kernel size. Note that $R$ is limited to a small fixed value according to our theoretical analysis in Section 3.4, and the parameter size grows linearly with $T$ and $C$ . Figure 2 shows the parameter growth curve of PFA and compares several other techniques.

PFA is also lightweight in computational burden. The time complexity of PFA can be calculated in the following way:

\underbrace{3HWTC}_{\text{Obtain }Y^{T},Y^{C},Y^{S}}+\underbrace{2TCR}_{\text{% MLP of }Y^{T},Y^{C}}+\underbrace{HWk^{2}TR}_{\text{Conv of }Y^{S}}+\underbrace% {RHWTC}_{\text{AMC}}.

(11)

It is worth noting that a standard convolution operation typically entails a computational burden of $HWk^{2}TC_{in}C_{out}$ . In this context, given that $R$ is not large, a PFA module incurs a computational load that is smaller than that of a convolution operation since $C$ is usually in the range of several hundred.

3.4 Theoretical Analysis on $R$

In light of the fact that CP decomposition provides only an approximate representation of a tensor, and the quality of this approximation relies on the chosen factor, denoted as $R$ , it is essential to establish a clear guiding principle for determining the optimal value of $R$ . This section delves into our approach for gaining insights into the characteristics of PFA and streamlining its future applications.

Our exploration begins by adopting the methodology put forth by Acar et al. [1] to investigate the rank of the input data. However, due to the recognized NP-hard nature of computing the CP decomposition of a tensor [13; 24], a direct calculation from the tensor becomes impractical. To surmount this challenge, we employ a gradient descent technique to approximate the original tensor by systematically experimenting with different rank values, ultimately selecting the one that yields the best approximation for the tensor. To provide further clarity, consider an element of the tensor represented as:

x_{ijk}\approx\sum\limits_{r=1}^{R}a_{i,r}b_{j,r}c_{k,r}.

(12)

We treat this as an optimization problem, with $a_{i,r}$ , $b_{j,r}$ , and $c_{k,r}$ considered as parameters. We then optimize the loss function defined as:

\mathcal{L}=\frac{1}{2}\sum_{(i,j,k)\in\Omega}(x_{i,j,k}-\sum\limits_{r=1}^{R}% a_{i,r}b_{j,r}c_{k,r})^{2}.

(13)

To perform optimization, we compute the first-order partial derivative of the loss with respect to a parameter (e.g., $a_{i,r}$ ), given by:

\frac{\partial\mathcal{L}}{\partial a_{i,r}}=-\frac{1}{2}\cdot 2\cdot(x_{i,j,k% }-\sum\limits_{r=1}^{R}a_{i,r}b_{j,r}c_{k,r})\cdot b_{j,r}c_{k,r}.

(14)

Finally, we apply the gradient descent method for optimization:

a_{i,r}^{\prime}\leftarrow a_{i,r}-\mu\cdot\frac{\partial\mathcal{L}}{\partial a% _{i,r}},

(15)

Here, $\mu$ represents the step size used in the gradient descent process. This approach allows us to effectively estimate the rank of the tensor while circumventing the computational complexities associated with CP decomposition.

We take a piece of data from the dataset CIFAR10DVS as a toy example and discover its rank. In this case, we adopt $\mu=0.0001$ and iterate the process 1000 times to get the final optimized parameters. We utilize $\ell_{2}$ norm as the metric to judge the approximation effect. As shown in Figure 5, the error reduces as the rank increases, and starts to converge when the rank exceeds 30. This toy example shows that the rank of the input tensor is around 30, indicating that we need a rank exceeding 30 to represent it using CP decomposition precisely. It should be noted that accurately representing the original tensor for the attention mechanism is not conducive to improving network performance. The core idea of the attention mechanism is to focus on the important parts while ignoring the secondaries. Representing raw tensors too precisely is counterproductive. On the one hand, overly precise characterization of input tensors is detrimental to neglecting unimportant parts. On the other hand, it may lead to severe overfitting. As shown in Figure 4, the vanilla VGG network exhibits severe overfitting, achieving nearly 100% accuracy on the training set but less than 80% on the validation set. However, upon incorporating the PFA modules, there is not a substantial improvement in training set accuracy. Nevertheless, there is a significant increase in validation set accuracy. This observation underscores the benefit of utilizing less precise tensor representations with PFA to mitigate overfitting. Therefore, a good choice for $R$ should not be too large. This is a very useful conclusion. On the one hand, it tells us the range of finding $R$ . On the other hand, it also ensures that PFA can obtain good results without excessive calculation.

As shown in our experiment (Figure 8), though the rank of the attention map differs on dynamic datasets like CIFAR10DVS and static datasets like CIFAR100, the best option for $R$ is not very big. Further observing the experimental results, we find that when the value of $R$ exceeds a certain threshold (approximately $\frac{T}{2}$ , $T$ is the time step), the larger $R$ is, the worse the effect will be. The above analysis of the experimental results proves the correctness of our conclusion. Considering the above analysis, we propose a principle for selecting $R$ : for dynamic datasets, the search for the optimal value of $R$ is performed within the range of no more than $T$ . Specifically, we focus on identifying the best $R$ value around $\frac{T}{2}$ , considering the varying content across different time steps. In the case of static datasets, the search for the optimal $R$ value begins from 1 and extends until the best value is determined. Since static datasets involve duplicated frames, this search aims to find the most suitable $R$ value for the low-rank condition.

4 Experiment

We evaluate the classification performance of our PFA on two static datasets: CIFAR10 [30], CIFAR100 [30] and two dynamic datasets: CIFAR10DVS [33], NCaltech-101 [40]. Details of the datasets, network architecture, data augmentation, loss function, and pre-process procedure is introduced in this section. Extensive experiment results are also presented.

Dataset	Learning Rate	Epoch	Batch Size
CIFAR10	0.1	200	128
CIFAR100	0.1	200	64
CIFAR10DVS	0.0001	200	32
NCaltech-101	0.0001	200	32

Table 1: Hyper-parameter settings of PFA.

4.1 Datasets and Training Details

4.1.1 Datasets

We have conducted experiments on both static and dynamic datasets for object classification. We train and validate PFA on a workstation equipped with one RTX 3090. For different datasets, the hyper-parameters are listed in Table 1. The summaries of datasets and augmentation involved in the experiment are listed below.

CIFAR 10/100 consist of 50k training images and 10k testing images with the size of $32\times 32$ [30]. We use ResNet-19 for both CIFAR10 and CIFAR100. Random horizontal flips and crops are applied to the training images for augmentation. Moreover, cutout is also used for augmentation which is the same as [8]
CIFAR10-DVS converts 10,000 frame-based images of 1010 classes into event streams with the dynamic vision sensor. Since the CIFAR10DVS dataset [33] does not divide training and testing sets, we split the dataset into 9k training images and 1k test images and reduced the spatial resolution from $128\times 128$ to $48\times 48$ [8]. We use VGGSNN for neuromorphic datasets CIFAR10DVS. In each frame, horizontal flip** and mixup are adopted, where the probability of Flip** is set to 0.5. Then, we randomly select one augmentation among rolling, rotation, cutout, and shear, where the random rolling range is 5 pixels, and the degree of Rotation is sampled from the uniform distribution, which is the same as the Ref [8].
NCALTECH-101 is also converted from the original version of Caltech-101 [40] with a slight change in object classes to avoid confusion. The NCaltech-101 consists of 100 object classes plus one background class. We apply the 9: 1 train-validation split as CIFAR10DVS. The VGGSNN is applied to neuromorphic datasets NCALTECH-101. Moreover, the augmentation is identical to CIFAR10DVS. Fashion-MNIST is a more demanding successor of the famous MNIST dataset. Fashion-MNIST contains 70,000 static grayscale images of 10 different categories of clothing, each image being of size 28x28 pixels. We utilize the simple example network from the Spikingjelly framework, which comprises two convolution layers and a fully-connected layer, to evaluate PFA on this dataset.

4.1.2 Loss function

We used the cross-entropy loss for the neuromorphic datasets CIFAR10DVS and NCALTECH-101. In order to get a better performance in static datasets (CIFAR 10 / 100), we used a Temporal Efficient Training [8] loss function. It can be concluded as follows:

\mathcal{L}_{\mathrm{TET}}=\frac{1-\lambda}{T}\cdot\sum_{t=1}^{T}\mathcal{L}_{% \mathrm{CE}}[\boldsymbol{O}(t),\boldsymbol{y}]+\frac{\lambda}{T}\cdot\sum_{t=1% }^{T}\mathcal{L}_{\mathrm{MSE}}(\mathbf{O}(t),\phi),

(16)

where $T$ is the total time steps, $\mathcal{L}_{\mathrm{CE}}$ and $\mathcal{L}_{\mathrm{MSE}}$ denote the cross-entropy loss and mean square error. And $\boldsymbol{y}$ represents the target label. Moreover, $\mathbf{O}(t)$ and $\phi$ are the output and a constant used to regularize the membrane potential distribution. Additionally, we set $\phi=V_{th}$ in our experiments. In practice, we use a hyper-parameter $\lambda$ to adjust the proportion of the regular term.

4.1.3 Network Architectures

To evaluate PFA, we integrate it into VGG-type [47] and ResNet [14] architectures, which are previously utilized in [12; 70; 8; 63]. Figure 6 presents the basic blocks involved in VGG and ResNet architectures. Five PFA modules are inserted into a VGG-type network after each average pooling operation. In the ResNet architecture, PFA is inserted before the residual connection in the last two layers. When splitting the event flow data into frames, we simply integrate the input in the time domain. Particularly, our work is based on the software framework SpikingJelly [10]. Taking notice that frames in CIFAR10-DVS and NCaltech-101 are entirely occupied by their content, we utilized a pooling layer as the first layer of the network to reduce memory consumption.

4.2 Comparison with Existing SOTA Works

Dataset Method Type Architecture Time steps Accuracy CIFAR10/CIFAR100 SIDP [44]^ICLR-2019 Hybrid training ResNet-20/VGGSNN 250/125 95.4/76.4 Diet [43]^TNNLS-2021 Spike-based BP ResNet-20 10/5 92.5/64.0 tdBN [68]^AAAI-2021 Spike-based BP ResNet-19 6 93.1/71.1 Dspike [34]^NeurIPS-2021 Spike-based BP ResNet-18 6 94.2/74.2 NAS-SNN [23]^ECCV-2022 Spike-based BP NAS 5 92.7/73.0 TET [8]^ICLR-2022 Spike-based BP ResNet-19 6 94.5/74.7 DSR [37]^CVPR-2022 Spike-based BP ResNet-19 20 95.4/78.5 Spikformer [69]^ICLR-2023 Spike-based BP Spiking-ViT 4 95.5/78.2 PFA Spike-based BP ResNet-19 6 95.7/79.1 4 95.7/78.1 2 95.6/76.7 CIFAR10DVS LIAF-Net [57]^TNNLS-2021 Spike-based BP LIAF-Net 10 70.4 tdBN [68]^AAAI-2021 Spike-based BP ResNet-19 10 67.8 TA-SNN [62]^ICCV-2021 Spike-based BP LIAF-Net 10 72.0 Dspike [34]^NeurIPS-2021 Spike-based BP VGGSNN 10 75.4 TET [8]^ICLR-2022 Spike-based BP VGGSNN 10 83.2 DSR [37]^CVPR-2022 Spike-based BP VGGSNN 10 77.3 TCJA [70]^TNNLS-2024 Spike-based BP VGGSNN 10 80.7 Spikformer [69]^ICLR-2023 Spike-based BP Spiking-ViT 16 80.9 PFA Spike-based BP VGGSNN 14 84.0 NCaltech-101 SALT [22]^{Neural Netw-2021} Spike-based BP VGGSNN 20 55.0 TCJA [70]^TNNLS-2024 Spike-based BP VGGSNN 14 78.5 PFA Spike-based BP VGGSNN 14 80.5

Table 2: Compare with existing works. Our method improves network performance across all tasks.

To validate the effectiveness of our PFA, we conducted a comparative analysis with several state-of-the-art SNNs, including those employing hybrid training and spike-based backpropagation. The results for both static image data and dynamic data classification tasks are summarized in Table 2. Specifically, the results for the static image dataset are presented at time steps 2, 4, and 6.

For CIFAR10 and CIFAR100 datasets, our PFA outperforms prior works in terms of accuracy. Notably, PFA achieves superior performance on CIFAR10 and CIFAR100, with respective improvements of 1.2% and 4.4% over the TET method [8] which utilizes the same network architecture and time steps. Furthermore, PFA surpasses the DSR method [37] while requiring only 10 $\times$ fewer time steps. Additionally, our attention-based SNN demonstrates better performance than the Spiking-ViT backbone [69].

Method	Time Steps	Accuracy
ST-RSBP[66] ${}^{\text{NIPS-2019}}$	400	90.1
LISNN[6] ${}^{\text{IJCAI-2020}}$	20	92.1
PLIF[12] ${}^{\text{ICCV-2021}}$	8	94.4
TCJA[70] ${}^{\text{TNNLS-2024}}$	8	94.8
PFA (Ours)	8	94.5

Table 3: Accuracy on Fashion-MNIST[59] dataset.

For the Fashion-MNIST dataset, as depicted in Table 3, while the accuracy of our proposed PFA falls slightly short of TCJA [70], it still surpasses previous methodologies utilizing longer time steps [66; 6] or incorporating learnable spike thresholds [12]. We hypothesize that the relatively diminutive image size of the Fashion-MNIST dataset curtails the efficacy of spatial attention. This conjecture aligns with the findings from our ablation experiments, indicating that spatial attention plays a pivotal role in determining PFA performance.

For the dynamic dataset CIFAR10DVS, our method outperforms previous approaches using binary spikes by 3.1%, even with fewer time steps. For the NCaltech101 dataset, we achieved an impressive 80.5% top-1 accuracy, far surpassing the performance of prior work [70].

To further illustrate the advantages of our PFA, we compare it with previous works at various time steps. As depicted in Figure 1, we compare PFA with Spike-based BP SNNs. Notably, our PFA exhibits a significant advantage at shorter simulation time steps, owing to its superior representation capability.

We observe from the perspective of simulated time steps that PFA can achieve satisfactory performance even with a smaller number of simulated time steps. For the CIFAR-10 dataset, simply utilizing 2 simulated time steps with PFA can surpass the performance of previous works, including transformer-like structures such as Spikeformer. On the CIFAR-100 dataset, compared to the top-performing Spikeformer, PFA lags by only 0.1% accuracy with 4 time steps. On DVS datasets like CIFAR-10DVS and NCaltech, employing 14 simulated time steps, PFA significantly outperforms previous methods. On CIFAR10-DVS, PFA surpasses Spikeformer by 3.1%, and compared to the TET [8] method with a similar structure, it achieves a higher accuracy by 0.8%. On the NCaltech-101 dataset, PFA outperforms the previous best TCJA method by 2%.

In summary, PFA offers substantial advantages over previous methods. When applied to attention modules like TCJA [70], PFA significantly enhances resistance to overfitting, leading to improved accuracy on the validation set. Although Spikeformer, which is based on the Spike ViT architecture, performs better than many CNN-based techniques, an especially interesting breakthrough is the incorporation of PFA into traditional CNN architectures like VGG, which can even outperform Spikeformer’s [69] performance.

4.3 Experiments on Image Generation Tasks

Dataset	Method	IS $\uparrow$	FID $\downarrow$	FAD $\downarrow$
MNIST[7]	ANN[21]^AAAI-2022	5.95	112.5	17.09
	FSVAE[21]^AAAI-2022	6.21	97.06	35.54
	TCJA[70]^TNNLS-2024	6.45	100.8	19.39
	PFA (Ours)	6.97	96.82	15.12
Fashion- MNIST[59]	ANN[21]^AAAI-2022	4.58	123.7	18.08
	FSVAE[21]^AAAI-2022	4.55	90.12	15.75
	TCJA[70]^TNNLS-2024	5.61	93.41	12.46
	PFA (Ours)	5.35	97.3	11.97
CIFAR10[29]	ANN[21]^AAAI-2022	2.59	229.6	196.9
	FSVAE[21]^AAAI-2022	2.94	175.5	133.9
	TCJA[70]^TNNLS-2024	3.73	170.1	100.4
	PFA (Ours)	3.84	166.4	92.83

Table 4: Comparison with existing methods on image generation tasks.

To further validate the efficacy of PFA, we conduct image generation tasks on MNIST [7], Fashion-MNIST [59], and CIFAR10 [29] using fully spike variation autoencoders [21] (FSVAE) . We integrated our proposed PFA as the first layer of FSVAE while maintaining other settings identical to the original network. The experimental results are outlined in Table 4. Notably, PFA exhibits superior performance metrics on both MNIST [7] and CIFAR10 [29] datasets. Additionally, we visually depict the generated images in Figure 7, further corroborating the enhancements facilitated by PFA.

4.4 Discussion on $R$

Figure 8 demonstrates the performance with different $R$ . The outcome reveals that for dynamic datasets, the best rank option falls around $\frac{T}{2}$ while for static datasets, the best value of $R$ (also known as connecting factor) is very small.

Dynamic datasets exhibit significant variations in content across different time steps. Conversely, static datasets involve the duplication of frames, where each frame is replicated $T$ times as input to the network. Consequently, this duplication results in the attention map being subjected to a low-rank condition. Considering a comprehensive assessment of both experimental outcomes and theoretical insights, our conclusion asserts that the precision of the attention map in faithfully mirroring the input tensor is unnecessary. Such fidelity does not inherently facilitate the effective concentration on pivotal components. Furthermore, our findings unveil an intriguing phenomenon: even in scenarios where SNNs exhibit considerable sparsity, the rank of the attention map can remain markedly low. This revelation implies the presence of untapped optimization potential within the realm of SNN sparsity.

It is worth noting that on the CIFAR10-DVS dataset, when $R=1$ , the performance of PFA is comparable to TCJA (see Table 2 and does not show significant advantages. This phenomenon strongly confirms our previous statement that TCJA is a "rank-1" attention, a specal case of PFA, highlighting the necessity and advantages of non-"rank-1" attention modules.

4.5 Ablation Study

In this section, we perform ablation experiments from both global and local perspectives to validate the efficacy of PFA on CIFAR10DVS [29] Initially, we vary the number of added PFA layers in the network to assess the impact on classification accuracy. Subsequently, we set the three projections obtained within the PFA to constant matrices, with each element assigned a value of 1, aiming to eliminate attention to specific dimensions. This helps explore the influence of different attention dimensions on classification results. Finally, for a more comprehensive insight into the influence of PFA on the training process, we conduct ablation experiments specifically involving the insertion of PFA.

We incorporate varying numbers of PFAs into the VGG network and present the results in Figure 9. In this experiment, the value of $R$ is held constant at 8. It is evident that the network’s accuracy exhibits a noticeable improvement with more PFA layers. Remarkably, even the addition of a single layer leads to enhanced accuracy in the network.

In order to explore how attention in temporal-channel-spatial dimensions affects classification accuracy, we conduct an ablation by fixing the corresponding projection matrix obtained by LPST to an all-one tensor. As shown in Figure 10, whichever dimension is ablated will adversely affect the results. Ablating temporal and ablating channels have similar effects. This means that PFA treats temporal and channel dimensions equally. However, when we ablate the spatial dimension, the accuracy of the network drops significantly (79.60%) but still outperforms the baseline (77.90%). Whichever dimension we ablate, the performance drops. When we restrict the focus dimensions to only one, namely, spatial, temporal, or channel, we observe another decrease in the network’s performance compared to when attention is distributed across two dimensions. The retention of spatial attention significantly impacts accuracy, followed by temporal attention. This observation highlights that PFA effectively directs attention to distinct regions across multiple dimensions, aligning with the intended purpose of attention modules. Specifically, PFA exhibits attention distribution across all three dimensions, albeit with varying weights. Notably, the spatial dimension emerges as the most significant, followed by the time and channel dimensions, in terms of attention importance.

4.6 Attention Visualization

In order to visually represent the attention distribution of the PFA module, we have generated an attention map in PFA and depicted it in Figure 11. The attention maps corresponding to the temporal and channel dimensions in the figure are obtained from the deep layer of the network. This choice was made because deep layers tend to provide a more accurate reflection of the attention’s role. Conversely, the spatial attention map is obtained from the network’s input layer, where the shallow graphic features are more distinct and conspicuous, facilitating analysis and observation. Figure 11 also illustrates the distinctive attention distributions in the temporal, channel, and spatial dimensions achieved by PFA. This finding aligns with the conclusions drawn from our ablation study, further validating our earlier assertions.

5 Conclusions

In this paper, we introduce a novel attention module named PFA, which combines learning and the generation of a temporal-channel-spatial attention map through CP decomposition. Specifically, we propose the LPST to learn a projection of the input tensor that can extract essential information for AMC to compose the attention map. In the AMC, we utilize the reverse process of CP decomposition to generate the attention map. After analyzing the rank of the raw input tensor and the best option for connecting factor $R$ , we propose a principle to search for the best $R$ . Remarkably, our findings suggest that even a low rank can lead to outstanding performance. On classification tasks, regardless of the form of the dataset, dynamic or static, PFA achieves the SOTA performance. When we extend our evaluation of PFA to image generation tasks, it still shows its capacity to enhance performance on this task as well. Furthermore, additional ablation studies bolster the rationale behind PFA’s efficacy, while visualizing the attention distribution of PFA offers intuitive insights into its substantial impact.

In the future, we plan to extend the applicability of the PFA model to various demanding intelligent applications and deploy the model on a neuromorphic chip.

References

Acar et al. [2011] Evrim Acar, Daniel M. Dunlavy, Tamara G. Kolda, and Morten Mørup. Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1):41–56, March 2011. 10.1016/j.chemolab.2010.08.004. URL https://doi.org/10.1016/j.chemolab.2010.08.004.
Babiloni et al. [2023] Francesca Babiloni, Thomas Tanay, Jiankang Deng, Matteo Maggioni, and Stefanos Zafeiriou. Factorized dynamic fully-connected layers for neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1374–1383, 2023.
Bohte et al. [2002] Sander M. Bohte, Joost N. Kok, and Han La Poutré. Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing, 48(1-4):17–37, October 2002. 10.1016/s0925-2312(01)00658-0. URL https://doi.org/10.1016/s0925-2312(01)00658-0.
Cai et al. [2022] Wuque Cai, Hongze Sun, Rui Liu, Yan Cui, Jun Wang, Yang Xia, Dezhong Yao, and Daqing Guo. A spatial-channel-temporal-fused attention for spiking neural networks, 2022. URL https://arxiv.longhoe.net/abs/2209.10837.
Chen et al. [2020] Wanli Chen, Xinge Zhu, Ruoqi Sun, Junjun He, Ruiyu Li, Xiaoyong Shen, and Bei Yu. Tensor low-rank reconstruction for semantic segmentation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 52–69, Cham, 2020. Springer International Publishing. ISBN 978-3-030-58520-4.
Cheng et al. [2021] Xiang Cheng, Yunzhe Hao, Jiaming Xu, and Bo Xu. Lisnn: improving spiking neural networks with lateral interactions for robust object recognition. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI’20, 2021. ISBN 9780999241165.
Deng [2012] Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012. 10.1109/MSP.2012.2211477.
Deng et al. [2021] Shikuang Deng, Yuhang Li, Shanghang Zhang, and Shi Gu. Temporal Efficient Training of Spiking Neural Network via Gradient Re-weighting. In International Conference on Learning Representations (ICLR), 2021.
Diehl et al. [2015] Peter U. Diehl, Daniel Neil, Jonathan Binas, Matthew Cook, Shih-Chii Liu, and Michael Pfeiffer. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2015. 10.1109/IJCNN.2015.7280696.
Fang et al. [2020] Wei Fang, Yanqi Chen, Jianhao Ding, Ding Chen, Zhaofei Yu, Huihui Zhou, Yonghong Tian, and other contributors. Spikingjelly. https://github.com/fangwei123456/spikingjelly, 2020. Accessed: 2023-01-16.
Fang et al. [2021a] Wei Fang, Zhaofei Yu, Yanqi Chen, Tiejun Huang, Timothée Masquelier, and Yonghong Tian. Deep residual learning in spiking neural networks. Advances in Neural Information Processing Systems, 34:21056–21069, 2021a.
Fang et al. [2021b] Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian. Incorporating Learnable Membrane Time Constant To Enhance Learning of Spiking Neural Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2661–2671, 2021b.
Håstad [1990] Johan Håstad. Tensor rank is NP-complete. Journal of Algorithms, 11(4):644–654, December 1990. 10.1016/0196-6774(90)90014-6. URL https://doi.org/10.1016/0196-6774(90)90014-6.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2016. 10.1109/cvpr.2016.90. URL https://doi.org/10.1109/cvpr.2016.90.
Hodgkin and Huxley [1952] A. L. Hodgkin and A. F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4):500–544, August 1952. 10.1113/jphysiol.1952.sp004764. URL https://doi.org/10.1113/jphysiol.1952.sp004764.
Hu et al. [2018a] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018a.
Hu et al. [2018b] Yangfan Hu, Hua** Tang, and Gang Pan. Spiking deep residual network, 2018b. URL https://arxiv.longhoe.net/abs/1805.01352.
Hu et al. [2022] Yifan Hu, Yujie Wu, Lei Deng, and Guoqi Li. Advancing deep residual learning by solving the crux of degradation in spiking neural networks, 2022. URL https://arxiv.longhoe.net/abs/2201.07209.
Izhikevich [2003] Eugene M Izhikevich. Simple model of spiking neurons. IEEE Transactions on neural networks, 14(6):1569–1572, 2003.
Janzamin et al. [2019] Majid Janzamin, Rong Ge, Jean Kossaifi, and Anima Anandkumar. 2019.
Kamata et al. [2022] Hiromichi Kamata, Yusuke Mukuta, and Tatsuya Harada. Fully spiking variational autoencoder. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7059–7067, 2022.
Kim and Panda [2021] Youngeun Kim and Priyadarshini Panda. Optimizing Deeper Spiking Neural Networks for Dynamic Vision Sensing. Neural Networks, 144:686–698, 2021.
Kim et al. [2022] Youngeun Kim, Yuhang Li, Hyoungseob Park, Yeshwanth Venkatesha, and Priyadarshini Panda. Neural architecture search for spiking neural networks. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pages 36–56. Springer, 2022.
Kolda and Bader [2009a] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, August 2009a. 10.1137/07070111x. URL https://doi.org/10.1137/07070111x.
Kolda and Bader [2009b] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009b. 10.1137/07070111X. URL https://doi.org/10.1137/07070111X.
Kossaifi et al. [2017] Jean Kossaifi, Zachary C. Lipton, Arinbjorn Kolbeinsson, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks, 2017. URL https://arxiv.longhoe.net/abs/1707.08308.
Kossaifi et al. [2019] Jean Kossaifi, Antoine Toisoul, Adrian Bulat, Yannis Panagakis, Timothy M. Hospedales, and Maja Pantic. Factorized higher-order cnns with an application to spatio-temporal emotion estimation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6059–6068, 2019. URL https://api.semanticscholar.org/CorpusID:214743123.
Kossaifi et al. [2020] Jean Kossaifi, Antoine Toisoul, Adrian Bulat, Yannis Panagakis, Timothy M. Hospedales, and Maja Pantic. Factorized higher-order cnns with an application to spatio-temporal emotion estimation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020. 10.1109/cvpr42600.2020.00610. URL http://dx.doi.org/10.1109/CVPR42600.2020.00610.
[29] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Lapique [1907] Louis Lapique. Recherches quantitatives sur l’excitation electrique des nerfs traitee comme une polarization. Journal of Physiology and Pathology, 9:620–635, 1907.
Lau et al. [2024] Kin Wai Lau, Lai-Man Po, and Yasar Abbas Ur Rehman. Large separable kernel attention: Rethinking the large kernel attention design in cnn. Expert Systems with Applications, 236:121352, February 2024. ISSN 0957-4174. 10.1016/j.eswa.2023.121352. URL http://dx.doi.org/10.1016/j.eswa.2023.121352.
Li et al. [2017] Hongmin Li, Hanchao Liu, ** Shi. CIFAR10-DVS: An event-stream dataset for object classification. Frontiers in Neuroscience, 11, May 2017. 10.3389/fnins.2017.00309. URL https://doi.org/10.3389/fnins.2017.00309.
Li et al. [2021] Yuhang Li, Yufei Guo, Shanghang Zhang, Shikuang Deng, Yongqing Hai, and Shi Gu. Differentiable Spike: Rethinking Gradient-Descent for Training Spiking Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 23426–23439, 2021.
Liu and Parhi [2023] Xingyi Liu and Keshab K Parhi. Tensor decomposition for model reduction in neural networks: A review [feature]. IEEE Circuits and Systems Magazine, 23(2):8–28, 2023.
Maass [1997] Wolfgang Maass. Networks of spiking neurons: The third generation of neural network models. Neural Networks, 10(9):1659–1671, 1997. ISSN 0893-6080. https://doi.org/10.1016/S0893-6080(97)00011-7. URL https://www.sciencedirect.com/science/article/pii/S0893608097000117.
Meng et al. [2022] Qingyan Meng, Mingqing Xiao, Shen Yan, Yisen Wang, Zhouchen Lin, and Zhi-Quan Luo. Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation. ArXiv preprint arXiv:2205.00459, 2022.
Mnih et al. [2014] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention, 2014. URL https://arxiv.longhoe.net/abs/1406.6247.
Novikov et al. [2015] Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry P. Vetrov. Tensorizing neural networks. In Neural Information Processing Systems, 2015. URL https://api.semanticscholar.org/CorpusID:290242.
Orchard et al. [2015] Garrick Orchard, A**kya Jayawant, Gregory K. Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in Neuroscience, 9, November 2015. 10.3389/fnins.2015.00437. URL https://doi.org/10.3389/fnins.2015.00437.
Qiu et al. [2023] Xue-Rui Qiu, Zhao-Rui Wang, Zheng Luan, Rui-Jie Zhu, Xiao Wu, Ma-Lu Zhang, and Liang-Jian Deng. Vtsnn: A virtual temporal spiking neural network. Frontiers in neuroscience, 17:1091097, 2023.
Qiu et al. [2024] Xuerui Qiu, Rui-Jie Zhu, Yuhong Chou, Zhaorui Wang, Liang-jian Deng, and Guoqi Li. Gated attention coding for training high-performance and efficient spiking neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 601–610, 2024.
Rathi and Roy [2021] Nitin Rathi and Kaushik Roy. Diet-snn: A low-latency spiking neural network with direct input encoding and leakage and threshold optimization. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2021.
[44] Nitin Rathi, Gopalakrishnan Srinivasan, Priyadarshini Panda, and Kaushik Roy. Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. In International Conference on Learning Representations (ICLR).
Rueckauer et al. [2017] Bodo Rueckauer, Iulia-Alexandra Lungu, Yuhuang Hu, Michael Pfeiffer, and Shih-Chii Liu. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in Neuroscience, 11, December 2017. 10.3389/fnins.2017.00682. URL https://doi.org/10.3389/fnins.2017.00682.
Sidiropoulos et al. [2017] Nicholas D. Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E. Papalexakis, and Christos Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, 2017. 10.1109/TSP.2017.2690524.
Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2015.
Wang et al. [2023] Qingsong Wang, Zehui Liu, Chunfeng Cui, and Deren Han. Inertial accelerated sgd algorithms for solving large-scale lower-rank tensor cp decomposition problems. Journal of Computational and Applied Mathematics, 423:114948, 2023.
Wang et al. [2022] Yuchen Wang, Malu Zhang, Yi Chen, and Hong Qu. Signed neuron with memory: Towards simple, accurate and high-efficient ann-snn conversion. In International Joint Conference on Artificial Intelligence, 2022.
Wei et al. [2023] Wenjie Wei, Malu Zhang, Hong Qu, Ammar Belatreche, Jian Zhang, and Hong Chen. Temporal-coded spiking neural networks with dynamic firing threshold: Learning with event-driven backpropagation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10552–10562, 2023.
Wei et al. [2024] Wenjie Wei, Malu Zhang, Jilin Zhang, Ammar Belatreche, Jibin Wu, Zi**g Xu, Xuerui Qiu, Hong Chen, Yang Yang, and Haizhou Li. Event-driven learning for spiking neural networks. arXiv preprint arXiv:2403.00270, 2024.
Wu et al. [2021a] Jibin Wu, Yansong Chua, Malu Zhang, Guoqi Li, Haizhou Li, and Kay Chen Tan. A tandem learning rule for effective training and rapid inference of deep spiking neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2021a.
Wu et al. [2021b] Jibin Wu, Chenglin Xu, Xiao Han, Daquan Zhou, Malu Zhang, Haizhou Li, and Kay Chen Tan. Progressive tandem learning for pattern recognition with deep spiking neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7824–7840, 2021b.
Wu et al. [2018] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Lu** Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in Neuroscience, 12, May 2018. 10.3389/fnins.2018.00331. URL https://doi.org/10.3389/fnins.2018.00331.
Wu et al. [2019a] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Yuan ** Shi. Direct training for spiking neural networks: Faster, larger, better. Proceedings of the AAAI Conference on Artificial Intelligence, 33:1311–1318, July 2019a. 10.1609/aaai.v33i01.33011311. URL https://doi.org/10.1609/aaai.v33i01.33011311.
Wu et al. [2019b] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Yuan ** Shi. Direct Training for Spiking Neural Networks: Faster, Larger, Better. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 1311–1318, 2019b. 10.1609/aaai.v33i01.33011311.
Wu et al. [2021c] Zhenzhi Wu, Hehui Zhang, Yihan Lin, Guoqi Li, Meng Wang, and Ye Tang. LIAF-Net: Leaky Integrate and Analog Fire Network for Lightweight and Efficient Spatiotemporal Information Processing. IEEE Transactions on Neural Networks and Learning Systems, pages 1–14, 2021c. 10.1109/TNNLS.2021.3073016.
Wu et al. [2022] Zhong-Cheng Wu, Ting-Zhu Huang, Liang-Jian Deng, Hong-Xia Dou, and Deyu Meng. Tensor wheel decomposition and its tensor completion application. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 27008–27020. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/acbfe708197ff78ad04cc1beb1710979-Paper-Conference.pdf.
Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ArXiv, abs/1708.07747, 2017. URL https://api.semanticscholar.org/CorpusID:702279.
Xu et al. [2023] Zhihao Xu, Zhiqiang Lv, Benjia Chu, and Jianbo Li. Fast autoregressive tensor decomposition for online real-time traffic flow prediction. Knowledge-Based Systems, 282:111125, December 2023. ISSN 0950-7051. 10.1016/j.knosys.2023.111125. URL http://dx.doi.org/10.1016/j.knosys.2023.111125.
Yang and Hospedales [2016] Yongxin Yang and Timothy M. Hospedales. Deep multi-task representation learning: A tensor factorisation approach. ArXiv, abs/1605.06391, 2016. URL https://api.semanticscholar.org/CorpusID:3047732.
Yao et al. [2021] Man Yao, Huanhuan Gao, Guangshe Zhao, Dingheng Wang, Yihan Lin, Zhaoxu Yang, and Guoqi Li. Temporal-wise attention spiking neural networks for event streams classification. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, October 2021. 10.1109/iccv48922.2021.01006. URL https://doi.org/10.1109/iccv48922.2021.01006.
Yao et al. [2023] Man Yao, Guangshe Zhao, Hengyu Zhang, Yifan Hu, Lei Deng, Yonghong Tian, Bo Xu, and Guoqi Li. Attention spiking neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
Zhan et al. [2023] Qiugang Zhan, Guisong Liu, Xiurui Xie, Malu Zhang, and Guolin Sun. Bio-inspired active learning method in spiking neural network. Knowledge-Based Systems, 261:110193, February 2023. ISSN 0950-7051. 10.1016/j.knosys.2022.110193. URL http://dx.doi.org/10.1016/j.knosys.2022.110193.
Zhang et al. [2021] Malu Zhang, Jiadong Wang, Jibin Wu, Ammar Belatreche, Burin Amornpaisannon, Zhixuan Zhang, Venkata Pavan Kumar Miriyala, Hong Qu, Yansong Chua, Trevor E Carlson, et al. Rectified linear postsynaptic potential function for backpropagation in deep spiking neural networks. IEEE transactions on neural networks and learning systems, 33(5):1947–1958, 2021.
Zhang and Li [2019] Wenrui Zhang and Peng Li. Spike-train level backpropagation for training deep recurrent spiking neural networks. Curran Associates Inc., Red Hook, NY, USA, 2019.
Zheng et al. [2020] Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li. Going deeper with directly-trained larger spiking neural networks, 2020. URL https://arxiv.longhoe.net/abs/2011.05280.
Zheng et al. [2021] Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li. Going Deeper With Directly-Trained Larger Spiking Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 11062–11070, 2021.
Zhou et al. [2023] Zhaokun Zhou, Yuesheng Zhu, Chao He, Yaowei Wang, Shuicheng YAN, Yonghong Tian, and Li Yuan. Spikformer: When spiking neural network meets transformer. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=frE4fUwz_h.
Zhu et al. [2022] Rui-Jie Zhu, Qihang Zhao, Tian**g Zhang, Haoyu Deng, Yule Duan, Malu Zhang, and Liang-Jian Deng. Tcja-snn: Temporal-channel joint attention for spiking neural networks, 2022. URL https://arxiv.longhoe.net/abs/2206.10177.