License: CC BY 4.0
arXiv:2310.14576v2 [cs.LG] 11 Apr 2024
Corresponding authors: Malu Zhang, Liang-Jian Deng

Tensor Decomposition Based Attention Module for Spiking Neural Networks

Haoyu Deng, Ruijie Zhu, Xuerui Qiu, Yule Duan, Malu Zhang, Liang-Jian Deng
University of Electronic Science and Technology of China, 611731, China
University of California, Santa Cruz, 95064, The United States

Abstract

The attention mechanism has been proven to be an effective way to improve the performance of spiking neural networks (SNNs). However, examining existing attention modules from the perspective of tensor decomposition, we find that the rank of the attention maps generated by previous methods is fixed at 1, which leaves no flexibility to adjust for specific tasks. To tackle this problem, we propose an attention module, namely Projected-full Attention (PFA), where the rank of the generated attention maps can be determined based on the characteristics of different tasks. Additionally, the parameter count of PFA grows linearly with the data scale. PFA is composed of the linear projection of spike tensor (LPST) module and the attention map composing (AMC) module. In LPST, we start by compressing the original spike tensor into three projected tensors, using learnable parameters for each dimension. Then, in AMC, we exploit the inverse procedure of the tensor decomposition process to combine the three tensors into the attention map using a so-called connecting factor. To validate the effectiveness of the proposed PFA module, we integrate it into the widely used VGG and ResNet architectures for classification tasks. Our method achieves state-of-the-art performance on both static and dynamic benchmark datasets, surpassing the existing SNN models with Transformer-based and CNN-based backbones. Code for PFA is available at https://github.com/RisingEntropy/PFA .

keywords:
Spiking neural network, Attention mechanism, Tensor decomposition, Neuromorphic computing

1 Introduction

Spiking neural networks (SNNs) are currently attracting the interest of academics due to their lower energy consumption and greater bio-interpretability compared to traditional artificial neural networks (ANNs) [19; 36]. Their brain-inspired computation model gives them significant potential in processing temporal data. Though training a deep SNN is still a challenge, recent developments in SNNs introduce backpropagation [3; 55; 65; 50; 53; 41] to relieve this issue to some degree. Moreover, backpropagation makes it possible to extend ANN modules, such as batch normalization and residual blocks, to SNNs.

As it is now feasible to incorporate ANN modules into SNNs while maintaining the inherent power efficiency, we can leverage these modules to augment the overall performance of SNNs. A good example is the introduction of the residual architecture to SNNs. The well-known residual block proposed by He et al. [14] makes it possible to train an ultra-deep ANN. To introduce the residual block into the field of SNNs, many variations [17; 11; 18] have been presented from different perspectives, solving problems like gradient vanishing. Moreover, strategies in ANNs, such as normalization [67] and architecture design, can also effectively improve the performance of SNNs, which motivates us to explore more useful techniques from ANNs for building better SNNs.

Figure 1: Accuracy on CIFAR10 (left) and CIFAR100 (right). Compared with other methods, PFA significantly improves network performance.

Apart from improving network architectures, numerous other approaches can improve the performance of a network. As one of the most promising techniques in ANNs, the attention mechanism can mimic the human ability to focus on certain things while ignoring others. Three representative works [62; 70; 4] have recently shown that the attention mechanism can successfully work in SNNs. In particular, Yao et al. [62] switch channel attention to the temporal dimension, revealing that paying attention to the temporal dimension is effective for performance enhancement. Zhu et al. [70] apply two 1D convolutions to a generated 2D tensor to achieve temporal-channel attention. Inspired by experimental observations on predictive attentional remapping, Cai et al. [4] design the SCTFA module to assess the input tensor from temporal, channel, and spatial dimensions, achieving attention among these three dimensions.

Current software frameworks, e.g., PyTorch and TensorFlow, are accelerated by GPUs through parallel computation. This makes it hard to realize event-driven SNNs, because events occur asynchronously. As a trade-off within these frameworks, a popular approach is to split the input event streams into slices and compose them into tensors for convenient later computation. This approach is adopted by all attention-mechanism-related works and shows promising results. Since we are essentially dealing with tensor-formatted data flow, it is natural to apply tensor analysis theory when implementing an attention module for SNNs, an angle that previous work has ignored. By utilizing mathematical tools such as tensor decomposition theory, we can examine attention modules from a more abstract perspective, thereby identifying the shortcomings and limitations of previous methods.

By expressing the composition of attention maps in previous methods in CANDECOMP/PARAFAC (CP) decomposition form, we found that they are essentially ‘rank-1’ methods. When applied to various tasks, they lack a certain degree of flexibility. In this paper, we introduce a module called Projected-Full Attention (PFA), which is capable of generating attention maps with a rank other than 1. Specifically, PFA consists of two components: Linear Projection of Spike Tensor (LPST) and Attention Map Composing (AMC). LPST is responsible for generating $R$ sets of projections from the input tensor using a small number of parameters. AMC is responsible for creating attention maps from these $R$ sets of projections, where $R$ is called the connecting factor. It plays the role of the rank in the CP decomposition and is a hyper-parameter that can be adjusted based on the specific task.

The key contributions of this paper are outlined as follows:

  • We propose PFA, a module that can achieve temporal-channel-spatial attention. In contrast to previous approaches, the ranks of the attention maps generated by PFA are not fixed at 1; instead, the rank can be chosen based on the specific task. Furthermore, the parameter count of PFA grows linearly with the data scale, and the computational cost is equivalent to that of a single standard convolution layer.

  • We present a comprehensive theoretical analysis, focusing on two critical aspects: the rank of tensors and the connecting factor $R$. These analyses lead to a tailored selection criterion, providing effective insights into experimental outcomes and offering valuable guidance for future applications.

  • We conduct thorough experiments on both static and dynamic datasets. The results demonstrate the effectiveness of PFA which achieves state-of-the-art (SOTA) accuracy on both dynamic and static datasets. Ablation studies further validate the structural reasonableness of PFA. Additionally, we visualize the attention map generated by PFA to intuitively illustrate attention distribution across temporal, channel, and spatial dimensions.

2 Related Works

Spiking Neural Network: Spiking Neural Networks (SNNs), the third generation of neural networks [19; 36], offer a closer emulation of the human brain’s efficiency by utilizing discrete spikes for information transmission [64]. This method of communication saves more energy than traditional Artificial Neural Networks (ANNs), which rely on continuous signals. The temporal representation capabilities of SNNs have been a focal point in research [62; 8; 64], drawing from seminal neurodynamic models such as the Hodgkin-Huxley (H-H) model [15] and further developed in models such as Izhikevich [19] and Leaky Integrate and Fire (LIF) [31]. These models underscore SNNs’ potential in capturing time-dependent patterns in data, a feature not inherently present in ANNs. The challenges in training SNNs, due to their non-differentiable spiking behavior, have also been addressed in recent literature, with gradient-based optimization methods [54; 56; 51] and ANN-to-SNN conversion techniques [45; 9; 49; 52] being key developments. In addition to research on training methods, many other studies are dedicated to improving the performance of SNNs, for example through attention mechanisms.

Attention Mechanism in SNNs: Since Google proposed recurrent models of visual attention in 2014 [38], the attention mechanism, which gives distinct parts of the input data varied weights, has become a potent tool for increasing the performance of a neural network. For instance, the squeeze-and-excitation network (SENet) [16] gives different weights to different channels of the input data and significantly improves performance. It is therefore of vital significance to introduce the attention mechanism into SNNs. Qiu et al. apply the attention mechanism to the encoding layers of SNNs [42]. To achieve temporal-wise attention, TA-SNN [62] is proposed, inspired by SENet. Besides, TCJA [70] extends the attention mechanism by using two 1D convolutions to attend to both the channel dimension and the temporal dimension. Nevertheless, these methods fail to jointly consider three crucial aspects of SNNs, i.e., the temporal, channel, and spatial dimensions. By adapting SENet while retaining its effect in the temporal direction, Cai et al. [4] and Yao et al. [63] successfully apply temporal-channel-spatial attention to SNNs. While these earlier methods have been very successful, they do not take into account the fact that the input data is a high-order tensor. Examining data from a tensor perspective can provide a more mathematically abstract viewpoint, revealing additional properties. As described in Section 3.1, the methods employed by previous researchers result in ‘rank-1’ attention maps, lacking the flexibility and specificity needed for different tasks.

Tensor Decomposition: Tensor decomposition methods [20; 46; 25] have undergone years of development and achieved remarkable success. Many prior studies [28; 61; 39; 26; 2; 35; 48] have embraced these techniques to optimize neural network architectures or uncover versatile modules with multiple functionalities. Kossaifi et al., for instance, conducted a comprehensive review of convolution kernels from the perspective of tensor decomposition and introduced a convolution module that extends to higher dimensions. Lau et al. propose a new decomposition method that enables the direct use of depth-wise convolutional layers with large kernels in the attention module, without requiring any extra blocks [32]. Apart from improving network structures, tensor decomposition theory can also impose certain constraints on the data itself [58; 60; 5]. Xu et al. leverage tensor decomposition to predict real-time traffic flow, demonstrating its effectiveness in handling complex patterns [60]. Chen et al. employ tensor CP decomposition to generate an attention map for segmentation, achieving significant improvements [5]. These methods either use tensor decomposition to reduce the parameter count in processing high-dimensional data or leverage low-rank properties to enhance performance. In this paper, drawing inspiration from these prior endeavors, we apply tensor decomposition methods to SNNs and design a lightweight attention module that is both parameter-efficient and computationally efficient.

3 Motivation and Method

3.1 Motivation

Tensor CP decomposition [24] is a mathematical technique that decomposes a high-order tensor into a sum of rank-one tensors. For a tensor $\mathcal{X}\in\mathbb{R}^{D_{1}\times D_{2}\times\dots\times D_{n}}$, the CP decomposition can be expressed as:

$$\mathcal{X}\approx\sum_{r=1}^{R}U_{1,r}^{(D_{1})}\circ U_{2,r}^{(D_{2})}\circ\dots\circ U_{n,r}^{(D_{n})}. \tag{1}$$

In this equation, $U_{1,r}^{(D_{1})}\circ U_{2,r}^{(D_{2})}\circ\dots\circ U_{n,r}^{(D_{n})}$ represents the $r$-th rank-one tensor, the symbol $\circ$ denotes the outer product of vectors, and $R$ denotes the rank. Specifically, $U_{n,r}^{(D_{n})}$ is the vector corresponding to the $n$-th dimension in the $r$-th rank-one tensor, with a dimensionality of $D_{n}$.

Rethinking previous attention modules [62; 70; 4] from the perspective of CP decomposition, we find that they are all rank-one methods. For the sake of clarity, we denote an all-one vector with a dimensionality of $D$ as $I^{(D)}$. Specifically, for TA-SNN [62], the temporal attention module for SNNs yields the attention map $\mathcal{A}_{\text{TA-SNN}}$ for the input tensor $\mathcal{X}$ as follows:

$$\mathcal{A}_{\text{TA-SNN}}=I^{(HW)}_{S}\circ I^{(C)}\circ\mathcal{F}_{T}^{(T)}(\mathcal{X}). \tag{2}$$

Here, $\mathcal{F}_{T}^{(T)}(\mathcal{X})$ represents a function that generates the temporal attention vector with a dimensionality of $T$. The same mathematical form can also be used to express the TCJA [70] and SCTFA [4] methods. For TCJA, the attention map $\mathcal{A}_{\text{TCJA}}$ can be written as:

$$\mathcal{A}_{\text{TCJA}}=I^{(HW)}_{S}\circ\mathcal{F}^{(C)}_{C}(\mathcal{X})\circ\mathcal{F}^{(T)}_{T}(\mathcal{X}). \tag{3}$$

For SCTFA, the attention map $\mathcal{A}_{\text{SCTFA}}$ is:

$$\mathcal{A}_{\text{SCTFA}}=\mathcal{F}^{(HW)}_{S}(\mathcal{X})\circ\mathcal{F}^{(C)}_{C}(\mathcal{X})\circ\mathcal{F}^{(T)}_{T}(\mathcal{X}). \tag{4}$$

Comparing Equation 1 with Equations 2, 3, and 4, we can see that $R$ in Equations 2, 3, and 4 is fixed at 1, making them special cases of Equation 1. This motivates us to design an attention module in which the rank of the attention map is not fixed at 1, allowing us to control the attention effect by adjusting the rank; namely, we wish our proposed module to take the form:

$$\mathcal{A}_{PFA}=\sum_{r=1}^{R}U^{(HW)}_{s,r}\circ U^{(C)}_{c,r}\circ U^{(T)}_{t,r}. \tag{5}$$

In this way, we can use $R$ to control certain properties of the attention map, such as its low-rank structure and the degree of compression of the original tensor. From the perspective of tensor decomposition, previous methods can be seen as special cases of our method with $R=1$. Because previous methods lack the summation terms, they cannot be adjusted to the characteristics of a specific dataset and therefore lack flexibility. This flexibility is crucial, as demonstrated by the experiments presented in Section 4.4: the results indicate that the optimal choice of $R$ varies across datasets.
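The rank argument above can be checked numerically. The following NumPy sketch is illustrative only and is not taken from the PFA code base: the helper names `cp_compose` and `unfolding_rank`, the tensor sizes, and the random factor vectors are all assumptions. It builds one map from a single outer product (the form of Equations 2 to 4) and one from three summed outer products (the form of Equation 5), and compares the matrix rank of their mode-1 unfoldings, which is a lower bound on the CP rank.

```python
import numpy as np

def cp_compose(U_s, U_c, U_t):
    """Sum of R outer products, in the form of Equation 5.

    U_s: (HW, R), U_c: (C, R), U_t: (T, R); column r holds the r-th vectors.
    Returns a tensor of shape (HW, C, T).
    """
    return np.einsum("sr,cr,tr->sct", U_s, U_c, U_t)

def unfolding_rank(A):
    """Matrix rank of the mode-1 unfolding, a lower bound on the CP rank of A."""
    return int(np.linalg.matrix_rank(A.reshape(A.shape[0], -1)))

rng = np.random.default_rng(0)
HW, C, T = 6, 5, 4  # illustrative sizes, not taken from the paper

# R = 1: a single outer product, as in Equations 2 to 4.
rank1_map = cp_compose(rng.random((HW, 1)), rng.random((C, 1)), rng.random((T, 1)))

# R = 3: three summed outer products, as in Equation 5.
rank3_map = cp_compose(rng.random((HW, 3)), rng.random((C, 3)), rng.random((T, 3)))

print(unfolding_rank(rank1_map), unfolding_rank(rank3_map))
```

With generic (random) factor vectors, the unfolding of the single-term map cannot have rank above 1, whereas the three-term map can represent structure that no single outer product captures.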

3.2 Projected Full-Attention (PFA)

Figure 2: A simple comparison curve of parameter quantity growth among TCJA [70], TA-SNN [62], and PFA. The parameter scale of PFA increases linearly. In this figure, the channel number is fixed at 128.
Figure 3: The detailed workflow of PFA. The input tensor is first sent to the Linear Projection of Spike Tensor (LPST) module, which generates three projections and splits them into the corresponding vectors. In the Attention Map Composing (AMC) module, these vectors are composed into the final attention map through the inverse process of CP decomposition. The attention map is fused with the input tensor by a Hadamard product to obtain the refined tensor.

In this section, we present an overview of our proposed PFA module and detailed descriptions of its two sub-modules, Linear Projection of Spike Tensor (LPST) and Attention Map Composing (AMC). LPST, as its name suggests, produces projections of the input tensor $\mathcal{X}$ in the form of matrices for AMC. The AMC module splits the matrices from LPST into vectors to compose the final attention map.

Figure 4: Comparison of accuracy between the vanilla VGG network and VGG network with PFA modules on training and validation sets. While the training set accuracy remains consistently high for both models, the validation set accuracy shows a significant improvement with the addition of PFA modules, suggesting its efficacy in mitigating overfitting.

The core of the attention mechanism is to give essential parts of the input data higher weights while ignoring less-related parts to some degree. This is particularly important for SNNs, especially for data collected by DVS cameras. Data from DVS cameras often contains more noise, which greatly affects network performance. Thanks to the approximate nature of CP decomposition, PFA can effectively suppress noise. Also, as shown in Figure 4, the inexact representation of the input tensor can help reduce overfitting, which is very severe in SNNs. In our proposed PFA, we first design the LPST module to project the higher-order tensor (generally 3D) to three lower-order tensors, i.e., 2D matrices. Using learnable parameters, LPST can selectively retain essential information while ignoring irrelevant information. In AMC, we split each matrix into $R$ vectors, where $R$ is called the connecting factor. Three vectors, one from each of the temporal, channel, and spatial matrices, are combined by the outer product to compose one part of the final attention map. The final attention map is the sum of these $R$ parts. We fuse this attention map with the original input to obtain the refined tensor.

The workflow of PFA is plotted in Figure 3. In what follows, we will give the details of LPST and AMC.

3.2.1 Linear Projection of Spike Tensor (LPST)

Tensor decomposition is a vital approach to represent a tensor and its latent features. Therefore, it is natural for us to utilize tensor-decomposition-related techniques to explore the data characteristics. A typical technique is the CP decomposition. Whatever way we adopt, it is essential to generate a ‘projection’ to extract significant components of the input tensor to yield an attention map later.

In this part, we need to construct three linear projections of the input data stream along spatial, channel, and temporal dimensions. In LPST, we apply two fully connected (FC) layers and a convolution layer to project the input tensor for simplicity. Note that here we simply use single-layer FCs and a convolution layer. What we do here is just a simple projection with learnable parameters rather than complex function fitting. In this way, we retain the required information and reduce the number of parameters.

For the temporal dimension, we first squeeze the input tensor ($H\times W\times C\times T$) in the following way:

$$Y^{T}_{(c,t)}=\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\mathcal{X}_{(c,t,i,j)}, \tag{6}$$

where $Y^{T}_{(c,t)}$ denotes the output matrix, $H$ and $W$ denote the spatial sizes, and $\mathcal{X}_{(c,t,i,j)}$ denotes the input tensor. After the squeeze operation, the tensor is converted to a matrix of shape $C\times T$. Then, we apply an FC layer that retains $R$ components of information in each column. A sigmoid activation function is applied after the FC layer. Finally, we obtain a projection matrix of shape $R\times T$.

For the channel dimension, the operation is similar to that of the temporal dimension. We squeeze the input tensor by:

$$Y^{C}_{(c,t)}=\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\mathcal{X}_{(c,t,i,j)}, \tag{7}$$

where $Y^{C}_{(c,t)}$ denotes the output matrix and $C$ denotes the number of channels. Then, we pass the $T\times C$ matrix through an FC layer, followed by the sigmoid activation function, to acquire the projected matrix of shape $R\times C$.

For the spatial dimension, the channel dimension is squeezed by:

$$Y^{S}_{(t,x,y)}=\frac{1}{C}\sum_{k=1}^{C}\mathcal{X}_{(k,t,x,y)}. \tag{8}$$

Afterward, we apply a convolution to the squeezed tensor, treating $T$ as the number of input channels and $R$ as the number of output channels. Then, we reshape the output tensor of the convolution into a matrix of shape $HW\times R$ and apply the sigmoid activation function.
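The three projections can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `lpst`, the random weights standing in for learned parameters, the omission of biases, and the use of a 1x1 convolution (kernel size $k=1$) are all simplifications introduced here. The tensor is stored in the index order $(c,t,i,j)$ used by Equations 6 to 8, and the channel projection is taken to have shape $R\times C$ so that each of its rows has one entry per channel.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def lpst(X, W_t, W_c, W_s):
    """Project X of shape (C, T, H, W) into spatial, channel, temporal matrices.

    W_t: (R, C) weights standing in for the FC layer applied to Y^T.
    W_c: (R, T) weights standing in for the FC layer applied to Y^C.
    W_s: (R, T) weights of a 1x1 convolution (T input, R output channels).
    Biases are omitted in this sketch.
    """
    C, T, H, W = X.shape
    R = W_s.shape[0]
    Y_t = X.mean(axis=(2, 3))   # Equation 6: spatial average, shape (C, T)
    Y_c = Y_t.T                 # Equation 7, arranged as a (T, C) matrix
    Y_s = X.mean(axis=0)        # Equation 8: channel average, shape (T, H, W)

    P_t = sigmoid(W_t @ Y_t)    # (R, C) @ (C, T) -> (R, T)
    P_c = sigmoid(W_c @ Y_c)    # (R, T) @ (T, C) -> (R, C)
    conv = np.einsum("rt,thw->rhw", W_s, Y_s)   # 1x1 convolution over T
    P_s = sigmoid(conv.reshape(R, H * W).T)     # (HW, R)
    return P_s, P_c, P_t

rng = np.random.default_rng(0)
C, T, H, W, R = 8, 4, 5, 5, 3                       # illustrative sizes
X = (rng.random((C, T, H, W)) < 0.2).astype(float)  # random binary spikes
P_s, P_c, P_t = lpst(
    X,
    rng.standard_normal((R, C)),
    rng.standard_normal((R, T)),
    rng.standard_normal((R, T)),
)
print(P_s.shape, P_c.shape, P_t.shape)
```

Each projection is a plain linear map followed by a sigmoid, which mirrors the statement above that LPST performs a simple learnable projection rather than complex function fitting.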

3.2.2 Attention Map Composing (AMC)

Once we obtain the three projections corresponding to the three dimensions, the next step is to compose an attention map from them. Here, we adopt the reverse process of the tensor CP decomposition. Following the form of the CP decomposition, the attention map $\mathcal{A}\in\mathbb{R}^{HW\times C\times T}$ is composed as:

$$\mathcal{A}_{s,c,t}=\sum_{r=1}^{R}U^{(HW)}_{s,r}\circ U^{(C)}_{c,r}\circ U^{(T)}_{t,r}, \tag{9}$$

where $U^{(HW)}_{s,r}$, $U^{(C)}_{c,r}$, and $U^{(T)}_{t,r}$ denote the $r$-th vectors in the spatial, channel, and temporal projection tensors, respectively, $\mathcal{A}$ is the attention map, and $\circ$ is the outer product of vectors. In particular, the connecting factor $R$ can be viewed as the rank in the tensor CP decomposition (this process is plotted in Figure 3). It is worth noting that the tensor CP decomposition is only an approximate representation of a tensor, which suits our purpose. $R$ controls the accuracy of our representation of the input tensor: the larger $R$ is, the more accurate the representation, and vice versa, which realizes the attention mechanism we need. In the experiment part, we discuss the choice of the connecting factor $R$ and its effect in detail.
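The composition in Equation 9 and the Hadamard fusion described in the caption of Figure 3 can be sketched as follows. The function names `amc` and `apply_attention`, the matrix layouts ($HW\times R$, $R\times C$, $R\times T$), the $(C, T, H, W)$ storage order of the input, and the random projections are assumptions made for this illustration, not details confirmed by the source.

```python
import numpy as np

def amc(P_s, P_c, P_t):
    """Compose the attention map of Equation 9.

    P_s: (HW, R), P_c: (R, C), P_t: (R, T).
    Returns A of shape (HW, C, T), the sum over r of the outer products
    of the r-th spatial, channel, and temporal vectors.
    """
    return np.einsum("sr,rc,rt->sct", P_s, P_c, P_t)

def apply_attention(X, A):
    """Fuse A of shape (HW, C, T) with X of shape (C, T, H, W) element-wise."""
    C, T, H, W = X.shape
    # Flat spatial index s corresponds to (i, j) with s = i * W + j.
    A_cthw = A.reshape(H, W, C, T).transpose(2, 3, 0, 1)
    return X * A_cthw

rng = np.random.default_rng(0)
C, T, H, W, R = 3, 2, 2, 2, 2   # illustrative sizes
P_s = rng.random((H * W, R))
P_c = rng.random((R, C))
P_t = rng.random((R, T))
X = rng.random((C, T, H, W))

A = amc(P_s, P_c, P_t)
refined = apply_attention(X, A)

# Cross-check against an explicit sum of R rank-one tensors.
explicit = sum(
    np.multiply.outer(np.multiply.outer(P_s[:, r], P_c[r]), P_t[r])
    for r in range(R)
)
print(np.allclose(A, explicit), refined.shape)
```

Writing the composition as a single `einsum` avoids materializing the $R$ intermediate rank-one tensors separately, although the arithmetic is the same as in the explicit sum.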

3.3 Parameter and Computational Cost Analysis

This section presents an analysis of the parameter and computational cost of PFA. PFA has an advantage in terms of parameter quantity. For an input tensor of shape $H\times W\times C\times T$, the overall parameter count is:

$$\underbrace{C\times R}_{\text{MLP of }Y^{T}}+\underbrace{T\times R}_{\text{MLP of }Y^{C}}+\underbrace{k^{2}\times T\times R}_{\text{Conv of }Y^{S}}, \tag{10}$$

where $k$ is the convolution kernel size. Note that $R$ is limited to a small fixed value according to our theoretical analysis in Section 3.4, so the parameter count grows linearly with $T$ and $C$. Figure 2 shows the parameter growth curve of PFA and compares it with several other techniques.
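Equation 10 is simple enough to evaluate directly. The helper below is a hypothetical convenience function, and the values $T=4$, $R=4$, and $k=3$ are illustrative choices rather than settings reported in the paper; only $C=128$ follows the setting of Figure 2. Biases are not counted, matching the three terms of the equation.

```python
def pfa_param_count(C, T, R, k):
    """Parameter count of Equation 10 (biases not counted)."""
    fc_temporal = C * R        # FC layer applied to Y^T
    fc_channel = T * R         # FC layer applied to Y^C
    conv_spatial = k * k * T * R  # convolution applied to Y^S
    return fc_temporal + fc_channel + conv_spatial

base = pfa_param_count(C=128, T=4, R=4, k=3)
doubled_T = pfa_param_count(C=128, T=8, R=4, k=3)
print(base, doubled_T, doubled_T - base)
```

Increasing $T$ by 4 adds $4\times R\times(1+k^{2})$ parameters regardless of the starting value of $T$, which is the linear growth stated above.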

PFA is also lightweight in computational burden. The time complexity of PFA can be calculated in the following way:

$$\underbrace{3HWTC}_{\text{Obtain } Y^{T}, Y^{C}, Y^{S}} + \underbrace{2TCR}_{\text{MLP of } Y^{T}, Y^{C}} + \underbrace{HWk^{2}TR}_{\text{Conv of } Y^{S}} + \underbrace{RHWTC}_{\text{AMC}}. \tag{11}$$

It is worth noting that a standard convolution operation typically entails a computational burden of $HWk^{2}TC_{in}C_{out}$. In this context, given that $R$ is not large, a PFA module incurs a smaller computational load than a convolution operation, since $C$ is usually in the range of several hundred.
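The comparison between Eq. (11) and the cost of a standard convolution can be checked numerically. The sketch below is only an operation-count calculator; the function names and the layer sizes are hypothetical and are not taken from the paper's experiments.

```python
def pfa_ops(H, W, T, C, R, k):
    """Operation count of one PFA module, following the four terms of Eq. (11)."""
    project = 3 * H * W * T * C       # obtain Y^T, Y^C, Y^S
    mlps = 2 * T * C * R              # MLPs of Y^T and Y^C
    conv = H * W * k * k * T * R      # convolution of Y^S
    amc = R * H * W * T * C           # attention map composing (AMC)
    return project + mlps + conv + amc

def conv_ops(H, W, T, C_in, C_out, k):
    """Operation count of a standard convolution: H W k^2 T C_in C_out."""
    return H * W * k * k * T * C_in * C_out

# Hypothetical layer: 48x48 feature map, 10 time steps, 128 channels, R = 8, 3x3 kernel.
H, W, T, C, R, k = 48, 48, 10, 128, 8, 3
ratio = pfa_ops(H, W, T, C, R, k) / conv_ops(H, W, T, C, C, k)
print(f"PFA / convolution cost ratio: {ratio:.4f}")
```

The dominant PFA terms scale with $HWTC$, while the convolution scales with $HWTC_{in}C_{out}k^{2}$, so the ratio shrinks as the channel count grows.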

3.4 Theoretical Analysis on $R$

Figure 5: A toy example of the effect of different ranks on the approximation outcome. The error is measured under the $\ell_{2}$ norm and starts to converge when the rank exceeds 30.

In light of the fact that CP decomposition provides only an approximate representation of a tensor, and the quality of this approximation relies on the chosen factor $R$, it is essential to establish a clear guiding principle for determining the optimal value of $R$. This section delves into our approach for gaining insights into the characteristics of PFA and streamlining its future applications.

Our exploration begins by adopting the methodology put forth by Acar et al. [1] to investigate the rank of the input data. However, due to the recognized NP-hard nature of computing the CP decomposition of a tensor [13; 24], a direct calculation from the tensor becomes impractical. To surmount this challenge, we employ a gradient descent technique to approximate the original tensor by systematically experimenting with different rank values, ultimately selecting the one that yields the best approximation for the tensor. To provide further clarity, consider an element of the tensor represented as:

$$x_{i,j,k} \approx \sum_{r=1}^{R} a_{i,r} b_{j,r} c_{k,r}. \tag{12}$$

We treat this as an optimization problem, with $a_{i,r}$, $b_{j,r}$, and $c_{k,r}$ considered as parameters. We then minimize the loss function defined as:

$$\mathcal{L} = \frac{1}{2} \sum_{(i,j,k) \in \Omega} \left( x_{i,j,k} - \sum_{r=1}^{R} a_{i,r} b_{j,r} c_{k,r} \right)^{2}. \tag{13}$$

To perform optimization, we compute the first-order partial derivative of the loss with respect to a parameter (e.g., $a_{i,r}$), given by:

$$\frac{\partial \mathcal{L}}{\partial a_{i,r}} = -\frac{1}{2} \cdot 2 \cdot \left( x_{i,j,k} - \sum_{r=1}^{R} a_{i,r} b_{j,r} c_{k,r} \right) \cdot b_{j,r} c_{k,r}. \tag{14}$$
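Eq. (14) gives the contribution of a single observed entry $(i,j,k)$. Under the loss in Eq. (13), summing that contribution over all observed entries that share the index $i$ yields the full partial derivative. In the worked form below, the inner summation index is renamed $r'$ (a notational choice made here, not in the original) to keep it distinct from the fixed component $r$:

```latex
\frac{\partial \mathcal{L}}{\partial a_{i,r}}
  = -\sum_{(j,k)\,:\,(i,j,k) \in \Omega}
    \left( x_{i,j,k} - \sum_{r'=1}^{R} a_{i,r'} b_{j,r'} c_{k,r'} \right) b_{j,r} c_{k,r}.
```

The derivatives with respect to $b_{j,r}$ and $c_{k,r}$ follow by symmetry, with the residual multiplied by $a_{i,r} c_{k,r}$ and $a_{i,r} b_{j,r}$, respectively.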

Finally, we apply the gradient descent method for optimization:

$$a_{i,r}^{\prime} \leftarrow a_{i,r} - \mu \cdot \frac{\partial \mathcal{L}}{\partial a_{i,r}}, \tag{15}$$

Here, $\mu$ represents the step size used in the gradient descent process. This approach allows us to effectively estimate the rank of the tensor while circumventing the computational complexities associated with CP decomposition.
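The procedure in Eqs. (12) to (15) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' code: the tensor is fully observed (so $\Omega$ contains every index), all three factor matrices are updated simultaneously, the input is a small random tensor rather than a CIFAR10DVS sample, and the step size and iteration count are chosen for this toy problem. The helper names `cp_reconstruct`, `cp_grads`, and `cp_fit` are hypothetical.

```python
import numpy as np

def cp_reconstruct(A, B, C):
    """Rank-R approximation of Eq. (12): x_ijk ~ sum_r A[i,r] * B[j,r] * C[k,r]."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def cp_grads(X, A, B, C):
    """Gradients of the loss in Eq. (13) with respect to the three factor matrices."""
    resid = X - cp_reconstruct(A, B, C)
    grad_A = -np.einsum("ijk,jr,kr->ir", resid, B, C)
    grad_B = -np.einsum("ijk,ir,kr->jr", resid, A, C)
    grad_C = -np.einsum("ijk,ir,jr->kr", resid, A, B)
    return grad_A, grad_B, grad_C

def cp_fit(X, R, mu=0.01, steps=500, seed=0):
    """Gradient descent as in Eq. (15); returns the factors and the final l2 error."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.normal(scale=0.5, size=(n, R)) for n in X.shape)
    for _ in range(steps):
        grad_A, grad_B, grad_C = cp_grads(X, A, B, C)
        A, B, C = A - mu * grad_A, B - mu * grad_B, C - mu * grad_C
    return A, B, C, np.linalg.norm(X - cp_reconstruct(A, B, C))

# Synthetic stand-in for a data sample; try several ranks and compare the errors.
X = np.random.default_rng(1).random((4, 5, 6))
for R in (1, 2, 4):
    print(R, cp_fit(X, R)[3])
```

Sweeping $R$ and watching where the error stops improving is the rank-estimation idea described above; on real data the step size and number of iterations would need tuning.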

We take a piece of data from the CIFAR10DVS dataset as a toy example and estimate its rank. In this case, we adopt $\mu = 0.0001$ and iterate the process 1000 times to get the final optimized parameters. We use the $\ell_{2}$ norm as the metric to judge the approximation quality. As shown in Figure 5, the error decreases as the rank increases and starts to converge when the rank exceeds 30. This toy example shows that the rank of the input tensor is around 30, indicating that we need a rank exceeding 30 to represent it precisely using CP decomposition.

It should be noted, however, that accurately representing the original tensor is not conducive to improving network performance through attention. The core idea of the attention mechanism is to focus on the important parts while ignoring the secondary ones, so representing raw tensors too precisely is counterproductive. On the one hand, an overly precise characterization of the input tensor makes it harder to neglect unimportant parts. On the other hand, it may lead to severe overfitting. As shown in Figure 4, the vanilla VGG network exhibits severe overfitting, achieving nearly 100% accuracy on the training set but less than 80% on the validation set. After incorporating the PFA modules, training set accuracy does not improve substantially, yet validation set accuracy increases significantly. This observation underscores the benefit of utilizing less precise tensor representations with PFA to mitigate overfitting. Therefore, a good choice for $R$ should not be too large. This conclusion is useful in two ways: it narrows the range in which to search for $R$, and it ensures that PFA can obtain good results without excessive computation.

As shown in our experiment (Figure 8), although the rank of the attention map differs between dynamic datasets like CIFAR10DVS and static datasets like CIFAR100, the best option for $R$ is not very large. Further examining the experimental results, we find that once $R$ exceeds a certain threshold (approximately $\frac{T}{2}$, where $T$ is the number of time steps), the larger $R$ is, the worse the performance becomes. These experimental results support our conclusion. Considering the above analysis, we propose a principle for selecting $R$: for dynamic datasets, the search for the optimal value of $R$ is performed within the range of no more than $T$. Specifically, we focus on identifying the best $R$ value around $\frac{T}{2}$, considering the varying content across different time steps. For static datasets, the search for the optimal $R$ value begins from 1 and extends until the best value is determined. Since static datasets involve duplicated frames, this search aims to find the most suitable $R$ value for the low-rank condition.
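One way to turn this selection principle into a search order is sketched below. The helper `candidate_ranks`, the closest-to-$T/2$-first ordering, and the `static_cap` argument are all hypothetical choices made for illustration; the text above only states the search range for dynamic datasets and the starting point for static ones.

```python
def candidate_ranks(T, dynamic, static_cap=4):
    """Candidate values of R to try, in the order suggested by the principle above.

    dynamic=True : every R in 1..T, ordered by distance from T/2 (closest first).
    dynamic=False: R = 1, 2, ..., static_cap. The cap is a hypothetical stopping
                   point; the text only says to start from 1 and stop once the
                   best value is found.
    """
    if dynamic:
        return sorted(range(1, T + 1), key=lambda r: (abs(r - T / 2), r))
    return list(range(1, static_cap + 1))

print(candidate_ranks(14, dynamic=True))   # starts with 7, 6, 8, ...
print(candidate_ranks(6, dynamic=False))   # [1, 2, 3, 4]
```

In practice each candidate would be evaluated by training the network and comparing validation accuracy, stopping early once accuracy begins to fall.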

4 Experiment

We evaluate the classification performance of our PFA on two static datasets, CIFAR10 [30] and CIFAR100 [30], and two dynamic datasets, CIFAR10DVS [33] and NCaltech-101 [40]. Details of the datasets, network architecture, data augmentation, loss function, and pre-processing procedure are introduced in this section. Extensive experimental results are also presented.

Dataset | Learning Rate | Epochs | Batch Size
CIFAR10 | 0.1 | 200 | 128
CIFAR100 | 0.1 | 200 | 64
CIFAR10DVS | 0.0001 | 200 | 32
NCaltech-101 | 0.0001 | 200 | 32
Table 1: Hyper-parameter settings of PFA.

4.1 Datasets and Training Details

4.1.1 Datasets

We have conducted experiments on both static and dynamic datasets for object classification. We train and validate PFA on a workstation equipped with one RTX 3090. For different datasets, the hyper-parameters are listed in Table 1. The summaries of datasets and augmentation involved in the experiment are listed below.

CIFAR10/100 consist of 50k training images and 10k testing images with the size of $32 \times 32$ [30]. We use ResNet-19 for both CIFAR10 and CIFAR100. Random horizontal flips and crops are applied to the training images for augmentation. Moreover, cutout is also used for augmentation, as in [8].
CIFAR10-DVS converts 10,000 frame-based images of 10 classes into event streams with a dynamic vision sensor. Since the CIFAR10DVS dataset [33] does not provide separate training and testing sets, we split the dataset into 9k training images and 1k test images and reduce the spatial resolution from $128 \times 128$ to $48 \times 48$ [8]. We use VGGSNN for the neuromorphic dataset CIFAR10DVS. In each frame, horizontal flipping and mixup are adopted, where the probability of flipping is set to 0.5. Then, we randomly select one augmentation among rolling, rotation, cutout, and shear, where the random rolling range is 5 pixels and the degree of rotation is sampled from a uniform distribution, the same as in [8].
NCALTECH-101 is also converted from the original version of Caltech-101 [40], with a slight change in object classes to avoid confusion. NCaltech-101 consists of 100 object classes plus one background class. We apply the same 9:1 train-validation split as for CIFAR10DVS. VGGSNN is applied to the neuromorphic dataset NCALTECH-101, and the augmentation is identical to that of CIFAR10DVS.
Fashion-MNIST is a more demanding successor of the famous MNIST dataset. It contains 70,000 static grayscale images of 10 different categories of clothing, each of size $28 \times 28$ pixels. We utilize the simple example network from the SpikingJelly framework, which comprises two convolution layers and a fully-connected layer, to evaluate PFA on this dataset.

4.1.2 Loss function

We use the cross-entropy loss for the neuromorphic datasets CIFAR10DVS and NCALTECH-101. To get better performance on the static datasets (CIFAR10/100), we use the Temporal Efficient Training (TET) [8] loss function, which can be summarized as follows:

$$\mathcal{L}_{\mathrm{TET}} = \frac{1-\lambda}{T} \cdot \sum_{t=1}^{T} \mathcal{L}_{\mathrm{CE}}[\mathbf{O}(t), \boldsymbol{y}] + \frac{\lambda}{T} \cdot \sum_{t=1}^{T} \mathcal{L}_{\mathrm{MSE}}(\mathbf{O}(t), \phi), \tag{16}$$

where $T$ is the total number of time steps, and $\mathcal{L}_{\mathrm{CE}}$ and $\mathcal{L}_{\mathrm{MSE}}$ denote the cross-entropy loss and the mean squared error, respectively. $\boldsymbol{y}$ represents the target label, $\mathbf{O}(t)$ is the output at time step $t$, and $\phi$ is a constant used to regularize the membrane potential distribution. We set $\phi = V_{th}$ in our experiments. In practice, the hyper-parameter $\lambda$ adjusts the proportion of the regularization term.
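For readers who prefer code, Eq. (16) can be written out for a single sample as below. This is a NumPy sketch rather than the training code used in the paper: the function name `tet_loss` is hypothetical, the outputs are treated as pre-softmax scores, and the MSE term is averaged over classes, which is an assumption about the reduction.

```python
import numpy as np

def tet_loss(outputs, label, phi, lam):
    """TET loss of Eq. (16) for one sample.

    outputs: (T, K) array, network output O(t) at each of T time steps
    label:   integer class index y
    phi:     constant regularization target (the text sets phi = V_th)
    lam:     weight lambda of the MSE regularization term
    """
    T = outputs.shape[0]
    # Cross-entropy at each time step via a numerically stable log-softmax.
    shifted = outputs - outputs.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[:, label]                      # shape (T,)
    # Mean squared error between O(t) and phi (mean over classes: an assumption).
    mse = ((outputs - phi) ** 2).mean(axis=1)      # shape (T,)
    return (1 - lam) / T * ce.sum() + lam / T * mse.sum()

# Toy call with random scores, T = 4 time steps and K = 10 classes.
scores = np.random.default_rng(0).normal(size=(4, 10))
print(tet_loss(scores, label=3, phi=1.0, lam=0.05))
```

Setting `lam=0` reduces the expression to the average per-time-step cross-entropy, which makes the role of $\lambda$ as a mixing weight easy to see.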

Figure 6: Schematic diagram of the network structure. On the left is the convolution layer of the VGG network, and on the right is the convolution layer of ResNet.

4.1.3 Network Architectures

To evaluate PFA, we integrate it into VGG-type [47] and ResNet [14] architectures, which were previously used in [12; 70; 8; 63]. Figure 6 presents the basic blocks involved in the VGG and ResNet architectures. Five PFA modules are inserted into a VGG-type network after each average pooling operation. In the ResNet architecture, PFA is inserted before the residual connection in the last two layers. When splitting the event stream data into frames, we simply integrate the input in the time domain. Our work is based on the software framework SpikingJelly [10]. Noting that frames in CIFAR10-DVS and NCaltech-101 are entirely occupied by their content, we use a pooling layer as the first layer of the network to reduce memory consumption.
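The phrase "integrate the input in the time domain" can be illustrated with a small NumPy sketch that accumulates events into $T$ frames. Everything here is an assumption made for illustration: the function name `integrate_events`, the use of equal-duration time bins, and the two-channel polarity layout. The actual preprocessing is done with SpikingJelly and may split events differently.

```python
import numpy as np

def integrate_events(t, x, y, p, T, H, W):
    """Accumulate events into T frames of shape (2, H, W), one channel per polarity.

    t: 1-D array of timestamps; x, y: integer pixel coordinates; p: polarity in {0, 1}.
    Equal-duration time bins are an assumption of this sketch.
    """
    frames = np.zeros((T, 2, H, W), dtype=np.float32)
    span = t.max() - t.min()
    if span > 0:
        # Map each timestamp to a bin index in [0, T - 1].
        idx = np.minimum(((t - t.min()) / span * T).astype(np.int64), T - 1)
    else:
        idx = np.zeros(t.shape, dtype=np.int64)
    # np.add.at performs an unbuffered add, so repeated pixels are counted correctly.
    np.add.at(frames, (idx, p, y, x), 1.0)
    return frames

# Four hypothetical events on a 2x2 sensor, split into T = 2 frames.
t = np.array([0.0, 1.0, 2.0, 3.0])
x = np.array([0, 0, 1, 1])
y = np.array([0, 0, 0, 1])
p = np.array([1, 1, 0, 1])
frames = integrate_events(t, x, y, p, T=2, H=2, W=2)
print(frames.shape, frames.sum())
```

Each output frame then plays the role of one time step of the spike tensor fed to the network.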

4.2 Comparison with Existing SOTA Works

Dataset | Method | Type | Architecture | Time steps | Accuracy (%)
CIFAR10/CIFAR100 | SIDP [44] ICLR-2019 | Hybrid training | ResNet-20/VGGSNN | 250/125 | 95.4/76.4
CIFAR10/CIFAR100 | Diet [43] TNNLS-2021 | Spike-based BP | ResNet-20 | 10/5 | 92.5/64.0
CIFAR10/CIFAR100 | tdBN [68] AAAI-2021 | Spike-based BP | ResNet-19 | 6 | 93.1/71.1
CIFAR10/CIFAR100 | Dspike [34] NeurIPS-2021 | Spike-based BP | ResNet-18 | 6 | 94.2/74.2
CIFAR10/CIFAR100 | NAS-SNN [23] ECCV-2022 | Spike-based BP | NAS | 5 | 92.7/73.0
CIFAR10/CIFAR100 | TET [8] ICLR-2022 | Spike-based BP | ResNet-19 | 6 | 94.5/74.7
CIFAR10/CIFAR100 | DSR [37] CVPR-2022 | Spike-based BP | ResNet-19 | 20 | 95.4/78.5
CIFAR10/CIFAR100 | Spikformer [69] ICLR-2023 | Spike-based BP | Spiking-ViT | 4 | 95.5/78.2
CIFAR10/CIFAR100 | PFA | Spike-based BP | ResNet-19 | 6 | 95.7/79.1
CIFAR10/CIFAR100 | PFA | Spike-based BP | ResNet-19 | 4 | 95.7/78.1
CIFAR10/CIFAR100 | PFA | Spike-based BP | ResNet-19 | 2 | 95.6/76.7
CIFAR10DVS | LIAF-Net [57] TNNLS-2021 | Spike-based BP | LIAF-Net | 10 | 70.4
CIFAR10DVS | tdBN [68] AAAI-2021 | Spike-based BP | ResNet-19 | 10 | 67.8
CIFAR10DVS | TA-SNN [62] ICCV-2021 | Spike-based BP | LIAF-Net | 10 | 72.0
CIFAR10DVS | Dspike [34] NeurIPS-2021 | Spike-based BP | VGGSNN | 10 | 75.4
CIFAR10DVS | TET [8] ICLR-2022 | Spike-based BP | VGGSNN | 10 | 83.2
CIFAR10DVS | DSR [37] CVPR-2022 | Spike-based BP | VGGSNN | 10 | 77.3
CIFAR10DVS | TCJA [70] TNNLS-2024 | Spike-based BP | VGGSNN | 10 | 80.7
CIFAR10DVS | Spikformer [69] ICLR-2023 | Spike-based BP | Spiking-ViT | 16 | 80.9
CIFAR10DVS | PFA | Spike-based BP | VGGSNN | 14 | 84.0
NCaltech-101 | SALT [22] Neural Netw-2021 | Spike-based BP | VGGSNN | 20 | 55.0
NCaltech-101 | TCJA [70] TNNLS-2024 | Spike-based BP | VGGSNN | 14 | 78.5
NCaltech-101 | PFA | Spike-based BP | VGGSNN | 14 | 80.5

Table 2: Comparison with existing works. Our method improves network performance across all tasks.

To validate the effectiveness of our PFA, we conducted a comparative analysis with several state-of-the-art SNNs, including those employing hybrid training and spike-based backpropagation. The results for both static image data and dynamic data classification tasks are summarized in Table 2. Specifically, the results for the static image dataset are presented at time steps 2, 4, and 6.

For the CIFAR10 and CIFAR100 datasets, our PFA outperforms prior works in terms of accuracy. Notably, PFA improves on the TET method [8], which uses the same network architecture and number of time steps, by 1.2% on CIFAR10 and 4.4% on CIFAR100. Furthermore, PFA surpasses the DSR method [37] on CIFAR10 while requiring $10\times$ fewer time steps (2 versus 20). Additionally, our attention-based SNN demonstrates better performance than the Spiking-ViT backbone [69].

Method | Time Steps | Accuracy (%)
ST-RSBP [66] NIPS-2019 | 400 | 90.1
LISNN [6] IJCAI-2020 | 20 | 92.1
PLIF [12] ICCV-2021 | 8 | 94.4
TCJA [70] TNNLS-2024 | 8 | 94.8
PFA (Ours) | 8 | 94.5
Table 3: Accuracy on the Fashion-MNIST [59] dataset.

For the Fashion-MNIST dataset, as shown in Table 3, while the accuracy of our proposed PFA falls slightly short of TCJA [70], it still surpasses previous methods that use more time steps [66; 6] or incorporate learnable membrane time constants [12]. We hypothesize that the relatively small image size of the Fashion-MNIST dataset limits the efficacy of spatial attention. This conjecture aligns with the findings from our ablation experiments, which indicate that spatial attention plays a pivotal role in determining PFA performance.

For the dynamic dataset CIFAR10DVS, our method reaches 84.0% accuracy, outperforming Spikformer [69] by 3.1% with fewer time steps (14 versus 16). For the NCaltech-101 dataset, we achieve 80.5% top-1 accuracy, surpassing the prior work TCJA [70] by 2.0%.

To further illustrate the advantages of our PFA, we compare it with previous works at various time steps. As depicted in Figure 1, we compare PFA with Spike-based BP SNNs. Notably, our PFA exhibits a significant advantage at shorter simulation time steps, owing to its superior representation capability.

From the perspective of simulation time steps, PFA achieves satisfactory performance even with a small number of time steps. On the CIFAR10 dataset, using only 2 time steps, PFA surpasses previous works, including transformer-like structures such as Spikformer. On the CIFAR100 dataset, PFA with 4 time steps lags the top-performing Spikformer by only 0.1% accuracy. On DVS datasets such as CIFAR10DVS and NCaltech-101, using 14 time steps, PFA significantly outperforms previous methods. On CIFAR10DVS, PFA surpasses Spikformer by 3.1% and, compared to the TET method [8] with a similar structure, achieves 0.8% higher accuracy. On the NCaltech-101 dataset, PFA outperforms the previous best method, TCJA, by 2%.

In summary, PFA offers substantial advantages over previous methods. Compared with attention modules like TCJA [70], PFA significantly enhances resistance to overfitting, leading to improved accuracy on the validation set. Although Spikformer [69], which is based on the Spiking-ViT architecture, performs better than many CNN-based techniques, incorporating PFA into traditional CNN architectures like VGG can even outperform it.

4.3 Experiments on Image Generation Tasks

Figure 7: Generated images of the ANN VAE, FSVAE [21], and our PFA on the CIFAR10 [29] dataset.
Dataset | Method | IS ↑ | FID ↓ | FAD ↓
MNIST [7] | ANN [21] AAAI-2022 | 5.95 | 112.5 | 17.09
MNIST [7] | FSVAE [21] AAAI-2022 | 6.21 | 97.06 | 35.54
MNIST [7] | TCJA [70] TNNLS-2024 | 6.45 | 100.8 | 19.39
MNIST [7] | PFA (Ours) | 6.97 | 96.82 | 15.12
Fashion-MNIST [59] | ANN [21] AAAI-2022 | 4.58 | 123.7 | 18.08
Fashion-MNIST [59] | FSVAE [21] AAAI-2022 | 4.55 | 90.12 | 15.75
Fashion-MNIST [59] | TCJA [70] TNNLS-2024 | 5.61 | 93.41 | 12.46
Fashion-MNIST [59] | PFA (Ours) | 5.35 | 97.3 | 11.97
CIFAR10 [29] | ANN [21] AAAI-2022 | 2.59 | 229.6 | 196.9
CIFAR10 [29] | FSVAE [21] AAAI-2022 | 2.94 | 175.5 | 133.9
CIFAR10 [29] | TCJA [70] TNNLS-2024 | 3.73 | 170.1 | 100.4
CIFAR10 [29] | PFA (Ours) | 3.84 | 166.4 | 92.83
Table 4: Comparison with existing methods on image generation tasks.

To further validate the efficacy of PFA, we conduct image generation tasks on MNIST [7], Fashion-MNIST [59], and CIFAR10 [29] using the fully spiking variational autoencoder (FSVAE) [21]. We integrate our proposed PFA as the first layer of FSVAE while keeping all other settings identical to the original network. The experimental results are outlined in Table 4. Notably, PFA achieves the best scores on all three metrics on both the MNIST [7] and CIFAR10 [29] datasets. Additionally, we show the generated images in Figure 7, further corroborating the enhancements brought by PFA.

4.4 Discussion on $R$

Figure 8: Accuracy on the four tested datasets with different $R$ (panels: CIFAR10, CIFAR100, CIFAR10DVS, NCaltech101). PFA achieves its best performance when $R = 3, 1, 8, 8$, respectively.

Figure 8 shows the performance with different $R$. The results reveal that for dynamic datasets, the best rank falls around $\frac{T}{2}$, while for static datasets, the best value of $R$ (the connecting factor) is very small.

Dynamic datasets exhibit significant variations in content across different time steps. Conversely, static datasets involve the duplication of frames, where each frame is replicated $T$ times as input to the network. Consequently, this duplication subjects the attention map to a low-rank condition. Considering both the experimental outcomes and the theoretical insights, we conclude that the attention map does not need to mirror the input tensor faithfully, since such fidelity does not inherently help the network concentrate on the pivotal components. Furthermore, our findings reveal an intriguing phenomenon: even in scenarios where SNNs exhibit considerable sparsity, the rank of the attention map can remain markedly low. This implies that there is untapped optimization potential in SNN sparsity.
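The low-rank effect of frame duplication can be seen directly on the input tensor. The sketch below is an illustration with synthetic data, not an experiment from the paper: random values stand in for a dynamic sample, and the helper `temporal_rank` (a hypothetical name) measures the matrix rank of the tensor unfolded along the time axis.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, H, W = 6, 3, 8, 8

frame = rng.random((C, H, W))
static_input = np.repeat(frame[None], T, axis=0)   # the same frame at every time step
dynamic_input = rng.random((T, C, H, W))           # stand-in for time-varying content

def temporal_rank(x):
    """Matrix rank of the time-mode unfolding, a (T, C*H*W) matrix."""
    return np.linalg.matrix_rank(x.reshape(x.shape[0], -1))

print(temporal_rank(static_input), temporal_rank(dynamic_input))
```

A duplicated input is exactly the outer product of an all-one temporal vector with the frame, so its temporal unfolding has rank 1, whereas time-varying content generally needs more components along the time axis.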

It is worth noting that on the CIFAR10-DVS dataset, when $R = 1$, the performance of PFA is comparable to TCJA (see Table 2) and does not show significant advantages. This phenomenon supports our previous statement that TCJA is a "rank-1" attention, a special case of PFA, and highlights the necessity and advantages of non-"rank-1" attention modules.

4.5 Ablation Study

Figure 9: Ablation study on PFA layers. The x-axis represents the number of PFA layers inserted. The accuracy improves as more PFA layers are inserted.
Figure 10: Ablation study on attention dimensions. T/C/S denote temporal/channel/spatial, respectively. SC denotes spatial-channel joint attention, and so on. Full means attending to all three dimensions at the same time, i.e., a standard PFA module.
Figure 11: Attention visualization on the CIFAR10DVS dataset. Temporal and channel parts are taken from the deep layer of the VGG network. The spatial visualizations are extracted from the shallow layer of the network, because the information contained in deeper layers is inherently complex, which may impede human comprehension and hinder the interpretability of our analysis.

In this section, we perform ablation experiments from both global and local perspectives to validate the efficacy of PFA on CIFAR10DVS [33]. Initially, we vary the number of PFA layers added to the network to assess the impact on classification accuracy. Subsequently, we set the three projections obtained within PFA to constant matrices, with each element assigned a value of 1, in order to eliminate attention to specific dimensions. This helps explore the influence of different attention dimensions on classification results. Finally, for a more comprehensive insight into the influence of PFA on the training process, we conduct ablation experiments specifically involving the insertion of PFA.

We incorporate varying numbers of PFA modules into the VGG network and present the results in Figure 9. In this experiment, the value of $R$ is held constant at 8. The network's accuracy improves noticeably with more PFA layers. Remarkably, even the addition of a single layer leads to enhanced accuracy.

To explore how attention in the temporal, channel, and spatial dimensions affects classification accuracy, we conduct an ablation by fixing the corresponding projection matrix obtained by LPST to an all-one tensor. As shown in Figure 10, ablating any dimension adversely affects the results. Ablating the temporal dimension and ablating the channel dimension have similar effects, which suggests that PFA treats the temporal and channel dimensions comparably. However, when we ablate the spatial dimension, the accuracy of the network drops significantly (to 79.60%) but still exceeds the baseline (77.90%). When we restrict attention to only one dimension (spatial, temporal, or channel), performance decreases further compared to when attention is distributed across two dimensions. Retaining spatial attention has the largest impact on accuracy, followed by temporal attention. This observation highlights that PFA effectively directs attention to distinct regions across multiple dimensions, aligning with the intended purpose of attention modules. Specifically, PFA distributes attention across all three dimensions with varying weights: the spatial dimension is the most important, followed by the temporal and channel dimensions.
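The effect of fixing one projection to an all-one tensor can be sketched with random factor matrices. This is an illustrative NumPy example under assumptions, not the ablation code: the `compose` helper, the (T, C, S) layout with a flattened spatial axis, and the random factors are all hypothetical stand-ins for the LPST outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, S, R = 4, 3, 5, 2
U_t, U_c, U_s = rng.random((T, R)), rng.random((C, R)), rng.random((S, R))

def compose(U_t, U_c, U_s):
    """Attention map as a sum of R outer products, shape (T, C, S)."""
    return np.einsum("ti,ci,si->tcs", U_t, U_c, U_s)

full = compose(U_t, U_c, U_s)
no_temporal = compose(np.ones_like(U_t), U_c, U_s)   # temporal attention ablated

# Largest deviation of any time step from the first one, for both maps.
print(np.abs(full - full[:1]).max(), np.abs(no_temporal - no_temporal[:1]).max())
```

With the temporal projection set to ones, the map is identical at every time step, so it can no longer weight time steps differently; the same reasoning applies to the channel and spatial projections.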

4.6 Attention Visualization

In order to visually represent the attention distribution of the PFA module, we have generated an attention map in PFA and depicted it in Figure 11. The attention maps corresponding to the temporal and channel dimensions in the figure are obtained from the deep layer of the network. This choice was made because deep layers tend to provide a more accurate reflection of the attention’s role. Conversely, the spatial attention map is obtained from the network’s input layer, where the shallow graphic features are more distinct and conspicuous, facilitating analysis and observation. Figure 11 also illustrates the distinctive attention distributions in the temporal, channel, and spatial dimensions achieved by PFA. This finding aligns with the conclusions drawn from our ablation study, further validating our earlier assertions.

5 Conclusions

In this paper, we introduce a novel attention module named PFA, which combines learning and the generation of a temporal-channel-spatial attention map through CP decomposition. Specifically, we propose the LPST to learn a projection of the input tensor that can extract essential information for AMC to compose the attention map. In the AMC, we utilize the reverse process of CP decomposition to generate the attention map. After analyzing the rank of the raw input tensor and the best option for the connecting factor $R$, we propose a principle to search for the best $R$. Remarkably, our findings suggest that even a low rank can lead to outstanding performance. On classification tasks, regardless of whether the dataset is dynamic or static, PFA achieves SOTA performance. When we extend our evaluation of PFA to image generation tasks, it also enhances performance. Furthermore, additional ablation studies bolster the rationale behind PFA's efficacy, while visualizing the attention distribution of PFA offers intuitive insights into its substantial impact.

In the future, we plan to extend the applicability of the PFA model to various demanding intelligent applications and deploy the model on a neuromorphic chip.

References

  • Acar et al. [2011] Evrim Acar, Daniel M. Dunlavy, Tamara G. Kolda, and Morten Mørup. Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1):41–56, March 2011. 10.1016/j.chemolab.2010.08.004. URL https://doi.org/10.1016/j.chemolab.2010.08.004.
  • Babiloni et al. [2023] Francesca Babiloni, Thomas Tanay, Jiankang Deng, Matteo Maggioni, and Stefanos Zafeiriou. Factorized dynamic fully-connected layers for neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1374–1383, 2023.
  • Bohte et al. [2002] Sander M. Bohte, Joost N. Kok, and Han La Poutré. Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing, 48(1-4):17–37, October 2002. 10.1016/s0925-2312(01)00658-0. URL https://doi.org/10.1016/s0925-2312(01)00658-0.
  • Cai et al. [2022] Wuque Cai, Hongze Sun, Rui Liu, Yan Cui, Jun Wang, Yang Xia, Dezhong Yao, and Daqing Guo. A spatial-channel-temporal-fused attention for spiking neural networks, 2022. URL https://arxiv.org/abs/2209.10837.
  • Chen et al. [2020] Wanli Chen, Xinge Zhu, Ruoqi Sun, Junjun He, Ruiyu Li, Xiaoyong Shen, and Bei Yu. Tensor low-rank reconstruction for semantic segmentation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 52–69, Cham, 2020. Springer International Publishing. ISBN 978-3-030-58520-4.
  • Cheng et al. [2021] Xiang Cheng, Yunzhe Hao, Jiaming Xu, and Bo Xu. LISNN: improving spiking neural networks with lateral interactions for robust object recognition. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI'20, 2021. ISBN 9780999241165.
  • Deng [2012] Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012. 10.1109/MSP.2012.2211477.
  • Deng et al. [2021] Shikuang Deng, Yuhang Li, Shanghang Zhang, and Shi Gu. Temporal Efficient Training of Spiking Neural Network via Gradient Re-weighting. In International Conference on Learning Representations (ICLR), 2021.
  • Diehl et al. [2015] Peter U. Diehl, Daniel Neil, Jonathan Binas, Matthew Cook, Shih-Chii Liu, and Michael Pfeiffer. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2015. 10.1109/IJCNN.2015.7280696.
  • Fang et al. [2020] Wei Fang, Yanqi Chen, Jianhao Ding, Ding Chen, Zhaofei Yu, Huihui Zhou, Yonghong Tian, and other contributors. Spikingjelly. https://github.com/fangwei123456/spikingjelly, 2020. Accessed: 2023-01-16.
  • Fang et al. [2021a] Wei Fang, Zhaofei Yu, Yanqi Chen, Tiejun Huang, Timothée Masquelier, and Yonghong Tian. Deep residual learning in spiking neural networks. Advances in Neural Information Processing Systems, 34:21056–21069, 2021a.
  • Fang et al. [2021b] Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian. Incorporating Learnable Membrane Time Constant To Enhance Learning of Spiking Neural Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2661–2671, 2021b.
  • Håstad [1990] Johan Håstad. Tensor rank is NP-complete. Journal of Algorithms, 11(4):644–654, December 1990. 10.1016/0196-6774(90)90014-6. URL https://doi.org/10.1016/0196-6774(90)90014-6.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2016. 10.1109/cvpr.2016.90. URL https://doi.org/10.1109/cvpr.2016.90.
  • Hodgkin and Huxley [1952] A. L. Hodgkin and A. F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4):500–544, August 1952. 10.1113/jphysiol.1952.sp004764. URL https://doi.org/10.1113/jphysiol.1952.sp004764.
  • Hu et al. [2018a] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018a.
  • Hu et al. [2018b] Yangfan Hu, Hua** Tang, and Gang Pan. Spiking deep residual network, 2018b. URL https://arxiv.longhoe.net/abs/1805.01352.
  • Hu et al. [2022] Yifan Hu, Yujie Wu, Lei Deng, and Guoqi Li. Advancing deep residual learning by solving the crux of degradation in spiking neural networks, 2022. URL https://arxiv.longhoe.net/abs/2201.07209.
  • Izhikevich [2003] Eugene M Izhikevich. Simple model of spiking neurons. IEEE Transactions on neural networks, 14(6):1569–1572, 2003.
  • Janzamin et al. [2019] Majid Janzamin, Rong Ge, Jean Kossaifi, and Anima Anandkumar. 2019.
  • Kamata et al. [2022] Hiromichi Kamata, Yusuke Mukuta, and Tatsuya Harada. Fully spiking variational autoencoder. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7059–7067, 2022.
  • Kim and Panda [2021] Youngeun Kim and Priyadarshini Panda. Optimizing Deeper Spiking Neural Networks for Dynamic Vision Sensing. Neural Networks, 144:686–698, 2021.
  • Kim et al. [2022] Youngeun Kim, Yuhang Li, Hyoungseob Park, Yeshwanth Venkatesha, and Priyadarshini Panda. Neural architecture search for spiking neural networks. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pages 36–56. Springer, 2022.
  • Kolda and Bader [2009a] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, August 2009a. 10.1137/07070111x. URL https://doi.org/10.1137/07070111x.
  • Kolda and Bader [2009b] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009b. 10.1137/07070111X. URL https://doi.org/10.1137/07070111X.
  • Kossaifi et al. [2017] Jean Kossaifi, Zachary C. Lipton, Arinbjorn Kolbeinsson, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks, 2017. URL https://arxiv.longhoe.net/abs/1707.08308.
  • Kossaifi et al. [2019] Jean Kossaifi, Antoine Toisoul, Adrian Bulat, Yannis Panagakis, Timothy M. Hospedales, and Maja Pantic. Factorized higher-order cnns with an application to spatio-temporal emotion estimation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6059–6068, 2019. URL https://api.semanticscholar.org/CorpusID:214743123.
  • Kossaifi et al. [2020] Jean Kossaifi, Antoine Toisoul, Adrian Bulat, Yannis Panagakis, Timothy M. Hospedales, and Maja Pantic. Factorized higher-order cnns with an application to spatio-temporal emotion estimation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020. 10.1109/cvpr42600.2020.00610. URL http://dx.doi.org/10.1109/CVPR42600.2020.00610.
  • [29] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Lapique [1907] Louis Lapique. Recherches quantitatives sur l’excitation electrique des nerfs traitee comme une polarization. Journal of Physiology and Pathology, 9:620–635, 1907.
  • Lau et al. [2024] Kin Wai Lau, Lai-Man Po, and Yasar Abbas Ur Rehman. Large separable kernel attention: Rethinking the large kernel attention design in cnn. Expert Systems with Applications, 236:121352, February 2024. ISSN 0957-4174. 10.1016/j.eswa.2023.121352. URL http://dx.doi.org/10.1016/j.eswa.2023.121352.
  • Li et al. [2017] Hongmin Li, Hanchao Liu, ** Shi. CIFAR10-DVS: An event-stream dataset for object classification. Frontiers in Neuroscience, 11, May 2017. 10.3389/fnins.2017.00309. URL https://doi.org/10.3389/fnins.2017.00309.
  • Li et al. [2021] Yuhang Li, Yufei Guo, Shanghang Zhang, Shikuang Deng, Yongqing Hai, and Shi Gu. Differentiable Spike: Rethinking Gradient-Descent for Training Spiking Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 23426–23439, 2021.
  • Liu and Parhi [2023] Xingyi Liu and Keshab K Parhi. Tensor decomposition for model reduction in neural networks: A review [feature]. IEEE Circuits and Systems Magazine, 23(2):8–28, 2023.
  • Maass [1997] Wolfgang Maass. Networks of spiking neurons: The third generation of neural network models. Neural Networks, 10(9):1659–1671, 1997. ISSN 0893-6080. https://doi.org/10.1016/S0893-6080(97)00011-7. URL https://www.sciencedirect.com/science/article/pii/S0893608097000117.
  • Meng et al. [2022] Qingyan Meng, Mingqing Xiao, Shen Yan, Yisen Wang, Zhouchen Lin, and Zhi-Quan Luo. Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation. ArXiv preprint arXiv:2205.00459, 2022.
  • Mnih et al. [2014] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention, 2014. URL https://arxiv.longhoe.net/abs/1406.6247.
  • Novikov et al. [2015] Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry P. Vetrov. Tensorizing neural networks. In Neural Information Processing Systems, 2015. URL https://api.semanticscholar.org/CorpusID:290242.
  • Orchard et al. [2015] Garrick Orchard, A**kya Jayawant, Gregory K. Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in Neuroscience, 9, November 2015. 10.3389/fnins.2015.00437. URL https://doi.org/10.3389/fnins.2015.00437.
  • Qiu et al. [2023] Xue-Rui Qiu, Zhao-Rui Wang, Zheng Luan, Rui-Jie Zhu, Xiao Wu, Ma-Lu Zhang, and Liang-Jian Deng. Vtsnn: A virtual temporal spiking neural network. Frontiers in neuroscience, 17:1091097, 2023.
  • Qiu et al. [2024] Xuerui Qiu, Rui-Jie Zhu, Yuhong Chou, Zhaorui Wang, Liang-jian Deng, and Guoqi Li. Gated attention coding for training high-performance and efficient spiking neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 601–610, 2024.
  • Rathi and Roy [2021] Nitin Rathi and Kaushik Roy. Diet-snn: A low-latency spiking neural network with direct input encoding and leakage and threshold optimization. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2021.
  • [44] Nitin Rathi, Gopalakrishnan Srinivasan, Priyadarshini Panda, and Kaushik Roy. Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. In International Conference on Learning Representations (ICLR).
  • Rueckauer et al. [2017] Bodo Rueckauer, Iulia-Alexandra Lungu, Yuhuang Hu, Michael Pfeiffer, and Shih-Chii Liu. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in Neuroscience, 11, December 2017. 10.3389/fnins.2017.00682. URL https://doi.org/10.3389/fnins.2017.00682.
  • Sidiropoulos et al. [2017] Nicholas D. Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E. Papalexakis, and Christos Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, 2017. 10.1109/TSP.2017.2690524.
  • Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2015.
  • Wang et al. [2023] Qingsong Wang, Zehui Liu, Chunfeng Cui, and Deren Han. Inertial accelerated sgd algorithms for solving large-scale lower-rank tensor cp decomposition problems. Journal of Computational and Applied Mathematics, 423:114948, 2023.
  • Wang et al. [2022] Yuchen Wang, Malu Zhang, Yi Chen, and Hong Qu. Signed neuron with memory: Towards simple, accurate and high-efficient ann-snn conversion. In International Joint Conference on Artificial Intelligence, 2022.
  • Wei et al. [2023] Wenjie Wei, Malu Zhang, Hong Qu, Ammar Belatreche, Jian Zhang, and Hong Chen. Temporal-coded spiking neural networks with dynamic firing threshold: Learning with event-driven backpropagation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10552–10562, 2023.
  • Wei et al. [2024] Wenjie Wei, Malu Zhang, Jilin Zhang, Ammar Belatreche, Jibin Wu, Zi**g Xu, Xuerui Qiu, Hong Chen, Yang Yang, and Haizhou Li. Event-driven learning for spiking neural networks. arXiv preprint arXiv:2403.00270, 2024.
  • Wu et al. [2021a] Jibin Wu, Yansong Chua, Malu Zhang, Guoqi Li, Haizhou Li, and Kay Chen Tan. A tandem learning rule for effective training and rapid inference of deep spiking neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2021a.
  • Wu et al. [2021b] Jibin Wu, Chenglin Xu, Xiao Han, Daquan Zhou, Malu Zhang, Haizhou Li, and Kay Chen Tan. Progressive tandem learning for pattern recognition with deep spiking neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7824–7840, 2021b.
  • Wu et al. [2018] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Lu** Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in Neuroscience, 12, May 2018. 10.3389/fnins.2018.00331. URL https://doi.org/10.3389/fnins.2018.00331.
  • Wu et al. [2019a] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Yuan ** Shi. Direct training for spiking neural networks: Faster, larger, better. Proceedings of the AAAI Conference on Artificial Intelligence, 33:1311–1318, July 2019a. 10.1609/aaai.v33i01.33011311. URL https://doi.org/10.1609/aaai.v33i01.33011311.
  • Wu et al. [2019b] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Yuan ** Shi. Direct Training for Spiking Neural Networks: Faster, Larger, Better. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 1311–1318, 2019b. 10.1609/aaai.v33i01.33011311.
  • Wu et al. [2021c] Zhenzhi Wu, Hehui Zhang, Yihan Lin, Guoqi Li, Meng Wang, and Ye Tang. LIAF-Net: Leaky Integrate and Analog Fire Network for Lightweight and Efficient Spatiotemporal Information Processing. IEEE Transactions on Neural Networks and Learning Systems, pages 1–14, 2021c. 10.1109/TNNLS.2021.3073016.
  • Wu et al. [2022] Zhong-Cheng Wu, Ting-Zhu Huang, Liang-Jian Deng, Hong-Xia Dou, and Deyu Meng. Tensor wheel decomposition and its tensor completion application. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 27008–27020. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/acbfe708197ff78ad04cc1beb1710979-Paper-Conference.pdf.
  • Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ArXiv, abs/1708.07747, 2017. URL https://api.semanticscholar.org/CorpusID:702279.
  • Xu et al. [2023] Zhihao Xu, Zhiqiang Lv, Benjia Chu, and Jianbo Li. Fast autoregressive tensor decomposition for online real-time traffic flow prediction. Knowledge-Based Systems, 282:111125, December 2023. ISSN 0950-7051. 10.1016/j.knosys.2023.111125. URL http://dx.doi.org/10.1016/j.knosys.2023.111125.
  • Yang and Hospedales [2016] Yongxin Yang and Timothy M. Hospedales. Deep multi-task representation learning: A tensor factorisation approach. ArXiv, abs/1605.06391, 2016. URL https://api.semanticscholar.org/CorpusID:3047732.
  • Yao et al. [2021] Man Yao, Huanhuan Gao, Guangshe Zhao, Dingheng Wang, Yihan Lin, Zhaoxu Yang, and Guoqi Li. Temporal-wise attention spiking neural networks for event streams classification. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, October 2021. 10.1109/iccv48922.2021.01006. URL https://doi.org/10.1109/iccv48922.2021.01006.
  • Yao et al. [2023] Man Yao, Guangshe Zhao, Hengyu Zhang, Yifan Hu, Lei Deng, Yonghong Tian, Bo Xu, and Guoqi Li. Attention spiking neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • Zhan et al. [2023] Qiugang Zhan, Guisong Liu, Xiurui Xie, Malu Zhang, and Guolin Sun. Bio-inspired active learning method in spiking neural network. Knowledge-Based Systems, 261:110193, February 2023. ISSN 0950-7051. 10.1016/j.knosys.2022.110193. URL http://dx.doi.org/10.1016/j.knosys.2022.110193.
  • Zhang et al. [2021] Malu Zhang, Jiadong Wang, Jibin Wu, Ammar Belatreche, Burin Amornpaisannon, Zhixuan Zhang, Venkata Pavan Kumar Miriyala, Hong Qu, Yansong Chua, Trevor E Carlson, et al. Rectified linear postsynaptic potential function for backpropagation in deep spiking neural networks. IEEE transactions on neural networks and learning systems, 33(5):1947–1958, 2021.
  • Zhang and Li [2019] Wenrui Zhang and Peng Li. Spike-train level backpropagation for training deep recurrent spiking neural networks. Curran Associates Inc., Red Hook, NY, USA, 2019.
  • Zheng et al. [2020] Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li. Going deeper with directly-trained larger spiking neural networks, 2020. URL https://arxiv.longhoe.net/abs/2011.05280.
  • Zheng et al. [2021] Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li. Going Deeper With Directly-Trained Larger Spiking Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 11062–11070, 2021.
  • Zhou et al. [2023] Zhaokun Zhou, Yuesheng Zhu, Chao He, Yaowei Wang, Shuicheng YAN, Yonghong Tian, and Li Yuan. Spikformer: When spiking neural network meets transformer. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=frE4fUwz_h.
  • Zhu et al. [2022] Rui-Jie Zhu, Qihang Zhao, Tian**g Zhang, Haoyu Deng, Yule Duan, Malu Zhang, and Liang-Jian Deng. Tcja-snn: Temporal-channel joint attention for spiking neural networks, 2022. URL https://arxiv.longhoe.net/abs/2206.10177.