Direct Training Needs Regularisation:
Anytime Optimal Inference Spiking Neural Network

Dengyu Wu1, Yi Qi1, Kaiwen Cai1, Gaojie **2, ** Yi3, Xiaowei Huang1
University of Liverpool, Liverpool, UK1
State Key Laboratory of Computer Science, Institute of Software, CAS, Beijing, China2
Southeast University, Nanjing, China3
{dengyu.wu, xiaowei.huang}@liverpool.ac.uk
Abstract

Spiking Neural Network (SNN) is acknowledged as the next generation of Artificial Neural Network (ANN) and holds great promise in effectively processing spatial-temporal information. However, the choice of timestep becomes crucial as it significantly impacts the accuracy of the neural network training. Specifically, a smaller timestep improves computing efficiency, reducing latency and operations, but it may lower accuracy because too few spikes are available to represent the information. This observation motivates us to develop an SNN that is more reliable for adaptive timestep by introducing a novel regularisation technique, namely Spatial-Temporal Regulariser (STR). Our approach regulates the ratio between the strength of spikes and membrane potential at each timestep. This effectively balances spatial and temporal performance during training, ultimately resulting in an Anytime Optimal Inference (AOI) SNN. Through extensive experiments on frame-based and event-based datasets, our method, in combination with cutoff based on softmax output, achieves state-of-the-art performance in terms of both latency and accuracy. Notably, with STR and cutoff, SNN achieves 2.14 to 2.89 times faster inference compared to the pre-configured timestep, with a near-zero accuracy drop of 0.50% to 0.64% over the event-based datasets. Code available: https://github.com/Dengyu-Wu/AOI-SNN-Regularisation

1 Introduction

Spiking Neural Network (SNN) aims to mimic the behavior of biological neurons in the brain, efficiently processing spatial-temporal information through the use of their inherent dynamics, such as the integration and firing process Maass (1997); Rueckauer et al. (2017); Pfeiffer and Pfeil (2018); Wu et al. (2022). For instance, the integrated membrane potential of SNN retains information from previous timesteps and enables effective processing of temporal information Yao et al. (2021); Yin et al. (2021). Similarly, the generated spikes activate post neurons, allowing them to efficiently propagate the current information through the network, where neurons are triggered sparsely upon receiving spikes. This activation mechanism differs from Artificial Neural Network (ANN), which relies on dense multiplications for forward propagation. Because neurons in SNN are only activated when receiving spikes, computations are sparse and remarkably efficient. Given this characteristic, SNN is particularly well-suited for implementation on emerging neuromorphic hardware platforms, such as TrueNorth Akopyan et al. (2015), Loihi Davies et al. (2018), and Tianjic Pei et al. (2019), which have empowered SNN to leverage its inherent event-driven nature at the hardware level. This development holds great promise for enabling energy-efficient applications, such as real-time audio denoising Timcheck et al. (2023), low-power gesture recognition Amir et al. (2017) and robotic control Tang et al. (2021).

The rapid progress in SNN has been fueled by the pursuit of energy-efficient and high-performance computing solutions. In the field of neuromorphic computing, the primary focus of algorithm optimisation for SNN has been on improving accuracy Datta et al. (2022); Kulkarni and Rajendran (2018); Taherkhani et al. (2020). From the perspective of the characteristics of SNN, the total number of inference timesteps determines the computing efficiency Rueckauer et al. (2017); Wu et al. (2022). Thus, efforts to reduce the inference complexity of SNN while maintaining accuracy have been ongoing. Techniques such as optimising SNN through training Deng et al. (2022); Duan et al. (2022); Bu et al. (2022) have shown promising results in enhancing the computing efficiency of SNN. Despite these advancements, there is still room for exploring a more adaptive and flexible inference process, as current methods primarily focus on optimising SNN for a pre-configured timestep. Lately, the concept of anytime inference for SNN has garnered increasing attention, as evidenced by recent works Wu et al. (2023); Li et al. (2023b); Chen et al. (2023); Li et al. (2023a). This growing interest highlights a novel direction of efficient computing for SNN. Meanwhile, Wu et al. (2023) suggested that optimising SNN through both training and inference aspects helps achieve Anytime Optimal Inference (AOI). Precisely, a regularisation technique was introduced to train SNN to be optimal for anytime inference.

However, during SNN training, there emerges a delicate balance between optimising the current timestep and considering its potential impact on subsequent ones. As SNN predictions are interconnected, concentrating on minimising loss at one timestep might inadvertently lead to increased loss at the others. This balance highlights the intricate nature of optimising SNN for anytime inference. Simply optimising the average output Wu et al. (2019); Fang et al. (2021a, b) no longer suffices for achieving anytime optimality, as it lacks constraints at each timestep. While the approach of temporal efficient training Deng et al. (2022) comes close to an AOI model by aligning predictions at each timestep closer to the ground truth, it still relies on the average loss across timesteps and grapples with the challenge of harmonising the trade-off between spatial and temporal performance.

In this paper, we are interested in optimising SNN for anytime inference through direct training. To achieve this, we introduce a novel regularisation technique that diminishes the influence of the present timestep on the next timestep, thereby yielding SNN capable of providing more reliable predictions across the timesteps. Our key contributions include:

  • Introducing the concept of the spatial-temporal factor that helps understand the contributions of spatial and temporal information in SNN.

  • Proposing a regularisation technique that dynamically adjusts the spatial-temporal factor during training for enhancing the accuracy at present timestep.

  • Validating our approach with extensive experiments, including uncertainty estimation and cutoff results.

Through these contributions, we aim to build a more efficient and accurate SNN for anytime inference. This entails achieving a lower average timestep for the SNN while concurrently maintaining a high level of accuracy.

2 Related work

Recent research has extensively explored ways to reduce the inherent complexity associated with the inference process of SNN. A significant focus within this research landscape is to train SNN that operates at small timesteps Deng and Gu (2021); Li et al. (2021a); Bu et al. (2022); Duan et al. (2022). Another growing avenue involves the study of adaptive timesteps, providing an alternative way to reduce computing operations by lowering the average timestep Wu et al. (2023); Li et al. (2023a); Chen et al. (2023). Both paths exploit the sparsity and dynamics of SNN to achieve efficient computing.

Spiking Network Training

One such direction involves optimising the training of SNN to achieve better efficiency. For instance, reducing the timestep during inference can significantly improve computing efficiency, as the total timestep determines the overall computational operations. This has been achieved by adding temporal batch normalisation Zheng et al. (2021); Duan et al. (2022), improving surrogate gradient Wu et al. (2018, 2019); Neftci et al. (2019); Li et al. (2021a), optimising the loss function for temporal training Deng et al. (2022), and minimising the distance between ANN and SNN activation for ANN-to-SNN conversion Deng and Gu (2021); Wu et al. (2022); Bu et al. (2022). Another line of investigation focuses on exploring efficient architectures for SNN, such as designing novel spike-based architectures Fang et al. (2021a); Zhou et al. (2023) and deploying Network Architecture Search (NAS) in SNN Kim et al. (2022a, b). In addition, quantisation techniques Schaefer and Joshi (2020); Putra and Shafique (2021); Li et al. (2022a), which aim to convert resource-intensive floating-point operations into more efficient integer operations, have also been explored to enhance the efficiency of SNN. Furthermore, Lu and Sengupta (2020) argue that SNN can further benefit from sequential and binarised activation to improve binary network accuracy.

Anytime Optimal Inference

In the realm of SNN, the exploration of anytime inference is still in its early stages and has received relatively limited attention so far. Nonetheless, a few notable studies have begun to enhance anytime inference in SNN. For example, Li et al. (2023b) introduced an auxiliary network to predict confidence for early exiting. Similarly, Chen et al. (2023) studied the output distribution and integrated conformal prediction Angelopoulos and Bates (2021) for adaptive inference. For conversion-based SNN, Li et al. (2023a) calibrated output confidence across the timesteps, while Wu et al. (2023) suggested that the gap between the largest and second-largest outputs can efficiently predict the cutoff time. While these studies efficiently trigger anytime inference in SNN, they have focused on addressing data uncertainty over the inference rather than optimising uncertainty within the SNN model itself.

3 Preliminary

In this section, we introduce the neuron model and direct training of SNN. To facilitate the analysis, we use bold symbols to represent vectors, $l$ to denote the layer index, and $i$ to denote the index of elements. For example, $\boldsymbol{W}^{l}$ is the weight matrix at the $l$-th layer, and $t$ denotes the discrete timestep.

3.1 Leaky Integrate-and-Fire model

Figure 1: (a) Forward propagation in SNN. The input events $\boldsymbol{X}(t)$ stimulate neurons to generate spikes over time. The output $f(\boldsymbol{X}(t))$ can respond when a sufficient number of events are received within a specific time window. $\boldsymbol{\theta}(t)$ and $\tau\boldsymbol{\Delta}(t)$ represent the spatial and temporal information at $t$, respectively. (b) The state update of one LIF neuron at the input layer during the forward propagation process. The weight $W_{i}^{l}$ influences the current contributing to the membrane potential $V_{i}^{l}(t)$. The threshold $V^{l}_{thr}$ determines the level of $V_{i}^{l}(t)$ required to generate spikes.

The leaky integrate-and-fire (LIF) model is widely adopted in the study of SNN due to its simplicity and biological plausibility. The forward propagation in SNN is shown in Figure 1. The iterative equation of the LIF model in forward propagation can be expressed as follows:

$$\boldsymbol{V}^{l}(t)=\tau^{l}\boldsymbol{\Delta}^{l}(t-1)+\boldsymbol{Z}^{l}(t), \tag{1}$$

where $\boldsymbol{V}^{l}(t)$ represents the membrane potential at layer $l$ and timestep $t$ prior to spike firing, $\tau^{l}$ denotes the decay factor, and $\boldsymbol{\Delta}^{l}(t)=(1-\boldsymbol{\theta}^{l}(t))\cdot\boldsymbol{V}^{l}(t)$ is the residual current after spike firing, with $\theta_{i}^{l}(t)=1$ if $V_{i}^{l}(t)\geq V^{l}_{thr}$ and $\theta_{i}^{l}(t)=0$ otherwise. Furthermore, $\boldsymbol{Z}^{l}(t)$ denotes the current input and is defined as:

$$\boldsymbol{Z}^{l}(t)=\boldsymbol{W}^{l}\boldsymbol{\theta}^{l-1}(t)+\boldsymbol{b}^{l} \quad \text{when } l>1. \tag{2}$$

Depending on the input type, $\boldsymbol{Z}^{l}(t)$ at the first layer, i.e., $\boldsymbol{Z}^{1}(t)$, can be initialised as:

$$\boldsymbol{Z}^{1}(t)=\begin{cases}\boldsymbol{W}^{1}\boldsymbol{X}(t)+\boldsymbol{b}^{1} & \text{event-based input}\\ \boldsymbol{W}^{1}\bar{\boldsymbol{X}}+\boldsymbol{b}^{1} & \text{frame-based input,}\end{cases} \tag{3}$$

where $\boldsymbol{X}(t)$ is the integration of events at the $t$-th timestep and $\bar{\boldsymbol{X}}$ represents the constant current stimulus to the first layer, which equals the analogue values of the input. Note that for frame-based input, $\boldsymbol{Z}^{1}(t)$ is the same at every $t$. To simplify the analysis, we consider only event-based input in the following sections.
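
The update in Equation 1, together with the firing rule and the residual defined above, can be sketched for a single layer as follows. This is a minimal illustration in plain Python rather than the released implementation: the function name `lif_step`, the constant input current, and the values of the decay factor and threshold are assumptions chosen only to make the dynamics visible.

```python
from typing import List, Tuple


def lif_step(
    delta_prev: List[float],
    z: List[float],
    tau: float,
    v_thr: float,
) -> Tuple[List[float], List[int], List[float]]:
    """One LIF update for a single layer (Equation 1 plus the firing rule)."""
    # Membrane potential before firing: V(t) = tau * Delta(t-1) + Z(t)
    v = [tau * d + zi for d, zi in zip(delta_prev, z)]
    # Spike indicator: theta_i(t) = 1 if V_i(t) >= V_thr, else 0
    theta = [1 if vi >= v_thr else 0 for vi in v]
    # Residual after firing: Delta(t) = (1 - theta(t)) * V(t)
    delta = [(1 - s) * vi for s, vi in zip(theta, v)]
    return v, theta, delta


# Two neurons driven by a constant (illustrative) input current for three timesteps.
delta = [0.0, 0.0]
spike_trains = []
for _ in range(3):
    v, theta, delta = lif_step(delta, z=[0.6, 0.2], tau=0.5, v_thr=1.0)
    spike_trains.append(theta)
```

In this toy run the first neuron accumulates enough potential to fire at the third timestep, after which its residual is reset to zero, while the second neuron stays silent and keeps carrying its residual forward.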

3.2 Direct Training

Our approach leverages the backpropagation method to train SNN directly. This strategy considers the states of spiking neuron at each timestep during training and has demonstrated the potential to achieve high-performance SNN, particularly when operating with small timesteps. A prevalent optimisation objective aims to minimise the distance between the average output and ground truth, as explored in prior works Fang et al. (2021a, b); Duan et al. (2022). The loss function is defined as:

$$L_{mean}=L_{ce}\left(\frac{1}{T}\sum_{t}^{T}f(\boldsymbol{X}(t)),\boldsymbol{y}\right), \tag{4}$$

where $T$ is the maximum timestep, $L_{ce}$ represents the cross entropy loss, $f(\cdot)$ is the SNN model, $f(\boldsymbol{X}(t))$ denotes the synaptic current output at the $t$-th timestep, and $\boldsymbol{y}$ is the ground truth. However, recent work Deng et al. (2022) proposes an alternative approach, Temporal Efficient Training (TET), to train SNN. TET suggests that employing the average of cross entropy over all timesteps can lead to improved SNN performance, described as:

$$L_{TET}=\frac{1}{T}\sum_{t}^{T}L_{ce}(f(\boldsymbol{X}(t)),\boldsymbol{y}). \tag{5}$$
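
The difference between Equations 4 and 5 can be made concrete with a small numerical sketch. The code below is an illustration under stated assumptions, not the training code: the helper names (`softmax`, `cross_entropy`, `loss_mean`, `loss_tet`) and the toy outputs are hypothetical, and the per-timestep outputs are treated as logits fed to a softmax cross entropy.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    return -math.log(softmax(logits)[label])


def loss_mean(outputs: List[List[float]], label: int) -> float:
    """Equation 4: cross entropy of the time-averaged output."""
    T = len(outputs)
    avg = [sum(col) / T for col in zip(*outputs)]
    return cross_entropy(avg, label)


def loss_tet(outputs: List[List[float]], label: int) -> float:
    """Equation 5: average of the per-timestep cross entropies."""
    return sum(cross_entropy(o, label) for o in outputs) / len(outputs)


# outputs[t][k]: output for class k at timestep t (toy values, T = 2, two classes)
outputs = [[-2.0, 2.0], [4.0, 0.0]]
label = 0
l_mean = loss_mean(outputs, label)
l_tet = loss_tet(outputs, label)
```

Here the first timestep is confidently wrong and the second confidently right. Their average is uninformative, so $L_{mean}$ only sees a moderate loss of $\ln 2$, whereas $L_{TET}$ is dominated by the wrong first timestep. Since softmax cross entropy is convex in the logits, $L_{TET}$ is never smaller than $L_{mean}$ for the same outputs.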

Since the firing process is non-differentiable, surrogate gradient methods, such as linear type Esser et al. (2015); Wu et al. (2018, 2019) and non-linear type Zenke and Ganguli (2018); Li et al. (2021a); Shrestha and Orchard (2018), are employed in direct training.
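
To illustrate the idea of a surrogate gradient, the sketch below pairs the non-differentiable firing function used in the forward pass with a triangular stand-in for its derivative in the backward pass. The triangular shape and the `width` parameter are one illustrative choice of a linear-type surrogate; they are assumptions and not necessarily the form used in the cited works or in our experiments.

```python
def spike(v: float, v_thr: float = 1.0) -> float:
    """Forward pass: Heaviside step, non-differentiable at v_thr."""
    return 1.0 if v >= v_thr else 0.0


def surrogate_grad(v: float, v_thr: float = 1.0, width: float = 1.0) -> float:
    """Backward pass: triangular stand-in for d(spike)/dv.

    Peaks at 1/width when v == v_thr and decays linearly to zero
    once |v - v_thr| reaches width.
    """
    return max(0.0, 1.0 - abs(v - v_thr) / width) / width


# Surrogate gradient at several membrane potentials around the threshold.
grads = [surrogate_grad(v) for v in (0.0, 0.5, 1.0, 1.5, 2.5)]
```

Neurons whose membrane potential is close to the threshold receive the largest gradient, while neurons far from it receive none, which is what allows backpropagation to pass through the firing function.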

Figure 2: Comparison of accuracy with respect to timestep using different loss functions on Cifar10-DVS.

Training Methods and AOI

Figure 2 presents accuracy results across all timesteps using two loss functions, i.e., $L_{mean}$ and $L_{TET}$, on Cifar10-DVS Li et al. (2017). The training of these two models follows the same strategy specified in Section 5.2. It is not surprising that $L_{mean}$ exhibits limited capability in achieving AOI during inference. For example, the accuracy of the model trained with $L_{mean}$ drops significantly when the timestep is small. This type of training does not train each timestep to yield accurate predictions; instead, it prioritises minimising the loss based on the average output, typically computed at the last timestep. Acknowledging the reliability of TET in achieving AOI, we regularise SNN training through TET in our study.
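
A per-timestep accuracy curve such as the one in Figure 2 requires a prediction at every timestep. The sketch below shows one plausible readout, assuming that the prediction at timestep $t$ is the arg max of the mean output accumulated up to $t$; this readout, the function name `anytime_predictions`, and the toy outputs are assumptions for illustration only.

```python
from typing import List


def anytime_predictions(outputs: List[List[float]]) -> List[int]:
    """Predicted class after each timestep, read from the running mean output."""
    running = [0.0] * len(outputs[0])
    preds = []
    for t, out in enumerate(outputs, start=1):
        running = [r + o for r, o in zip(running, out)]
        mean = [r / t for r in running]
        # arg max over classes of the mean output up to timestep t
        preds.append(max(range(len(mean)), key=mean.__getitem__))
    return preds


# outputs[t][k]: output for class k at timestep t (toy values, three classes)
outputs = [[0.2, 0.5, 0.1], [0.9, 0.1, 0.0], [0.8, 0.3, 0.1]]
preds = anytime_predictions(outputs)
```

If the true class is 0, this toy sample is misclassified when inference stops after the first timestep and is correct from the second timestep onwards, which is the kind of early-timestep error that an AOI model aims to reduce.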

4 Method

In this section, we present the details of the two-fold approach that helps SNN to achieve AOI. Firstly, we introduce the Spatial-Temporal Factor (STF) to help better understand how SNN utilise information over different timesteps during inference. Secondly, we propose the Spatial-Temporal Regulariser (STR) to encourage SNN to prioritise the present timestep rather than relying solely on the next timestep to achieve minimal loss during training.

4.1 Spatial-Temporal Factor

To gain a deeper understanding of the forward propagation process of SNN, we decompose the vector $\boldsymbol{V}^{l}(t)$ into two orthogonal components. This decomposition facilitates a detailed analysis of the individual contributions made by these components, unveiling the underlying mechanisms involved in information processing within the network. By dissecting the vector $\boldsymbol{V}^{l}(t)$ in this manner, we can explore how each component influences the dynamics and transformations of information in the SNN. Mathematically, the decomposition is formulated as:

$$\boldsymbol{V}^{l}(t)=V_{thr}^{l}\boldsymbol{\theta}^{l}(t)+\tau\boldsymbol{\Delta}^{l}(t), \tag{6}$$

where the clipped value of $\boldsymbol{V}^{l}(t)$ is ignored as it does not contribute to either the current or the next timestep.

Building on this decomposition, we introduce the Spatial-Temporal Factor (STF), symbolised as $\xi^{l}(t)$, to assess the interplay of $V_{thr}^{l}\boldsymbol{\theta}^{l}(t)$ and $\tau\boldsymbol{\Delta}^{l}(t)$ on the network. Specifically, the STF is defined as the ratio between the L2 norms of these two elements, offering an approximate estimation of their impacts on the current or future states of the network. This is encapsulated in the following expression:

$$\xi^{l}(t)=\tilde{\alpha}\frac{\lVert\boldsymbol{\theta}^{l}(t)\rVert_{2}}{\lVert\boldsymbol{\Delta}^{l}(t)\rVert_{2}}, \tag{7}$$

where $\lVert\cdot\rVert_{2}$ is the L2 norm, i.e., $\lVert\boldsymbol{x}\rVert_{2}=\sqrt{\sum_{i}x_{i}^{2}}$. We define $\tilde{\alpha}$ as a consolidation of the constants $V_{thr}^{l}$ and $\tau$, which streamlines the equation and focuses attention on the relationship between $\lVert\boldsymbol{\theta}^{l}(t)\rVert_{2}$ and $\lVert\boldsymbol{\Delta}^{l}(t)\rVert_{2}$. Given that $\tilde{\alpha}$ is a constant, it can be seamlessly integrated into the hyper-parameter $\alpha$ in Equation 11. This formulation allows us to focus on the optimisation objective that encompasses both spatial and temporal information within a single term. The influence of regularisation on Equation 7 is visually demonstrated in Figure 3, with a detailed exploration of the regularisation technique presented in Section 4.2.
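
As a worked instance of Equation 7, the sketch below computes the STF of one layer at one timestep from a spike vector and a residual vector. It is a minimal illustration: the function names, the default $\tilde{\alpha}=1$, and the small `eps` guard against an all-zero residual are assumptions and are not part of Equation 7.

```python
import math
from typing import List


def l2_norm(x: List[float]) -> float:
    return math.sqrt(sum(xi * xi for xi in x))


def stf(theta: List[int], delta: List[float], alpha_tilde: float = 1.0,
        eps: float = 1e-12) -> float:
    """Equation 7: alpha_tilde * ||theta||_2 / ||Delta||_2.

    eps is a hypothetical guard against division by zero and is not
    part of Equation 7.
    """
    return alpha_tilde * l2_norm(theta) / (l2_norm(delta) + eps)


theta = [1, 0, 1, 0]            # two of four neurons fired
delta = [0.0, 0.3, 0.0, 0.4]    # residual potential of the two silent neurons
xi = stf(theta, delta)
```

With two spikes, $\lVert\boldsymbol{\theta}\rVert_{2}=\sqrt{2}$, and the residual has norm $0.5$, so $\xi\approx 2.83$. More spikes or smaller residuals increase the STF, meaning the layer passes more information forward in space at this timestep and defers less of it to later timesteps.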

4.2 Spatial-Temporal Regularisation

Figure 3: Visualisation of STF $\xi^{l}(t)$ at the 8th layer in two models, before and after regularisation, on Cifar10-DVS. Regularisation leads to a reduction in the variance of $\xi^{l}(t)$ from 0.0025 to 0.0022, indicating enhanced stability across timesteps. Additionally, the mean value of $\xi^{l}(t)$ rises from 0.2736 to 0.3336, reflecting an enhancement in the representation of spatial information $\lVert\boldsymbol{\theta}^{l}(t)\rVert_{2}$.

This section delves into the design of a regulariser based on the STF. Our prior analysis explained that TET prompts the network to minimise average loss without considering the sequential input order, leading to uncertain predictions at each timestep. To address this, we propose increasing the STF to decouple the influence of temporal information from network training. Nonetheless, this endeavor poses challenges, particularly in determining appropriate STF values for each layer without compromising accuracy. The right side of Figure 4 shows the STF distribution across different layers for both correct and wrong predictions. On the left side, we provide additional insights based on average values of STF with respect to the layer index. One noticeable observation is the progressive increase in STF as the layers grow deeper. Additionally, our experimental findings uncover a noteworthy discrepancy in STF values between correct and wrong predictions, particularly in the deeper layers. This observation indicates that wrong predictions tend to demonstrate lower STF values within these deeper layers. This intriguing phenomenon suggests that the SNN is actively involved in dampening spike occurrences when faced with difficult inputs.

(a) Baseline STF
(b) Regularised STF
Figure 4: Comparison of average STF $\bar{\xi}^{l}$ (over timesteps) on Cifar10-DVS in distinguishing correct vs. wrong predictions. (a) illustrates $\bar{\xi}^{l}$ using the baseline method, while (b) shows $\bar{\xi}^{l}$ after applying the regularisation technique. In both cases, the left side shows the distribution of STF over different layers, and the right side displays the average STF values for correct and wrong predictions, alongside the gap value between them.

To increase STF without sacrificing accuracy, our goal is to eliminate the worst case, in which the STF is relatively small, during training. Thus, our regularisation only considers correct predictions during training, achieved by masking $\xi^{l}(t)$ as follows:

$$\tilde{\xi}^{l}(t)=\begin{cases}\xi^{l}(t) & \text{if correct prediction}\\ 0 & \text{else}.\end{cases} \tag{8}$$

Next, we assume that in an AOI-SNN, each timestep should contribute equally. To formalise this concept, we define $\boldsymbol{\Xi}^{l}$ as the set of $\tilde{\xi}^{l}(t)$ over all possible timesteps, expressed as:

$$\boldsymbol{\Xi}^{l}=[\tilde{\xi}^{l}(1),\tilde{\xi}^{l}(2),...,\tilde{\xi}^{l}(T)]^{\top}. \tag{9}$$

Then, STR is formulated as:

$$R(\boldsymbol{\Xi}^{l})=(\tilde{\xi}^{l}_{min}-\tilde{\xi}^{l}_{max})^{2}, \tag{10}$$

where $\tilde{\xi}_{min}^{l}$ and $\tilde{\xi}_{max}^{l}$ are the minimum and maximum values, respectively, in a mini-batch of $\boldsymbol{\Xi}^{l}$. Note that both values are non-zero so that the incorrect samples can be excluded during regularisation. We set $\tilde{\xi}^{l}_{max}$ to a relatively optimal value, because it is generally large while still ensuring correct predictions. To consider the STR across the total $L$ layers and adjust the loss function with the hyper-parameter $\alpha$, the final objective function becomes:

$$L_{TET}+\alpha\sum_{l}^{L}R(\boldsymbol{\Xi}^{l}). \tag{11}$$
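As a rough illustration of Equations 10 and 11, the sketch below computes the STR term for each layer from a list of ratio values and adds it to a given TET loss. It is a minimal plain-Python sketch, not the released implementation: the function names `str_penalty` and `total_objective` are hypothetical, the ratio values are assumed to be already computed for the mini-batch, and gradient handling is omitted.

```python
def str_penalty(xi_values):
    """R(Xi^l) = (xi_min - xi_max)^2 for one layer (Equation 10).

    Zero entries are skipped so that both extremes are non-zero,
    as required in the text. `xi_values` is assumed to hold the ratio
    values of one layer collected over a mini-batch.
    """
    nonzero = [x for x in xi_values if x != 0.0]
    if len(nonzero) < 2:
        # Assumption: with fewer than two non-zero values there is no gap to penalise.
        return 0.0
    return (min(nonzero) - max(nonzero)) ** 2


def total_objective(l_tet, xi_per_layer, alpha):
    """L_TET + alpha * sum over layers of R(Xi^l) (Equation 11)."""
    return l_tet + alpha * sum(str_penalty(xi) for xi in xi_per_layer)


# Two layers; the first has a gap of 0.3 between its non-zero extremes.
layers = [[0.2, 0.0, 0.5, 0.4], [0.3, 0.3, 0.0]]
print(total_objective(1.0, layers, alpha=0.1))
```

In this toy case only the first layer contributes, since the non-zero values of the second layer are already equal.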

5 Experiment

In this section, we evaluate the effectiveness of our approach using uncertainty and synaptic operations as additional metrics alongside accuracy. Extensive experiments are conducted on frame-based and event-based datasets.

5.1 Evaluation Metrics

Uncertainty Estimation

It is desired that the resulting SNN is always certain about its predictions at any timestep. Thus, we use the variance of predictions as a metric to evaluate the level of uncertainty. To quantify it, we utilise the widely used ensemble method Lakshminarayanan et al. (2017); Fort et al. (2019), which offers a straightforward approach for quantifying prediction uncertainty. Specifically, we build an ensemble of SNN models, each trained from a different weight initialisation. Such randomness in weight initialisation leads the models to various solutions in the loss landscape, and therefore the variance of predictions from the ensemble members reflects the uncertainty in predictions. Ensemble members are trained in parallel as they do not interact with each other. During inference, the final prediction $\boldsymbol{\mu}(t)$ is the mean of the predictions of all ensemble members:

$$\boldsymbol{\mu}(t)=\frac{1}{M}\sum^{M}_{i}f_{i}(\boldsymbol{X}(t)), \tag{12}$$

where $M$ is the number of members in the ensemble and $t$ is the timestep. Then, the uncertainty or variance of predictions at each timestep is calculated as:

$$\sigma^{2}(t)=\frac{1}{M}\sum^{M}_{i}\lVert f_{i}(\boldsymbol{X}(t))-\boldsymbol{\mu}(t)\rVert_{2}, \tag{13}$$

where larger $\sigma^{2}(t)$ implies higher uncertainty.
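The two ensemble statistics can be sketched for a single input at a single timestep as follows. This is an illustrative plain-Python sketch under stated assumptions: the helper names are hypothetical, each member prediction is assumed to be a list of class scores, and the deviation is measured with the L2 norm exactly as Equation 13 is written.

```python
import math


def ensemble_mean(member_preds):
    """mu(t): element-wise mean of the M member predictions (Equation 12)."""
    m = len(member_preds)
    n = len(member_preds[0])
    return [sum(p[k] for p in member_preds) / m for k in range(n)]


def ensemble_uncertainty(member_preds):
    """sigma^2(t): mean L2 distance of each member to mu(t) (Equation 13)."""
    mu = ensemble_mean(member_preds)
    m = len(member_preds)
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(p, mu)))
        for p in member_preds
    ]
    return sum(distances) / m


# Two members that fully disagree versus two members that agree.
print(ensemble_uncertainty([[1.0, 0.0], [0.0, 1.0]]))  # larger value
print(ensemble_uncertainty([[0.9, 0.1], [0.9, 0.1]]))  # 0.0
```

Members that agree give zero uncertainty, while disagreement between members increases the value.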

Synaptic Operations

The energy efficiency of neuromorphic hardware can be characterised by the energy consumption of a single synaptic operation Merolla et al. (2014). Thus, we follow Rueckauer et al. (2017); Wu et al. (2022) to measure the synaptic operations from simulation for energy consumption estimation, described as:

$$\textit{Synaptic Operations}:\sum_{t}^{T}\sum_{l}^{L}f^{l}_{out}s^{l}, \tag{14}$$

where $f^{l}_{out}$ is the number of output connections and $s^{l}$ is the average number of spikes per neuron of the $l$-th layer.
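Equation 14 amounts to a double sum over timesteps and layers, which can be sketched as below. The function name and the toy numbers are hypothetical, and the spike statistics are assumed to be recorded per timestep and per layer during simulation.

```python
def synaptic_operations(fan_out, spikes_per_timestep):
    """Sum over timesteps t and layers l of f_out^l * s^l (Equation 14).

    fan_out[l]                 -- number of output connections of layer l
    spikes_per_timestep[t][l]  -- spike statistic s^l recorded at timestep t
    """
    total = 0.0
    for spikes_t in spikes_per_timestep:          # sum over t
        for f_out, s in zip(fan_out, spikes_t):   # sum over l
            total += f_out * s
    return total


fan_out = [100, 10]
spikes = [[0.5, 0.2], [0.25, 0.0]]  # two timesteps, two layers

print(synaptic_operations(fan_out, spikes))      # all timesteps
print(synaptic_operations(fan_out, spikes[:1]))  # stopping after the first timestep
```

Because the sum runs over timesteps, stopping inference early directly removes the terms of the skipped timesteps.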

5.2 Experiment Setup

We evaluate SNN on ResNet-19 Fang et al. (2021a); Deng et al. (2022) for Cifar10/100 Krizhevsky et al. (2009), Sew-ResNet-34 Fang et al. (2021a) for ImageNet Russakovsky et al. (2015), VGGSNN Deng et al. (2022) for Cifar10-DVS Li et al. (2017) and N-Caltech101 Orchard et al. (2015), and a 5-layer convolutional network Fang et al. (2021b) for DVS128 Gesture Amir et al. (2017). Instead of rescaling the input, we adopt the approach from Wu et al. (2023), in which we incorporate a downscaling layer prior to the network. Specifically, for Cifar10-DVS and N-Caltech101, we add a convolutional layer with 64 filters, a kernel size of 8, and a stride of 4. The number of filters is increased to 128 for DVS128 Gesture. This modification allows the events to be directly fed into the SNN while preserving the event-driven features.
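To give a sense of how much the downscaling layer shrinks the input, the standard convolution output-size formula can be applied to the stated kernel size and stride. The sketch below is illustrative only: the 128-pixel input side and the zero padding are assumptions made for the example, as the padding is not stated in the text.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Assumed 128-pixel input side with kernel size 8 and stride 4, no padding.
print(conv_output_size(128, kernel_size=8, stride=4))  # 31
```

Under these assumptions each spatial dimension is reduced by roughly a factor of four.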

We employ Stochastic Gradient Descent (SGD) with an initial learning rate of 0.1 and weight decay of 5e-4 for all datasets. The training epochs are set to 300, 120, and 100 for Cifar10/100, ImageNet, and event-based datasets, respectively. The learning rate is decayed to zero at the end of training using the cosine decay schedule Loshchilov and Hutter (2016). For data augmentation, we use autoaugmentation Cubuk et al. (2019) and cutout DeVries and Taylor (2017) for Cifar10/100 and pixel shifting for event-based inputs, i.e., both width and height are randomly shifted within the range [-20%, 20%]. Dropout is applied after the fully-connected layer for DVS128 Gesture to improve the training, with a dropout rate of 0.2. In the results, we use ‘TET’ to denote the baseline Deng et al. (2022) and ‘STR($\cdot$)’ to denote our method with the setting of $\alpha$ in the brackets. We follow ‘TET’ in adopting the surrogate method described in Esser et al. (2015). Since direct training is extraordinarily expensive in terms of training time, we train 3 models for ImageNet and 5 models for the other datasets with different seeds, so that $M$ is 3 or 5 in Equations 12 and 13.
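The cosine decay schedule without restarts can be sketched as follows, using the initial learning rate of 0.1 and the 300 epochs stated for Cifar10/100. Updating once per epoch and the function name `cosine_lr` are assumptions made for illustration; the actual training code may step per iteration.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=0.1):
    """Cosine decay from base_lr at epoch 0 to zero at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


for epoch in (0, 75, 150, 225, 300):
    print(epoch, round(cosine_lr(epoch, total_epochs=300), 4))
```

The rate starts at 0.1, passes through half of that value at the midpoint, and reaches zero at the final epoch.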

5.3 Uncertainty Results

Figure 5: Comparison of uncertainty with respect to timestep on six datasets: (a) Cifar10, (b) Cifar100, (c) ImageNet, (d) Cifar10-DVS, (e) N-Caltech101, (f) DVS128 Gesture. The assessment of uncertainty is conducted on various models with distinct settings of $\alpha$, such as {0.05, 0.1} for Cifar10/100 and {0.3, 0.5} for event-based inputs. Due to the expensive training for ImageNet, we solely evaluate the setting of {0.05}.

Figure 5 depicts the estimated uncertainty of predictions on different datasets, which reveals interesting patterns in how uncertainty evolves over time. Specifically, it can be observed that the predictions at the initial timesteps tend to exhibit large uncertainty on most datasets, which then gradually reduces. This shows that predictions become more reliable as the timestep increases. However, this trend is less obvious on the N-Caltech101 dataset, as shown in Figure 5(e). This is because the N-Caltech101 dataset has more events, providing more useful information for classification. For example, N-Caltech101 has an average of 5230 spikes per second for the input, while Cifar10-DVS only has 85.38 spikes per second. By incorporating STR, we observe a significant decrease in uncertainty in predictions, which implies improved stability and reliability in the predictions throughout the temporal sequence. Table 1 summarises the average uncertainty and the accuracy achieved at the last timestep over all datasets. The results indicate that training with STR consistently decreases uncertainty while maintaining accuracy or even achieving higher accuracy for event-based datasets.

Table 1: Comparison of average uncertainty and accuracy at the last timestep on both frame-based and event-based datasets. The optimal $\alpha$ for each model is selected based on Figure 5, prioritising relatively small variance while preserving accuracy.

| Dataset | Method | T | Avg. $\sigma^{2}$ | Avg. Acc. (%) |
|---|---|---|---|---|
| Cifar10 | TET | 4 | 0.0222 | 95.40 ± 0.05 |
| Cifar10 | STR(0.1) | 4 | 0.0217 | 95.42 ± 0.04 |
| Cifar100 | TET | 4 | 0.0507 | 78.31 ± 0.14 |
| Cifar100 | STR(0.05) | 4 | 0.0506 | 78.37 ± 0.22 |
| ImageNet | TET | 4 | 0.0373 | 67.46 ± 0.02 |
| ImageNet | STR(0.05) | 4 | 0.0370 | 67.54 ± 0.03 |
| Cifar10-DVS | TET | 10 | 0.1061 | 82.38 ± 0.59 |
| Cifar10-DVS | STR(0.5) | 10 | 0.0941 | 82.64 ± 0.44 |
| N-Caltech101 | TET | 10 | 0.0456 | 84.84 ± 0.41 |
| N-Caltech101 | STR(0.5) | 10 | 0.0435 | 85.91 ± 0.54 |
| DVS128 Gesture | TET | 16 | 0.0511 | 97.80 ± 0.37 |
| DVS128 Gesture | STR(0.5) | 16 | 0.0453 | 98.26 ± 0.30 |

5.4 Cutoff Results

As previously highlighted, STR is effective in reducing the variance across timesteps, especially on event-based datasets, for which the maximum timestep used in training is relatively large. Further insights into the comparison between TET and STR are illustrated in Figure 6, considering two different inference types: one with a fixed timestep and the other with the cutoff mechanism. The label ‘w/ cutoff’ signifies results with cutoff. While the curve without cutoff has often been utilised to find a balance between timesteps and accuracy Han et al. (2020); Wu et al. (2022), it has the drawback of fixing the timestep during inference, leading to a notable decline in accuracy when the timestep is small. As TET trains the SNN to predict at each timestep, we directly apply softmax-based cutoff to the resulting SNN models for anytime inference. Precisely, the SNN is cut off when the maximum softmax score at the output surpasses the predetermined threshold.
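The softmax-based cutoff rule can be sketched as a loop over timesteps that stops once the prediction is confident enough. This is a minimal plain-Python sketch under stated assumptions: the helper names are hypothetical, the network outputs are assumed to be given per timestep and averaged over the elapsed timesteps before the softmax, and the comparison is taken as non-strict (`>=`), which the text does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def infer_with_cutoff(outputs_per_timestep, threshold):
    """Return (predicted class, number of timesteps used)."""
    running_sum = [0.0] * len(outputs_per_timestep[0])
    pred = None
    for t, out in enumerate(outputs_per_timestep, start=1):
        running_sum = [a + b for a, b in zip(running_sum, out)]
        # Assumption: confidence is computed on the mean output up to timestep t.
        probs = softmax([a / t for a in running_sum])
        pred = max(range(len(probs)), key=probs.__getitem__)
        if probs[pred] >= threshold:
            return pred, t  # confident enough: stop early
    return pred, len(outputs_per_timestep)  # no cutoff: use all timesteps


outputs = [[1.0, 0.5], [4.0, 0.0], [4.0, 0.0]]
print(infer_with_cutoff(outputs, threshold=0.9))   # (0, 2)
print(infer_with_cutoff(outputs, threshold=0.99))  # (0, 3)
```

A lower threshold stops earlier at the cost of acting on a less confident prediction, which is the latency and accuracy trade-off explored in this section.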

Figure 6: Comparison of accuracy with respect to timestep on six datasets: (a) Cifar10, (b) Cifar100, (c) ImageNet, (d) Cifar10-DVS, (e) N-Caltech101, (f) DVS128 Gesture. Each STR-based model employs the same $\alpha$ setting from Table 1. The results present the accuracy performance using fixed timesteps and cutoff.

Figure 6 presents accuracy with respect to a range of cutoff thresholds, varying from 0.99 to 1.0 for DVS128 Gesture and from 0.8 to 1.0 for the other datasets. In both instances, the threshold range is divided into 20 equally spaced discrete values. It shows that, with cutoff, all models achieve a significant decrease in latency while maintaining accuracy. Compared to frame-based input (e.g., Figure 6(a) to 6(c)), the enhancement from STR in event-based input (e.g., Figure 6(d) to 6(f)) is more substantial. This is attributed to the sparser nature of event input and the greater uncertainty in the associated predictions. In contrast, frame-based input furnishes more information at each timestep, aiding the prediction.
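The threshold sweep can be generated as an evenly spaced grid. The sketch below assumes that both endpoints of the range are included among the 20 values, which the text does not state explicitly; `threshold_grid` is a hypothetical helper name.

```python
def threshold_grid(low, high, count=20):
    """Return `count` equally spaced thresholds from low to high (inclusive)."""
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]


general = threshold_grid(0.8, 1.0)   # most datasets
gesture = threshold_grid(0.99, 1.0)  # DVS128 Gesture
print(len(general), general[0], round(general[-1], 6))
print(len(gesture), gesture[0], round(gesture[-1], 6))
```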

Table 2: Comparison with the existing works on frame-based datasets in regard to both accuracy and latency.

| Methods | Architecture | Avg. Acc. (%) | Avg. $T$ |
|---|---|---|---|
| Li et al. (2021a) | ResNet-18 | 93.13 ± 0.07 | 2 |
| Zheng et al. (2021) | ResNet-19 | 92.92 | 4 |
| Yao et al. (2022) | ResNet-19 | 94.44 ± 0.10 | 2 |
| Duan et al. (2022) | ResNet-19 | 95.45 | 2 |
| [email protected] | ResNet-19 | 95.23 ± 0.05 | 1.26 |
| [email protected] | ResNet-19 | 95.32 ± 0.07 | 1.26 |
| [email protected] | ResNet-19 | 95.40 ± 0.05 | 1.70 |
| [email protected] | ResNet-19 | 95.42 ± 0.05 | 1.67 |

(a) Cifar10

| Methods | Architecture | Avg. Acc. (%) | Avg. $T$ |
|---|---|---|---|
| Li et al. (2021a) | ResNet-18 | 71.68 ± 0.12 | 2 |
| Deng et al. (2022) | ResNet-19 | 72.87 ± 0.10 | 2 |
| Yao et al. (2022) | ResNet-19 | 75.48 ± 0.08 | 2 |
| Duan et al. (2022) | ResNet-19 | 78.07 | 2 |
| [email protected] | ResNet-19 | 78.25 ± 0.17 | 2.42 |
| [email protected] | ResNet-19 | 78.34 ± 0.21 | 2.40 |
| [email protected] | ResNet-19 | 78.31 ± 0.14 | 3.63 |
| [email protected] | ResNet-19 | 78.37 ± 0.22 | 3.51 |

(b) Cifar100

| Methods | Architecture | Avg. Acc. (%) | Avg. $T$ |
|---|---|---|---|
| Bu et al. (2022) | ResNet-34 | 59.35 | 16 |
| Meng et al. (2022) | ResNet-34 | 67.05 | 6 |
| Fang et al. (2021a) | SewResNet-34 | 67.04 | 4 |
| Duan et al. (2022) | SewResNet-34 | 68.28 | 4 |
| [email protected] | SewResNet-34 | 67.40 ± 0.02 | 2.70 |
| [email protected] | SewResNet-34 | 67.46 ± 0.04 | 2.70 |
| [email protected] | SewResNet-34 | 67.46 ± 0.02 | 3.60 |
| [email protected] | SewResNet-34 | 67.54 ± 0.03 | 3.60 |

(c) ImageNet
Table 3: Comparison with the existing works on event-based datasets in regard to both accuracy and latency. Note: * indicates a network with the downscaling layer.

| Methods | Architecture | Avg. Acc. (%) | Avg. $T$ |
|---|---|---|---|
| Fang et al. (2021a) | Wide-7B-Net | 74.40 | 16 |
| Meng et al. (2022) | ResNet-19 | 67.80 | 10 |
| Li et al. (2021a) | ResNet-18 | 75.4 ± 0.05 | 10 |
| Li et al. (2022b) | VGG-11 | 81.70 | 10 |
| [email protected] | VGGSNN* | 78.24 ± 0.99 | 1.71 |
| [email protected] | VGGSNN* | 79.70 ± 0.20 | 1.88 |
| [email protected] | VGGSNN* | 80.96 ± 0.50 | 3.01 |
| [email protected] | VGGSNN* | 82.08 ± 0.23 | 3.46 |

(a) Cifar10-DVS

| Methods | Architecture | Avg. Acc. (%) | Avg. $T$ |
|---|---|---|---|
| Li et al. (2021b) | Graph | 76.10 | - |
| Messikommer et al. (2020) | VGG-13 | 74.50 | - |
| Li et al. (2022b) | VGG-11 | 83.70 | 10 |
| [email protected] | VGGSNN* | 84.07 ± 0.16 | 2.98 |
| [email protected] | VGGSNN* | 85.27 ± 0.30 | 2.97 |
| [email protected] | VGGSNN* | 84.77 ± 0.17 | 4.41 |
| [email protected] | VGGSNN* | 85.80 ± 0.26 | 4.37 |

(b) N-Caltech101

| Methods | Architecture | Avg. Acc. (%) | Avg. $T$ |
|---|---|---|---|
| Yao et al. (2021) | 3-layer | 98.61 | 60 |
| Zheng et al. (2021) | ResNet-17 | 96.87 | 40 |
| Fang et al. (2021b) | 5-layer | 97.46 | 20 |
| Fang et al. (2021a) | 7B-Net | 97.92 | 16 |
| [email protected] | 5-layer* | 94.16 ± 0.92 | 4.18 |
| [email protected] | 5-layer* | 96.29 ± 0.73 | 4.18 |
| [email protected] | 5-layer* | 96.74 ± 0.37 | 7.85 |
| [email protected] | 5-layer* | 97.73 ± 0.54 | 7.46 |

(c) DVS128 Gesture

For a comprehensive comparison with state-of-the-art (SOTA) methods, we summarise accuracy at different cutoff thresholds using the format method@cutoff threshold in Tables 2 and 3. Note that our objective is not to achieve higher accuracy than SOTA models, but rather to achieve comparable accuracy while reducing the average timesteps required for inference. The results in Tables 2 and 3 show that our STR approach consistently achieves competitive latency compared with the SOTA and the baseline. Notably, with STR and cutoff, SNN achieves 2.14 to 2.89 times faster inference compared to the SNN presented in Table 1, which uses a fixed timestep. This enhanced efficiency is achieved with a near-zero accuracy drop of 0.50% to 0.64% over the event-based datasets.
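The quoted speed-up range can be checked with simple arithmetic, dividing the fixed timestep in Table 1 by the average timestep in Table 3. The sketch below assumes that the relevant entries are the last row of each event-based sub-table in Table 3; this choice is an assumption, made because it reproduces both ends of the 2.14 to 2.89 range.

```python
def speedup(fixed_timesteps, average_timesteps):
    """Ratio of the fixed timestep to the average timestep under cutoff."""
    return fixed_timesteps / average_timesteps


# (fixed T from Table 1, average T from the last row of each sub-table in Table 3)
rows = {
    "Cifar10-DVS": (10, 3.46),
    "N-Caltech101": (10, 4.37),
    "DVS128 Gesture": (16, 7.46),
}

for name, (t_fixed, t_avg) in rows.items():
    print(f"{name}: {speedup(t_fixed, t_avg):.2f}x")
```

The smallest ratio comes from DVS128 Gesture and the largest from Cifar10-DVS.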

Figure 7 illustrates accuracy relative to synaptic operations, providing clear evidence that cutoff significantly reduces the number of synaptic operations required. It should be noted that the count of synaptic operations for Cifar10/100 excludes the first layer, which is a traditional ANN layer and functions as a spike encoder. STR may introduce additional spikes to enhance accuracy. For instance, in the case of DVS128 Gesture, the number of synaptic operations is higher for STR than for TET. However, compared with the reduction in synaptic operations from cutoff, the increment is marginal.

Figure 7: Comparison of accuracy with respect to synaptic operations. The cutoff threshold is set to {0.99, 1.0, $\infty$} for DVS128 Gesture and {0.9, 1.0, $\infty$} for the others, i.e., a smaller marker indicates a lower cutoff threshold and $\infty$ means no cutoff.

Having presented our experimental results, we now place these findings in the context of related work, specifically SEENN Li et al. (2023b). Our baseline ‘TET’, akin to SEENN-I, is a direct implementation of the method in Deng et al. (2022) that uses the confidence score for the cutoff. This prompts us to focus the comparison on our baseline, which shares identical training settings. Compared with SEENN-II, which requires an additional network to trigger the cutoff, our STR focuses on optimising the SNN itself, maintaining its original structure and enhancing its performance under cutoff. This approach provides an alternative path in SNN advancements, distinct from and not in conflict with external strategies like SEENN-II.

6 Conclusion

In this work, we have demonstrated the effectiveness of STR in enhancing the reliability of SNN for anytime inference scenarios. Combined with the cutoff mechanism, our approach further improves performance metrics such as accuracy and latency, highlighting the comprehensive improvement potential of our novel STR technique in the SNN landscape.

References

  • Akopyan et al. (2015) Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, Gi-Joon Nam, et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE transactions on computer-aided design of integrated circuits and systems, 34(10):1537–1557, 2015.
  • Amir et al. (2017) Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7243–7252, 2017.
  • Angelopoulos and Bates (2021) Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
  • Bu et al. (2022) Tong Bu, Wei Fang, Jianhao Ding, Penglin Dai, Zhaofei Yu, and Tiejun Huang. Optimal ANN-SNN conversion for high-accuracy and ultra-low-latency spiking neural networks. In International Conference on Learning Representations, 2022.
  • Chen et al. (2023) Jiechen Chen, Sangwoo Park, and Osvaldo Simeone. Spikecp: Delay-adaptive reliable spiking neural networks via conformal prediction. arXiv preprint arXiv:2305.11322, 2023.
  • Cubuk et al. (2019) Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 113–123, 2019.
  • Datta et al. (2022) Gourav Datta, Souvik Kundu, Akhilesh R Jaiswal, and Peter A Beerel. Ace-snn: Algorithm-hardware co-design of energy-efficient & low-latency deep spiking neural networks for 3d image recognition. Frontiers in neuroscience, 16:815258, 2022.
  • Davies et al. (2018) Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1):82–99, 2018.
  • Deng and Gu (2021) Shikuang Deng and Shi Gu. Optimal conversion of conventional artificial neural networks to spiking neural networks. In International Conference on Learning Representations, 2021.
  • Deng et al. (2022) Shikuang Deng, Yuhang Li, Shanghang Zhang, and Shi Gu. Temporal efficient training of spiking neural network via gradient re-weighting. In International Conference on Learning Representations, 2022.
  • DeVries and Taylor (2017) Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • Duan et al. (2022) Chaoteng Duan, Jianhao Ding, Shiyan Chen, Zhaofei Yu, and Tiejun Huang. Temporal effective batch normalization in spiking neural networks. In Advances in Neural Information Processing Systems, 2022.
  • Esser et al. (2015) Steve K Esser, Rathinakumar Appuswamy, Paul Merolla, John V Arthur, and Dharmendra S Modha. Backpropagation for energy-efficient neuromorphic computing. Advances in neural information processing systems, 28, 2015.
  • Fang et al. (2021a) Wei Fang, Zhaofei Yu, Yanqi Chen, Tiejun Huang, Timothée Masquelier, and Yonghong Tian. Deep residual learning in spiking neural networks. Advances in Neural Information Processing Systems, 34:21056–21069, 2021a.
  • Fang et al. (2021b) Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2661–2671, 2021b.
  • Fort et al. (2019) Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
  • Han et al. (2020) B. Han, G. Srinivasan, and K. Roy. Rmp-snn: Residual membrane potential neuron for enabling deeper high-accuracy and low-latency spiking neural network. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13558–13567, 2020.
  • Kim et al. (2022a) Youngeun Kim, Yuhang Li, Hyoungseob Park, Yeshwanth Venkatesha, and Priyadarshini Panda. Neural architecture search for spiking neural networks. In European Conference on Computer Vision, pages 36–56. Springer, 2022a.
  • Kim et al. (2022b) Youngeun Kim, Yuhang Li, Hyoungseob Park, Yeshwanth Venkatesha, Ruokai Yin, and Priyadarshini Panda. Exploring lottery ticket hypothesis in spiking neural networks. In European Conference on Computer Vision, pages 102–120. Springer, 2022b.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Toronto, ON, Canada, 2009.
  • Kulkarni and Rajendran (2018) Shruti R. Kulkarni and Bipin Rajendran. Spiking neural networks for handwritten digit recognition—supervised learning and network optimization. Neural Networks, 103:118–127, 2018.
  • Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
  • Li et al. (2022a) Chen Li, Lei Ma, and Steve Furber. Quantization framework for fast spiking neural networks. Frontiers in Neuroscience, 16:918793, 2022a.
  • Li et al. (2023a) Chen Li, Edward Jones, and Steve Furber. Unleashing the potential of spiking neural networks by dynamic confidence. arXiv preprint arXiv:2303.10276, 2023a.
  • Li et al. (2017) Hongmin Li, Hanchao Liu, ** Shi. Cifar10-dvs: an event-stream dataset for object classification. Frontiers in neuroscience, 11:309, 2017.
  • Li et al. (2021a) Yuhang Li, Yufei Guo, Shanghang Zhang, Shikuang Deng, Yongqing Hai, and Shi Gu. Differentiable spike: Rethinking gradient-descent for training spiking neural networks. Advances in Neural Information Processing Systems, 34:23426–23439, 2021a.
  • Li et al. (2021b) Yi** Li, Han Zhou, Bangbang Yang, Ye Zhang, Zhaopeng Cui, Hujun Bao, and Guofeng Zhang. Graph-based asynchronous event processing for rapid object recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 934–943, 2021b.
  • Li et al. (2022b) Yuhang Li, Youngeun Kim, Hyoungseob Park, Tamar Geller, and Priyadarshini Panda. Neuromorphic data augmentation for training spiking neural networks. In European Conference on Computer Vision, pages 631–649. Springer, 2022b.
  • Li et al. (2023b) Yuhang Li, Tamar Geller, Youngeun Kim, and Priyadarshini Panda. Seenn: Towards temporal spiking early-exit neural networks. arXiv preprint arXiv:2304.01230, 2023b.
  • Loshchilov and Hutter (2016) Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • Lu and Sengupta (2020) Sen Lu and Abhronil Sengupta. Exploring the connection between binary and spiking neural networks. Frontiers in Neuroscience, 14:535, 2020.
  • Maass (1997) Wolfgang Maass. Networks of spiking neurons: the third generation of neural network models. Neural networks, 10(9):1659–1671, 1997.
  • Meng et al. (2022) Qingyan Meng, Mingqing Xiao, Shen Yan, Yisen Wang, Zhouchen Lin, and Zhi-Quan Luo. Training high-performance low-latency spiking neural networks by differentiation on spike representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12444–12453, 2022.
  • Merolla et al. (2014) P.A. Merolla, J.V. Arthur, R. Alvarez-Icaza, A.S. Cassidy, J. Sawada, F. Akopyan, B.L. Jackson, N. Imam, C. Guo, Y. Nakamura, and B. Brezzo. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.
  • Messikommer et al. (2020) Nico Messikommer, Daniel Gehrig, Antonio Loquercio, and Davide Scaramuzza. Event-based asynchronous sparse convolutional networks. In European Conference on Computer Vision, pages 415–431. Springer, 2020.
  • Neftci et al. (2019) Emre O Neftci, Hesham Mostafa, and Friedemann Zenke. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine, 36(6):51–63, 2019.
  • Orchard et al. (2015) Garrick Orchard, A**kya Jayawant, Gregory K Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in neuroscience, 9:437, 2015.
  • Pei et al. (2019) **g Pei, Lei Deng, Sen Song, Mingguo Zhao, Youhui Zhang, Shuang Wu, Guanrui Wang, Zhe Zou, Zhenzhi Wu, Wei He, et al. Towards artificial general intelligence with hybrid tianjic chip architecture. Nature, 572(7767):106–111, 2019.
  • Pfeiffer and Pfeil (2018) Michael Pfeiffer and Thomas Pfeil. Deep learning with spiking neurons: Opportunities and challenges. Frontiers in neuroscience, 12:774, 2018.
  • Putra and Shafique (2021) Rachmad Vidya Wicaksana Putra and Muhammad Shafique. Q-spinn: A framework for quantizing spiking neural networks. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021.
  • Rueckauer et al. (2017) Bodo Rueckauer, Iulia-Alexandra Lungu, Yuhuang Hu, Michael Pfeiffer, and Shih-Chii Liu. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in neuroscience, 11:682, 2017.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • Schaefer and Joshi (2020) Clemens JS Schaefer and Siddharth Joshi. Quantizing spiking neural networks with integers. In International Conference on Neuromorphic Systems 2020, pages 1–8, 2020.
  • Shrestha and Orchard (2018) Sumit B Shrestha and Garrick Orchard. Slayer: Spike layer error reassignment in time. Advances in neural information processing systems, 31, 2018.
  • Taherkhani et al. (2020) Aboozar Taherkhani, Georgina Cosma, and Thomas Martin McGinnity. Optimization of output spike train encoding for a spiking neuron based on its spatio–temporal input pattern. IEEE Transactions on Cognitive and Developmental Systems, 12(3):427–438, 2020.
  • Tang et al. (2021) Guangzhi Tang, Neelesh Kumar, Raymond Yoo, and Konstantinos Michmizos. Deep reinforcement learning with population-coded spiking neural network for continuous control. In Conference on Robot Learning, pages 2016–2029. PMLR, 2021.
  • Timcheck et al. (2023) Jonathan Timcheck, Sumit Bam Shrestha, Daniel Ben Dayan Rubin, Adam Kupryjanow, Garrick Orchard, Lukasz Pindor, Timothy Shea, and Mike Davies. The intel neuromorphic dns challenge. Neuromorphic Computing and Engineering, 3(3):034005, 2023.
  • Wu et al. (2022) Dengyu Wu, ** Yi, and Xiaowei Huang. A little energy goes a long way: Build an energy-efficient, accurate spiking neural network from convolutional neural network. Frontiers in neuroscience, 16, 2022.
  • Wu et al. (2023) Dengyu Wu, Gaojie **, Han Yu, ** Yi, and Xiaowei Huang. Optimising event-driven spiking neural network with regularisation and cutoff. arXiv preprint arXiv:2301.09522, 2023.
  • Wu et al. (2018) Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Lu** Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience, 12:331, 2018.
  • Wu et al. (2019) Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Yuan ** Shi. Direct training for spiking neural networks: Faster, larger, better. In Proceedings of the AAAI conference on artificial intelligence, pages 1311–1318, 2019.
  • Yao et al. (2021) Man Yao, Huanhuan Gao, Guangshe Zhao, Dingheng Wang, Yihan Lin, Zhaoxu Yang, and Guoqi Li. Temporal-wise attention spiking neural networks for event streams classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10221–10230, 2021.
  • Yao et al. (2022) Xingting Yao, Fanrong Li, Zitao Mo, and Jian Cheng. Glif: A unified gated leaky integrate-and-fire neuron for spiking neural networks. Advances in Neural Information Processing Systems, 35:32160–32171, 2022.
  • Yin et al. (2021) Bojian Yin, Federico Corradi, and Sander M Bohté. Accurate and efficient time-domain classification with adaptive spiking recurrent neural networks. Nature Machine Intelligence, 3(10):905–913, 2021.
  • Zenke and Ganguli (2018) Friedemann Zenke and Surya Ganguli. Superspike: Supervised learning in multilayer spiking neural networks. Neural computation, 30(6):1514–1541, 2018.
  • Zheng et al. (2021) Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li. Going deeper with directly-trained larger spiking neural networks. In Proceedings of the AAAI conference on artificial intelligence, pages 11062–11070, 2021.
  • Zhou et al. (2023) Zhaokun Zhou, Yuesheng Zhu, Chao He, Yaowei Wang, Shuicheng Yan, Yonghong Tian, and Li Yuan. Spikformer: When spiking neural network meets transformer. In The Eleventh International Conference on Learning Representations, 2023.