Direct Training Needs Regularisation:
Anytime Optimal Inference Spiking Neural Network

Dengyu Wu¹, Yi Qi¹, Kaiwen Cai¹, Gaojie **², ** Yi³, Xiaowei Huang¹
University of Liverpool, Liverpool, UK¹
State Key Laboratory of Computer Science, Institute of Software, CAS, Beijing, China²
Southeast University, Nanjing, China³
{dengyu.wu, xiaowei.huang}@liverpool.ac.uk

Abstract

Spiking Neural Network (SNN) is acknowledged as the next generation of Artificial Neural Network (ANN) and holds great promise in effectively processing spatial-temporal information. However, the choice of timestep becomes crucial as it significantly impacts the accuracy of the neural network training. Specifically, a smaller timestep indicates better performance in efficient computing, resulting in reduced latency and operations. At the same time, a small timestep may lead to low accuracy because too few spikes are available to represent the information. This observation motivates us to develop an SNN that is more reliable for adaptive timestep by introducing a novel regularisation technique, namely Spatial-Temporal Regulariser (STR). Our approach regulates the ratio between the strength of spikes and membrane potential at each timestep. This effectively balances spatial and temporal performance during training, ultimately resulting in an Anytime Optimal Inference (AOI) SNN. Through extensive experiments on frame-based and event-based datasets, our method, in combination with cutoff based on softmax output, achieves state-of-the-art performance in terms of both latency and accuracy. Notably, with STR and cutoff, SNN achieves inference that is $2.14\times$ to $2.89\times$ faster than with the pre-configured timestep, with a near-zero accuracy drop of $0.50\%$ to $0.64\%$ on the event-based datasets. Code available: https://github.com/Dengyu-Wu/AOI-SNN-Regularisation

1 Introduction

Spiking Neural Network (SNN) aims to mimic the behavior of biological neurons in the brain, efficiently processing spatial-temporal information through the use of their inherent dynamics, such as the integration and firing process Maass (1997); Rueckauer et al. (2017); Pfeiffer and Pfeil (2018); Wu et al. (2022). For instance, the integrated membrane potential of SNN retains information from the previous timestep and enables effective processing of temporal information Yao et al. (2021); Yin et al. (2021). Similarly, the generated spikes activate post neurons, allowing them to efficiently propagate the current information through the network, where neurons are triggered sparsely upon receiving spikes. This activation mechanism differs from Artificial Neural Network (ANN), which relies on dense multiplications for forward propagation. In SNN, neurons are only activated when receiving spikes, which leads to sparse and remarkably efficient computations. Given this unique characteristic, SNN is particularly well-suited for implementation on emerging neuromorphic hardware platforms, such as TrueNorth Akopyan et al. (2015), Loihi Davies et al. (2018), and Tianjic Pei et al. (2019), which have empowered SNN to leverage its inherent event-driven nature at the hardware level. This development holds great promise for enabling energy-efficient applications, such as real-time audio denoising Timcheck et al. (2023), low-power gesture recognition Amir et al. (2017) and robotic control Tang et al. (2021).

The rapid progress in SNN has been fueled by the pursuit of energy-efficient and high-performance computing solutions. In the field of neuromorphic computing, the primary focus of algorithm optimisation for SNN has been on improving accuracy Datta et al. (2022); Kulkarni and Rajendran (2018); Taherkhani et al. (2020). From the perspective of the characteristics of SNN, the total inference timestep determines their computing efficiency Rueckauer et al. (2017); Wu et al. (2022). Thus, efforts to reduce the inference complexity of SNN while maintaining accuracy have been ongoing. Techniques such as optimising SNN through training Deng et al. (2022); Duan et al. (2022); Bu et al. (2022) have shown promising results in enhancing the computing efficiency of SNN. Despite these advancements, there is still room for exploring a more adaptive and flexible inference process, as current methods primarily focus on optimising SNN for a pre-configured timestep. Lately, the concept of anytime inference for SNN has garnered increasing attention, as evidenced by recent works Wu et al. (2023); Li et al. (2023b); Chen et al. (2023); Li et al. (2023a). This growing interest highlights a novel direction of efficient computing for SNN. Meanwhile, Wu et al. (2023) suggested that optimising SNN through both training and inference aspects helps achieve Anytime Optimal Inference (AOI). Precisely, a regularisation technique was introduced to train SNN to be optimal for anytime inference.

However, during SNN training, there emerges a delicate balance between optimising the current timestep and considering its potential impact on subsequent ones. As SNN predictions are interconnected, concentrating on minimising loss at one timestep might inadvertently lead to increased loss at the others. This balance highlights the intricate nature of optimising SNN for anytime inference. Simply optimising the average output Wu et al. (2019); Fang et al. (2021a, b) no longer suffices for achieving anytime optimality, as it lacks constraints at each timestep. While the approach of temporal efficient training Deng et al. (2022) comes close to an AOI model by aligning predictions at each timestep closer to the ground truth, it still relies on the average loss across timestep and grapples with the challenge of harmonising the trade-off between spatial and temporal performance.

In this paper, we are interested in optimising SNN for anytime inference through direct training. To achieve this, we introduce a novel regularisation technique that diminishes the influence of the present timestep on the next timestep, thereby yielding SNN capable of providing more reliable predictions across the timesteps. Our key contributions include:

• Introducing the concept of the spatial-temporal factor that helps understand the contributions of spatial and temporal information in SNN.

• Proposing a regularisation technique that dynamically adjusts the spatial-temporal factor during training for enhancing the accuracy at the present timestep.

• Validating our approach with extensive experiments, including uncertainty estimation and cutoff results.

Through these contributions, we aim to build a more efficient and accurate SNN for anytime inference. This entails achieving a lower average timestep for the SNN while concurrently maintaining a high level of accuracy.

2 Related work

Recent research has extensively explored ways to reduce the inherent complexity associated with the inference process of SNN. A significant focus within this research landscape is to train SNN that operates at a small timestep Deng and Gu (2021); Li et al. (2021a); Bu et al. (2022); Duan et al. (2022). Another growing avenue involves the study of adaptive timesteps, providing an alternative way to reduce computing operations by lowering the average timestep Wu et al. (2023); Li et al. (2023a); Chen et al. (2023). Both paths exploit the sparsity and dynamics of SNN to achieve efficient computing.

Spiking Network Training

One such direction involves optimising the training of SNN to achieve better efficiency. For instance, reducing the timestep during inference can significantly improve computing efficiency, as the total timestep determines the overall computational operations. This has been achieved by adding temporal batch normalisation Zheng et al. (2021); Duan et al. (2022), improving surrogate gradient Wu et al. (2018, 2019); Neftci et al. (2019); Li et al. (2021a), optimising loss function for temporal training Deng et al. (2022), and minimising the distance between ANN and SNN activation for ANN-to-SNN conversion Deng and Gu (2021); Wu et al. (2022); Bu et al. (2022). Another line of investigation focuses on exploring efficient architectures for SNN, such as designing novel spike-based architectures Fang et al. (2021a); Zhou et al. (2023) and deploying Network Architecture Search (NAS) in SNN Kim et al. (2022a, b). In addition, quantisation techniques Schaefer and Joshi (2020); Putra and Shafique (2021); Li et al. (2022a), which aim to convert resource-intensive floating-point operations into more efficient integer operations, have also been explored to enhance the efficiency of SNN. Furthermore, Lu and Sengupta (2020) argue that SNN can further benefit from sequential and binarised activation to improve binary network accuracy.

Anytime Optimal Inference

In the realm of SNN, the exploration of anytime inference is still in its early stages and has received relatively limited attention thus far. Nonetheless, a few notable studies have begun to delve into enhancing anytime inference in SNN. For example, Li et al. (2023b) introduced an auxiliary network to predict confidence for early exiting. Similarly, Chen et al. (2023) studied output distribution and integrated conformal prediction Angelopoulos and Bates (2021) for adaptive inference. For conversion-based SNN, Li et al. (2023a) calibrated output confidence across the timesteps, while Wu et al. (2023) suggest that the gap between the largest and second largest outputs can efficiently predict the cutoff time. While these studies efficiently trigger anytime inference in SNN, they have focused on addressing data uncertainty over the inference rather than optimising uncertainty within the SNN model itself.

3 Preliminary

In this section, we introduce the neuron model and direct training of SNN. To facilitate the analysis, we use bold symbols to represent vectors, $l$ to denote the layer index, and $i$ to denote the index of elements. For example, $\boldsymbol{W}^{l}$ is the weight matrix at the $l$-th layer, and $t$ denotes the discrete timestep.

3.1 Leaky Integrate-and-Fire model

Figure 1: (a) Forward propagation in SNN. The input events $\boldsymbol{X}(t)$ stimulate neurons to generate spikes over time. The output $f(\boldsymbol{X}(t))$ can respond when a sufficient number of events are received within a specific time window. $\boldsymbol{\theta}(t)$ and $\tau\boldsymbol{\Delta}(t)$ represent the spatial and temporal information at $t$, respectively. (b) The state update of one LIF neuron at the input layer during the forward propagation process. The weight $W_{i}^{l}$ influences the current contributing to the membrane potential $V_{i}^{l}(t)$. The threshold $V^{l}_{thr}$ determines the level of $V_{i}^{l}(t)$ required to generate a spike.

The leaky integrate-and-fire (LIF) model is widely adopted in the study of SNN, due to its simplicity and biological plausibility. The forward propagation in SNN is shown in Figure 1. The iterative equation of the LIF model in forward propagation can be expressed as follows:

$$\boldsymbol{V}^{l}(t)=\tau^{l}\boldsymbol{\Delta}^{l}(t-1)+\boldsymbol{Z}^{l}(t), \tag{1}$$

where $\boldsymbol{V}^{l}(t)$ represents the membrane potential at layer $l$ and timestep $t$ prior to spike firing, $\tau^{l}$ denotes the decay factor, and $\boldsymbol{\Delta}^{l}(t)=(1-\boldsymbol{\theta}^{l}(t))\cdot\boldsymbol{V}^{l}(t)$ is the residual current after spike firing. Here $\boldsymbol{\theta}^{l}(t)$ is the spike vector, with $\theta_{i}^{l}(t)=1$ if $V_{i}^{l}(t)\geq V^{l}_{thr}$ and $\theta_{i}^{l}(t)=0$ otherwise. Furthermore, $\boldsymbol{Z}^{l}(t)$ denotes the current input and is defined as:

$$\boldsymbol{Z}^{l}(t)=\boldsymbol{W}^{l}\boldsymbol{\theta}^{l-1}(t)+\boldsymbol{b}^{l}\quad\text{when } l>1. \tag{2}$$

According to different inputs, $\boldsymbol{Z}^{l}(t)$ at the first layer, i.e., $\boldsymbol{Z}^{1}(t)$ , can be initialised as:

$$\boldsymbol{Z}^{1}(t)=\begin{cases}\boldsymbol{W}^{1}\boldsymbol{X}(t)+\boldsymbol{b}^{1} & \text{event-based input}\\ \boldsymbol{W}^{1}\boldsymbol{\bar{X}}+\boldsymbol{b}^{1} & \text{frame-based input},\end{cases} \tag{3}$$

where $\boldsymbol{X}(t)$ is the integration of events at the $t$-th timestep and $\boldsymbol{\bar{X}}$ represents the constant current stimulus to the first layer, which equals the analogue values of the input. Note that for frame-based input, $\boldsymbol{Z}^{1}(t)$ is the same at every $t$. To simplify the analysis, we consider event-based input in the following sections.
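As an illustration of Equation 1 and the definitions of $\boldsymbol{\theta}^{l}(t)$ and $\boldsymbol{\Delta}^{l}(t)$, the following is a minimal sketch of one layer's state update in plain Python. The function name `lif_step` and the values `tau=0.5` and `v_thr=1.0` are assumptions chosen for the example, not settings taken from the paper.

```python
def lif_step(delta_prev, z, tau=0.5, v_thr=1.0):
    """One LIF update for a layer (tau and v_thr are illustrative values).

    delta_prev: residual current Delta(t-1), one value per neuron
    z:          input current Z(t), one value per neuron
    Returns (v, theta, delta) for timestep t.
    """
    # Equation 1: membrane potential before firing.
    v = [tau * d + zi for d, zi in zip(delta_prev, z)]
    # Spike indicator: 1 where the potential reaches the threshold.
    theta = [1 if vi >= v_thr else 0 for vi in v]
    # Residual current: the potential is kept only where no spike fired.
    delta = [(1 - th) * vi for th, vi in zip(theta, v)]
    return v, theta, delta


# Two neurons driven by a constant current for three timesteps.
delta = [0.0, 0.0]
spikes = []
for t in range(3):
    v, theta, delta = lif_step(delta, z=[0.6, 1.2])
    spikes.append(theta)
```

In this toy run the second neuron fires at every timestep, while the first needs three timesteps of integration before it reaches the threshold, which is the kind of temporal dependence the later sections aim to regulate.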

3.2 Direct Training

Our approach leverages the backpropagation method to train SNN directly. This strategy considers the states of spiking neuron at each timestep during training and has demonstrated the potential to achieve high-performance SNN, particularly when operating with small timesteps. A prevalent optimisation objective aims to minimise the distance between the average output and ground truth, as explored in prior works Fang et al. (2021a, b); Duan et al. (2022). The loss function is defined as:

$$L_{mean}=L_{ce}\left(\frac{1}{T}\sum_{t=1}^{T}f(\boldsymbol{X}(t)),\boldsymbol{y}\right), \tag{4}$$

where $T$ is the maximum timestep, $L_{ce}$ represents cross entropy loss, $f(\cdot)$ is the SNN model, and $f(\boldsymbol{X}(t))$ denotes the synaptic current output at $t$ -th timestep, and $\boldsymbol{y}$ is the ground truth. However, recent work Deng et al. (2022) proposes an alternative approach, Temporal Efficient Training (TET), to train SNN. TET suggests that employing the average of cross entropy over all timesteps can lead to improved SNN performance, described as:

$$L_{TET}=\frac{1}{T}\sum_{t=1}^{T}L_{ce}(f(\boldsymbol{X}(t)),\boldsymbol{y}). \tag{5}$$

Since the firing process is non-differentiable, surrogate gradient methods, such as the linear type Esser et al. (2015); Wu et al. (2018, 2019) and the non-linear type Zenke and Ganguli (2018); Li et al. (2021a); Shrestha and Orchard (2018), are employed in direct training.
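The difference between Equations 4 and 5 can be made concrete with a small sketch. The helper names and the toy outputs below are hypothetical; the cross entropy is assumed to be the usual softmax cross entropy against an integer class label.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross entropy against an integer class label."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]


def loss_mean(outputs, label):
    """Equation 4: cross entropy of the time-averaged output."""
    T = len(outputs)
    avg = [sum(o[k] for o in outputs) / T for k in range(len(outputs[0]))]
    return cross_entropy(avg, label)


def loss_tet(outputs, label):
    """Equation 5: average of the per-timestep cross entropies."""
    return sum(cross_entropy(o, label) for o in outputs) / len(outputs)


# Hypothetical outputs f(X(t)) for T = 2 timesteps and 2 classes.
# The first timestep favours the wrong class, the second the right one.
outputs = [[2.0, 0.0], [0.0, 6.0]]
l_mean = loss_mean(outputs, label=1)
l_tet = loss_tet(outputs, label=1)
```

Here `l_mean` is small because the averaged output is already correct, whereas `l_tet` stays large because the first timestep alone is wrong. Since softmax cross entropy is convex in the logits, the TET loss is never smaller than the mean-output loss, so it also penalises errors that the average would hide.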

Training Methods and AOI

Figure 2 presents accuracy results across all timesteps using two loss functions, i.e., $L_{mean}$ and $L_{TET}$, on Cifar10-DVS Li et al. (2017). The training of these two models follows the same strategy specified in Section 5.2. It is not surprising that $L_{mean}$ exhibits limited capability in achieving AOI during inference. For example, the accuracy of $L_{mean}$ experiences a significant drop when the timestep is small. This type of training does not train each timestep to yield accurate predictions; instead, it prioritises minimising the loss based on the average output, typically computed at the last timestep. Acknowledging the reliability of TET in achieving AOI, we regularise SNN training through TET in our study.

4 Method

In this section, we present the details of the two-fold approach that helps SNN to achieve AOI. Firstly, we introduce the Spatial-Temporal Factor (STF) to help better understand how SNN utilise information over different timesteps during inference. Secondly, we propose the Spatial-Temporal Regulariser (STR) to encourage SNN to prioritise the present timestep rather than relying solely on the next timestep to achieve minimal loss during training.

4.1 Spatial-Temporal Factor

To gain a deeper understanding of the forward propagation process of SNN, we decompose the vector $\boldsymbol{V}^{l}(t)$ into two orthogonal components. This decomposition facilitates a detailed analysis of the individual contributions made by these components, unveiling the underlying mechanisms involved in information processing within the network. By dissecting the vector $\boldsymbol{V}^{l}(t)$ in this manner, we can explore how each component influences the dynamics and transformations of information in the SNN. Mathematically, the decomposition is formulated as:

$$\boldsymbol{V}^{l}(t)=V_{thr}^{l}\boldsymbol{\theta}^{l}(t)+\tau\boldsymbol{\Delta}^{l}(t), \tag{6}$$

where the clipped value of $\boldsymbol{V}^{l}(t)$ is ignored as it does not contribute to either the current or the next timestep.

Building on this decomposition, we introduce the Spatial-Temporal Factor (STF), symbolised as $\xi^{l}(t)$, to assess the interplay of $V_{thr}^{l}\boldsymbol{\theta}^{l}(t)$ and $\tau\boldsymbol{\Delta}^{l}(t)$ on the network. Specifically, the STF is defined as the ratio between the L2 norms of these two elements, offering an approximate estimation of their impacts on the current or future states of the network. This is encapsulated in the following expression:

$$\xi^{l}(t)=\tilde{\alpha}\frac{\lVert\boldsymbol{\theta}^{l}(t)\rVert_{2}}{\lVert\boldsymbol{\Delta}^{l}(t)\rVert_{2}}, \tag{7}$$

where $\lVert\cdot\rVert_{2}$ is the L2 norm, i.e., $\lVert\boldsymbol{x}\rVert_{2}=\sqrt{\sum_{i}x_{i}^{2}}$ . We define $\tilde{\alpha}$ as a consolidation of the constants $V_{thr}^{l}$ and $\tau$ , which streamlines the equation and focuses attention on the relationship between $\lVert\boldsymbol{\theta}^{l}(t)\rVert_{2}$ and $\lVert\boldsymbol{\Delta}^{l}(t)\rVert_{2}$ . Given that $\tilde{\alpha}$ is a constant, it can be seamlessly integrated into the hyper-parameter $\alpha$ in Equation 11. This formulation allows us to focus on the optimisation objective that encompasses both spatial and temporal information within a single term. The influence of regularisation on Equation 7 is visually demonstrated in Figure 3, with a detailed exploration of the regularisation technique presented in Section 4.2.
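To make Equation 7 concrete, here is a minimal sketch that computes the STF of one layer at one timestep. The function names, the default `alpha_tilde=1.0`, and the small `eps` added to the denominator are assumptions of this example; the paper does not state how a zero residual is handled.

```python
import math


def l2_norm(x):
    """L2 norm of a list of numbers."""
    return math.sqrt(sum(xi * xi for xi in x))


def stf(theta, delta, alpha_tilde=1.0, eps=1e-12):
    """Equation 7: ratio of spike strength to residual current.

    eps guards against a zero denominator (an assumption of this sketch).
    """
    return alpha_tilde * l2_norm(theta) / (l2_norm(delta) + eps)


theta = [1, 0, 1, 1]          # three of four neurons fire
delta = [0.0, 0.6, 0.0, 0.0]  # only the silent neuron keeps a residual
xi = stf(theta, delta)
```

A larger value means the layer passes more of its information forward as spikes at the current timestep and carries less over to the next one through the residual current.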

4.2 Spatial-Temporal Regularisation

This section delves into the design of a regulariser based on the STF. Our prior analysis explained that TET prompts the network to minimise average loss without considering the sequential input order, leading to uncertain predictions at each timestep. To address this, we propose increasing the STF to decouple the influence of temporal information from network training. Nonetheless, this endeavor poses challenges, particularly in determining appropriate STF values for each layer without compromising accuracy. The right side of Figure 4 shows the STF distribution across different layers for both correct and wrong predictions. On the left side, we provide additional insights based on average values of STF with respect to the layer index. One noticeable observation is the progressive increase in STF as the layers grow deeper. Additionally, our experimental findings uncover a noteworthy discrepancy in STF values between correct and wrong predictions, particularly in the deeper layers. This observation indicates that wrong predictions tend to demonstrate lower STF values within these deeper layers. This intriguing phenomenon suggests that the SNN is actively involved in dampening spike occurrences when faced with difficult inputs.

To increase STF without sacrificing the accuracy, our goal is to eliminate the worst case – the STF is relatively small – during training. Thus, our regularisation only considers correct predictions during training, achieved by masking $\xi^{l}(t)$ as follows:

$$\tilde{\xi}^{l}(t)=\begin{cases}\xi^{l}(t) & \text{if correct prediction}\\ 0 & \text{else}.\end{cases} \tag{8}$$

Next, we assume that in an AOI-SNN, each timestep should contribute equally. To formalise this concept, we define $\boldsymbol{\Xi}^{l}$ as a set of $\tilde{\xi}^{l}(t)$ with all possible timesteps, expressed as:

$$\boldsymbol{\Xi}^{l}=[\tilde{\xi}^{l}(1),\tilde{\xi}^{l}(2),...,\tilde{\xi}^{l}(T)]^{\top}. \tag{9}$$

Then, STR is formulated as:

$$R(\boldsymbol{\Xi}^{l})=(\tilde{\xi}^{l}_{min}-\tilde{\xi}^{l}_{max})^{2}, \tag{10}$$

where $\tilde{\xi}_{min}^{l}$ and $\tilde{\xi}_{max}^{l}$ are the minimum and maximum values, respectively, in a mini-batch of $\boldsymbol{\Xi}^{l}$. Note that both values are non-zero so that the incorrect samples can be excluded during regularisation. We set $\tilde{\xi}_{max}^{l}$ to a relatively optimal value, because it is generally large while still ensuring correct predictions. To consider the STR across the total $L$ layers and adjust the loss function with the hyper-parameter $\alpha$, the final objective function becomes:

$$L_{TET}+\alpha\sum_{l=1}^{L}R(\boldsymbol{\Xi}^{l}). \tag{11}$$
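Equations 8 to 11 can be sketched as follows. This is only an illustration of the arithmetic: it has no automatic differentiation, the function names are hypothetical, and it assumes one correctness flag per sample (the paper does not specify whether correctness is judged per sample or per timestep).

```python
def str_penalty(xi_batch, correct):
    """Equations 8-10 for one layer.

    xi_batch: per-sample lists [xi(1), ..., xi(T)] of STF values
    correct:  per-sample booleans, True if the prediction is correct
              (one flag per sample is an assumption of this sketch)
    """
    # Equation 8: keep the STF only for correctly predicted samples.
    kept = [x for xi, ok in zip(xi_batch, correct) if ok for x in xi]
    # Only non-zero values may serve as minimum or maximum.
    kept = [x for x in kept if x > 0.0]
    if not kept:
        return 0.0
    # Equation 10: squared gap between the smallest and largest STF.
    return (min(kept) - max(kept)) ** 2


def total_loss(l_tet, xi_per_layer, correct, alpha):
    """Equation 11: TET loss plus the weighted STR summed over layers."""
    return l_tet + alpha * sum(str_penalty(xi, correct) for xi in xi_per_layer)


# Two layers, two samples, T = 2; the second sample is misclassified.
xi_per_layer = [
    [[1.0, 2.0], [0.5, 3.0]],
    [[2.0, 2.0], [9.0, 9.0]],
]
correct = [True, False]
loss = total_loss(l_tet=1.0, xi_per_layer=xi_per_layer, correct=correct, alpha=0.5)
```

In the example, the misclassified sample contributes nothing, the first layer is penalised for the gap between its STF values 1.0 and 2.0, and the second layer incurs no penalty because its STF is already uniform across timesteps.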

5 Experiment

In this section, we evaluate the effectiveness of our approach using uncertainty and synaptic operations as additional metrics alongside accuracy. Extensive experiments are conducted on frame-based and event-based datasets.

5.1 Evaluation Metrics

Uncertainty Estimation

It is desirable that the resulting SNN is always certain about its predictions at any timestep. Thus, we use the variance of predictions as a metric to evaluate the level of uncertainty. To quantify the uncertainty in prediction, we utilise the widely used ensemble method Lakshminarayanan et al. (2017); Fort et al. (2019), as it offers a straightforward approach. Specifically, we build an ensemble of SNN models, each trained from a different weight initialisation. Such randomness in weight initialisation leads the models to various solutions in the loss landscape, and therefore the variance of predictions from the ensemble members reflects the uncertainty in predictions. Ensemble members are trained in parallel as they do not interact with each other. During inference, the final prediction $\boldsymbol{\mu}(t)$ is the mean of the predictions of all ensemble members:

$$\boldsymbol{\mu}(t)=\frac{1}{M}\sum_{i=1}^{M}f_{i}(\boldsymbol{X}(t)), \tag{12}$$

where $M$ is the number of members in the ensemble and $t$ is the timestep. Then, the uncertainty or variance of predictions at each timestep is calculated as:

$$\sigma^{2}(t)=\frac{1}{M}\sum_{i=1}^{M}\lVert f_{i}(\boldsymbol{X}(t))-\boldsymbol{\mu}(t)\rVert_{2}, \tag{13}$$

where larger $\sigma^{2}(t)$ implies higher uncertainty.
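The ensemble statistics in Equations 12 and 13 can be sketched as below for a single input at a single timestep. The function names and the toy prediction vectors are hypothetical, and the code follows Equation 13 as printed, i.e., it averages the (unsquared) L2 distance of each member from the ensemble mean.

```python
import math


def ensemble_mean(preds):
    """Equation 12: element-wise mean over the M member predictions."""
    M = len(preds)
    return [sum(p[k] for p in preds) / M for k in range(len(preds[0]))]


def ensemble_uncertainty(preds):
    """Equation 13: mean L2 distance of each member from the ensemble mean."""
    mu = ensemble_mean(preds)
    dists = [
        math.sqrt(sum((pk - mk) ** 2 for pk, mk in zip(p, mu)))
        for p in preds
    ]
    return sum(dists) / len(preds)


# M = 2 members that disagree completely, then two that agree.
disagree = ensemble_uncertainty([[1.0, 0.0], [0.0, 1.0]])
agree = ensemble_uncertainty([[0.9, 0.1], [0.9, 0.1]])
```

When the members agree the measure is zero, and it grows as their predictions spread apart, which is why a lower value is read as a more reliable prediction at that timestep.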

Synaptic Operations

The energy efficiency of neuromorphic hardware can be characterised by the energy consumption of a single synaptic operation Merolla et al. (2014). Thus, we follow Rueckauer et al. (2017); Wu et al. (2022) and measure the synaptic operations in simulation to estimate energy consumption, described as:

$$\text{Synaptic Operations}: \sum_{t=1}^{T}\sum_{l=1}^{L}f^{l}_{out}s^{l}, \tag{14}$$

where $f^{l}_{out}$ is the number of output connections and $s^{l}$ is the average number of spikes per neuron of the $l$ -th layer.
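A minimal sketch of Equation 14 is given below. It assumes that $s^{l}$ is recorded separately at each timestep, and the fan-out and spike counts are made-up numbers for a two-layer toy network.

```python
def synaptic_operations(fan_out, spikes_per_neuron):
    """Equation 14.

    fan_out:           fan_out[l] is f_out^l, the output connections of layer l
    spikes_per_neuron: spikes_per_neuron[t][l] is the average number of spikes
                       per neuron of layer l at timestep t (assumed layout)
    """
    return sum(
        f * s
        for step in spikes_per_neuron
        for f, s in zip(fan_out, step)
    )


fan_out = [100, 10]
spikes = [[0.5, 0.2], [0.25, 0.0]]  # T = 2 timesteps, L = 2 layers

full = synaptic_operations(fan_out, spikes)           # all timesteps
truncated = synaptic_operations(fan_out, spikes[:1])  # stop after t = 1
```

Because the count is a sum over timesteps, stopping inference early removes the terms of all remaining timesteps, which is how a cutoff translates into fewer synaptic operations.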

5.2 Experiment Setup

We evaluate SNN on ResNet-19 Fang et al. (2021a); Deng et al. (2022) for Cifar10/100 Krizhevsky et al. (2009), Sew-ResNet-34 Fang et al. (2021a) for ImageNet Russakovsky et al. (2015), VGGSNN Deng et al. (2022) for Cifar10-DVS Li et al. (2017) and N-Caltech101 Orchard et al. (2015), and a 5-layer convolutional network Fang et al. (2021b) for DVS128 Gesture Amir et al. (2017). Instead of rescaling the input, we adopt the approach from Wu et al. (2023), in which we incorporate a downscaling layer prior to the network. Specifically, for Cifar10-DVS and N-Caltech101, we add a convolutional layer comprising 64 filters, a kernel size of 8, and strides of 4. The number of filters is increased to 128 for DVS128 Gesture. This modification allows the events to be directly fed into the SNN while preserving the event-driven features.

We employ Stochastic Gradient Descent (SGD) with an initial learning rate of 0.1 and weight decay of 5e-4 for all datasets. The training epochs are set to 300, 120, and 100 for Cifar10/100, ImageNet, and event-based datasets, respectively. The learning rate is decayed to zero at the end of training using the cosine decay schedule Loshchilov and Hutter (2016). For data augmentation, we use autoaugmentation Cubuk et al. (2019) and cutout DeVries and Taylor (2017) for Cifar10/100 and pixel shifting for event-based inputs, i.e., both width and height are randomly shifted within the range [-20%, 20%]. Dropout with a rate of 0.2 is applied after the fully-connected layer for DVS128 Gesture to improve the training. In the results, we use ‘TET’ to denote the baseline Deng et al. (2022) and ‘STR($\cdot$)’ to denote our method with the setting of $\alpha$ in the bracket. We follow ‘TET’ and adopt the surrogate method described in Esser et al. (2015). Since direct training is extremely expensive in terms of training time, we train 3 models for ImageNet and 5 models for the other datasets with different seeds, so that $M$ is 3 or 5 in Equations 12 and 13.
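For reference, a cosine schedule that decays the learning rate from 0.1 to zero can be sketched as follows. This is the standard cosine annealing formula without restarts; whether the paper updates the rate per epoch or per iteration is not stated, so the per-epoch form here is an assumption.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=0.1):
    """Cosine decay from base_lr at epoch 0 to zero at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


# Learning rate at the start, middle, and end of a 300-epoch run.
schedule = [cosine_lr(e, 300) for e in (0, 150, 300)]
```

The rate stays close to its initial value early on, passes through half of it at the midpoint, and reaches zero exactly at the final epoch.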

5.3 Uncertainty Results

Figure 5 depicts the estimated uncertainty of predictions on different datasets, which reveals interesting patterns in how uncertainty evolves over time. Specifically, predictions at the initial timestep tend to exhibit large uncertainty on most datasets, and the uncertainty then gradually decreases. This shows that predictions become more reliable as the timestep increases. However, this trend is less pronounced on the N-Caltech101 dataset, as shown in Figure 5(e). This is because the N-Caltech101 dataset has more events, providing more useful information for classification. For example, N-Caltech101 has an average of 5230 spikes per second for the input, while Cifar10-DVS only has 85.38 spikes per second. By incorporating STR, we observe a significant decrease in the uncertainty of predictions, which implies improved stability and reliability throughout the temporal sequence. Table 1 summarises the average uncertainty and the accuracy achieved at the last timestep over all datasets. The results indicate that training with STR consistently decreases uncertainty while maintaining accuracy or even achieving higher accuracy for event-based datasets.

Table 1: Comparison of average uncertainty and accuracy at the last timestep on both frame-based and event-based datasets. The optimal $\alpha$ for each model is selected based on Figure 5, prioritising relatively small variance while preserving accuracy.

Dataset	Method	T	Avg. $\sigma^{2}$	Avg. Acc. (%)
Cifar10	TET	4	0.0222	95.40 $\pm$ 0.05
Cifar10	STR(0.1)	4	0.0217	95.42 $\pm$ 0.04
Cifar100	TET	4	0.0507	78.31 $\pm$ 0.14
Cifar100	STR(0.05)	4	0.0506	78.37 $\pm$ 0.22
ImageNet	TET	4	0.0373	67.46 $\pm$ 0.02
ImageNet	STR(0.05)	4	0.0370	67.54 $\pm$ 0.03
Cifar10-DVS	TET	10	0.1061	82.38 $\pm$ 0.59
Cifar10-DVS	STR(0.5)	10	0.0941	82.64 $\pm$ 0.44
N-Caltech101	TET	10	0.0456	84.84 $\pm$ 0.41
N-Caltech101	STR(0.5)	10	0.0435	85.91 $\pm$ 0.54
DVS128 Gesture	TET	16	0.0511	97.80 $\pm$ 0.37
DVS128 Gesture	STR(0.5)	16	0.0453	98.26 $\pm$ 0.30

5.4 Cutoff Results

As previously highlighted, using STR can be effective in reducing the variance across the timesteps, especially on event-based datasets, in which the maximum timestep used for training is relatively large. Further insights into the comparison between TET and STR are illustrated in Figure 6, considering two different inference types: one with a fixed timestep and the other with the cutoff mechanism. The label ‘w/ cutoff’ signifies results with cutoff. While the curve without cutoff has often been utilised to find a balance between timesteps and accuracy Han et al. (2020); Wu et al. (2022), it has the drawback of fixing the timestep during inference, leading to a notable decline in accuracy when the timestep is small. As TET trains SNN to predict at each timestep, we directly apply softmax-based cutoff to the resulting SNN models for anytime inference. Precisely, the SNN is cut off when the maximum softmax score at the output surpasses the predetermined threshold.
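The softmax-based cutoff described above can be sketched as follows. The function names and toy outputs are hypothetical, and the sketch assumes that the softmax is applied to the running mean of the outputs up to the current timestep; the text only says that the maximum softmax score at the output is compared with the threshold.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of numbers."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def infer_with_cutoff(outputs, threshold):
    """Return (predicted class, number of timesteps used).

    outputs: per-timestep output vectors f(X(t)) for t = 1..T
    """
    acc = [0.0] * len(outputs[0])
    for t, out in enumerate(outputs, start=1):
        acc = [a + o for a, o in zip(acc, out)]
        # Assumption: confidence is taken from the running mean output.
        probs = softmax([a / t for a in acc])
        if max(probs) > threshold:
            break  # confident enough, stop before the maximum timestep
    return probs.index(max(probs)), t


outputs = [[1.0, 0.5, 0.0], [0.0, 4.0, 0.0], [0.0, 6.0, 0.0]]
early = infer_with_cutoff(outputs, threshold=0.7)
late = infer_with_cutoff(outputs, threshold=0.8)
```

Raising the threshold makes the model wait for more evidence, so the same input is classified after two timesteps with the lower threshold and after three with the higher one. A threshold of 1.0 can never be surpassed and therefore reproduces fixed-timestep inference.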

Figure 6 presents accuracy over a range of cutoff thresholds, varying from 0.99 to 1.0 for DVS128 Gesture and from 0.8 to 1.0 for the other datasets. In both instances, the threshold range is divided into 20 equally spaced values. It shows that, with cutoff, all models achieve a significant decrease in latency while maintaining accuracy. Compared to frame-based input (e.g., Figure 6(a) to 6(c)), the enhancement from STR on event-based input (e.g., Figure 6(d) to 6(f)) is more substantial. This is attributed to the sparser nature and greater uncertainty in predictions associated with event input. In contrast, frame-based input data furnishes more information at each timestep, aiding in the prediction.

Table 2: Comparison with existing works on frame-based datasets with regard to both accuracy and latency.

Methods	Architecture	Avg. Acc. (%)	Avg. $T$
Li et al. (2021a)	ResNet-18	93.13 $\pm$ 0.07	2
Zheng et al. (2021)	ResNet-19	92.92	4
Yao et al. (2022)	ResNet-19	94.44 $\pm$ 0.10	2
Duan et al. (2022)	ResNet-19	95.45	2
[email protected]	ResNet-19	95.23 $\pm$ 0.05	1.26
[email protected]	ResNet-19	95.32 $\pm$ 0.07	1.26
[email protected]	ResNet-19	95.40 $\pm$ 0.05	1.70
[email protected]	ResNet-19	95.42 $\pm$ 0.05	1.67

(a) Cifar10

Methods	Architecture	Avg. Acc. (%)	Avg. $T$
Li et al. (2021a)	ResNet-18	71.68 $\pm$ 0.12	2
Deng et al. (2022)	ResNet-19	72.87 $\pm$ 0.10	2
Yao et al. (2022)	ResNet-19	75.48 $\pm$ 0.08	2
Duan et al. (2022)	ResNet-19	78.07	2
[email protected]	ResNet-19	78.25 $\pm$ 0.17	2.42
[email protected]	ResNet-19	78.34 $\pm$ 0.21	2.40
[email protected]	ResNet-19	78.31 $\pm$ 0.14	3.63
[email protected]	ResNet-19	78.37 $\pm$ 0.22	3.51

(b) Cifar100

Methods	Architecture	Avg. Acc. (%)	Avg. $T$
Bu et al. (2022)	ResNet-34	59.35	16
Meng et al. (2022)	ResNet-34	67.05	6
Fang et al. (2021a)	SewResNet-34	67.04	4
Duan et al. (2022)	SewResNet-34	68.28	4
[email protected]	SewResNet-34	67.40 $\pm$ 0.02	2.70
[email protected]	SewResNet-34	67.46 $\pm$ 0.04	2.70
[email protected]	SewResNet-34	67.46 $\pm$ 0.02	3.60
[email protected]	SewResNet-34	67.54 $\pm$ 0.03	3.60

(c) ImageNet

Table 3: Comparison with existing works on event-based datasets with regard to both accuracy and latency. Note: * indicates a network with the downscaling layer.

Methods	Architecture	Avg. Acc. (%)	Avg. $T$
Fang et al. (2021a)	Wide-7B-Net	74.40	16
Meng et al. (2022)	ResNet-19	67.80	10
Li et al. (2021a)	ResNet-18	75.4 $\pm$ 0.05	10
Li et al. (2022b)	VGG-11	81.70	10
[email protected]	VGGSNN*	78.24 $\pm$ 0.99	1.71
[email protected]	VGGSNN*	79.70 $\pm$ 0.20	1.88
[email protected]	VGGSNN*	80.96 $\pm$ 0.50	3.01
[email protected]	VGGSNN*	82.08 $\pm$ 0.23	3.46

(a) Cifar10-DVS

Methods	Architecture	Avg. Acc. (%)	Avg. $T$
Li et al. (2021b)	Graph	76.10	-
Messikommer et al. (2020)	VGG-13	74.50	-
Li et al. (2022b)	VGG-11	83.70	10
[email protected]	VGGSNN*	84.07 $\pm$ 0.16	2.98
[email protected]	VGGSNN*	85.27 $\pm$ 0.30	2.97
[email protected]	VGGSNN*	84.77 $\pm$ 0.17	4.41
[email protected]	VGGSNN*	85.80 $\pm$ 0.26	4.37

(b) N-Caltech101

Methods	Architecture	Avg. Acc. (%)	Avg. $T$
Yao et al. (2021)	3-layer	98.61	60
Zheng et al. (2021)	ResNet-17	96.87	40
Fang et al. (2021b)	5-layer	97.46	20
Fang et al. (2021a)	7B-Net	97.92	16
[email protected]	5-layer*	94.16 $\pm$ 0.92	4.18
[email protected]	5-layer*	96.29 $\pm$ 0.73	4.18
[email protected]	5-layer*	96.74 $\pm$ 0.37	7.85
[email protected]	5-layer*	97.73 $\pm$ 0.54	7.46

(c) DVS128 Gesture

For a comprehensive comparison with state-of-the-art (SOTA) methods, we summarise accuracy data based on different cutoff thresholds using the format method@cutoff threshold in Tables 2 and 3. Note that our objective is not to achieve higher accuracy than SOTA models, but rather to achieve comparable accuracy while reducing the average timesteps required for inference. By examining the results in Tables 2 and 3, it is evident that our STR approach consistently achieves competitive latency compared with the SOTA and the baseline. Notably, when employing these techniques, SNN showcases a remarkable acceleration in inference. With STR and cutoff, SNN achieves inference that is $2.14$ to $2.89$ times faster than the SNN presented in Table 1, which uses a fixed timestep. This enhanced efficiency is achieved with a near-zero accuracy drop of $0.50\%$ to $0.64\%$ on the event-based datasets.
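The quoted speed-up range can be reproduced with simple arithmetic, as sketched below. The sketch assumes that the speed-up is the fixed timestep $T$ from Table 1 divided by the average timestep in the last row of each sub-table of Table 3, and that those rows are the STR results for the respective dataset; both are readings of the tables, not statements from the text.

```python
# (fixed timestep T from Table 1, average timestep with cutoff from Table 3)
# Pairing the last row of each sub-table with STR is an assumption.
results = {
    "Cifar10-DVS": (10, 3.46),
    "N-Caltech101": (10, 4.37),
    "DVS128 Gesture": (16, 7.46),
}

speedup = {name: round(T / avg_t, 2) for name, (T, avg_t) in results.items()}
lowest, highest = min(speedup.values()), max(speedup.values())
```

Under these assumptions the ratios span 2.14 (DVS128 Gesture) to 2.89 (Cifar10-DVS), matching the range stated above.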

Figure 7 illustrates accuracy relative to synaptic operations, providing clear evidence that cutoff significantly reduces the number of synaptic operations required. Note that the count of synaptic operations for Cifar10/100 excludes the first layer, which is a traditional ANN layer and functions as a spike encoder. STR may introduce additional spikes to enhance accuracy. For instance, in the case of DVS128 Gesture, the number of synaptic operations is higher for STR than for TET. However, compared with the reduction in synaptic operations from cutoff, the increment is marginal.

Having presented our experimental results, we now place them in the context of related work, specifically SEENN Li et al. (2023b). Our baseline ‘TET’, akin to SEENN-I, is a direct implementation of the method of Deng et al. (2022) that uses the confidence score for the cutoff. We therefore focus the comparison on our baseline, which shares identical training settings. Compared with SEENN-II, which requires an additional network to trigger the cutoff, our STR focuses on optimising the SNN itself, maintaining its original structure and enhancing its performance under cutoff. This approach provides an alternative path in SNN advancement, distinct from and not in conflict with external strategies such as SEENN-II.

6 Conclusion

We have demonstrated the effectiveness of STR in enhancing the reliability of SNN for anytime inference scenarios. Combined with the cutoff mechanism, our approach further improves performance metrics such as accuracy and latency, highlighting the improvement potential of STR in the SNN landscape.

References

Akopyan et al. (2015) Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, Gi-Joon Nam, et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE transactions on computer-aided design of integrated circuits and systems, 34(10):1537–1557, 2015.
Amir et al. (2017) Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7243–7252, 2017.
Angelopoulos and Bates (2021) Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
Bu et al. (2022) Tong Bu, Wei Fang, Jianhao Ding, Penglin Dai, Zhaofei Yu, and Tiejun Huang. Optimal ANN-SNN conversion for high-accuracy and ultra-low-latency spiking neural networks. In International Conference on Learning Representations, 2022.
Chen et al. (2023) Jiechen Chen, Sangwoo Park, and Osvaldo Simeone. Spikecp: Delay-adaptive reliable spiking neural networks via conformal prediction. arXiv preprint arXiv:2305.11322, 2023.
Cubuk et al. (2019) Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 113–123, 2019.
Datta et al. (2022) Gourav Datta, Souvik Kundu, Akhilesh R Jaiswal, and Peter A Beerel. Ace-snn: Algorithm-hardware co-design of energy-efficient & low-latency deep spiking neural networks for 3d image recognition. Frontiers in neuroscience, 16:815258, 2022.
Davies et al. (2018) Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1):82–99, 2018.
Deng and Gu (2021) Shikuang Deng and Shi Gu. Optimal conversion of conventional artificial neural networks to spiking neural networks. In International Conference on Learning Representations, 2021.
Deng et al. (2022) Shikuang Deng, Yuhang Li, Shanghang Zhang, and Shi Gu. Temporal efficient training of spiking neural network via gradient re-weighting. In International Conference on Learning Representations, 2022.
DeVries and Taylor (2017) Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
Duan et al. (2022) Chaoteng Duan, Jianhao Ding, Shiyan Chen, Zhaofei Yu, and Tiejun Huang. Temporal effective batch normalization in spiking neural networks. In Advances in Neural Information Processing Systems, 2022.
Esser et al. (2015) Steve K Esser, Rathinakumar Appuswamy, Paul Merolla, John V Arthur, and Dharmendra S Modha. Backpropagation for energy-efficient neuromorphic computing. Advances in neural information processing systems, 28, 2015.
Fang et al. (2021a) Wei Fang, Zhaofei Yu, Yanqi Chen, Tiejun Huang, Timothée Masquelier, and Yonghong Tian. Deep residual learning in spiking neural networks. Advances in Neural Information Processing Systems, 34:21056–21069, 2021a.
Fang et al. (2021b) Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2661–2671, 2021b.
Fort et al. (2019) Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
Han et al. (2020) B. Han, G. Srinivasan, and K. Roy. Rmp-snn: Residual membrane potential neuron for enabling deeper high-accuracy and low-latency spiking neural network. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13558–13567, 2020.
Kim et al. (2022a) Youngeun Kim, Yuhang Li, Hyoungseob Park, Yeshwanth Venkatesha, and Priyadarshini Panda. Neural architecture search for spiking neural networks. In European Conference on Computer Vision, pages 36–56. Springer, 2022a.
Kim et al. (2022b) Youngeun Kim, Yuhang Li, Hyoungseob Park, Yeshwanth Venkatesha, Ruokai Yin, and Priyadarshini Panda. Exploring lottery ticket hypothesis in spiking neural networks. In European Conference on Computer Vision, pages 102–120. Springer, 2022b.
Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Toronto, ON, Canada, 2009.
Kulkarni and Rajendran (2018) Shruti R. Kulkarni and Bipin Rajendran. Spiking neural networks for handwritten digit recognition—supervised learning and network optimization. Neural Networks, 103:118–127, 2018.
Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
Li et al. (2022a) Chen Li, Lei Ma, and Steve Furber. Quantization framework for fast spiking neural networks. Frontiers in Neuroscience, 16:918793, 2022a.
Li et al. (2023a) Chen Li, Edward Jones, and Steve Furber. Unleashing the potential of spiking neural networks by dynamic confidence. arXiv preprint arXiv:2303.10276, 2023a.
Li et al. (2017) Hongmin Li, Hanchao Liu, ** Shi. Cifar10-dvs: an event-stream dataset for object classification. Frontiers in neuroscience, 11:309, 2017.
Li et al. (2021a) Yuhang Li, Yufei Guo, Shanghang Zhang, Shikuang Deng, Yongqing Hai, and Shi Gu. Differentiable spike: Rethinking gradient-descent for training spiking neural networks. Advances in Neural Information Processing Systems, 34:23426–23439, 2021a.
Li et al. (2021b) Yi** Li, Han Zhou, Bangbang Yang, Ye Zhang, Zhaopeng Cui, Hujun Bao, and Guofeng Zhang. Graph-based asynchronous event processing for rapid object recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 934–943, 2021b.
Li et al. (2022b) Yuhang Li, Youngeun Kim, Hyoungseob Park, Tamar Geller, and Priyadarshini Panda. Neuromorphic data augmentation for training spiking neural networks. In European Conference on Computer Vision, pages 631–649. Springer, 2022b.
Li et al. (2023b) Yuhang Li, Tamar Geller, Youngeun Kim, and Priyadarshini Panda. Seenn: Towards temporal spiking early-exit neural networks. arXiv preprint arXiv:2304.01230, 2023b.
Loshchilov and Hutter (2016) Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
Lu and Sengupta (2020) Sen Lu and Abhronil Sengupta. Exploring the connection between binary and spiking neural networks. Frontiers in Neuroscience, 14:535, 2020.
Maass (1997) Wolfgang Maass. Networks of spiking neurons: the third generation of neural network models. Neural networks, 10(9):1659–1671, 1997.
Meng et al. (2022) Qingyan Meng, Mingqing Xiao, Shen Yan, Yisen Wang, Zhouchen Lin, and Zhi-Quan Luo. Training high-performance low-latency spiking neural networks by differentiation on spike representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12444–12453, 2022.
Merolla et al. (2014) P.A. Merolla, J.V. Arthur, R. Alvarez-Icaza, A.S. Cassidy, J. Sawada, F. Akopyan, B.L. Jackson, N. Imam, C. Guo, Y. Nakamura, and B. Brezzo. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.
Messikommer et al. (2020) Nico Messikommer, Daniel Gehrig, Antonio Loquercio, and Davide Scaramuzza. Event-based asynchronous sparse convolutional networks. In European Conference on Computer Vision, pages 415–431. Springer, 2020.
Neftci et al. (2019) Emre O Neftci, Hesham Mostafa, and Friedemann Zenke. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine, 36(6):51–63, 2019.
Orchard et al. (2015) Garrick Orchard, A**kya Jayawant, Gregory K Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in neuroscience, 9:437, 2015.
Pei et al. (2019) **g Pei, Lei Deng, Sen Song, Mingguo Zhao, Youhui Zhang, Shuang Wu, Guanrui Wang, Zhe Zou, Zhenzhi Wu, Wei He, et al. Towards artificial general intelligence with hybrid tianjic chip architecture. Nature, 572(7767):106–111, 2019.
Pfeiffer and Pfeil (2018) Michael Pfeiffer and Thomas Pfeil. Deep learning with spiking neurons: Opportunities and challenges. Frontiers in neuroscience, 12:774, 2018.
Putra and Shafique (2021) Rachmad Vidya Wicaksana Putra and Muhammad Shafique. Q-spinn: A framework for quantizing spiking neural networks. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021.
Rueckauer et al. (2017) Bodo Rueckauer, Iulia-Alexandra Lungu, Yuhuang Hu, Michael Pfeiffer, and Shih-Chii Liu. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in neuroscience, 11:682, 2017.
Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
Schaefer and Joshi (2020) Clemens JS Schaefer and Siddharth Joshi. Quantizing spiking neural networks with integers. In International Conference on Neuromorphic Systems 2020, pages 1–8, 2020.
Shrestha and Orchard (2018) Sumit B Shrestha and Garrick Orchard. Slayer: Spike layer error reassignment in time. Advances in neural information processing systems, 31, 2018.
Taherkhani et al. (2020) Aboozar Taherkhani, Georgina Cosma, and Thomas Martin McGinnity. Optimization of output spike train encoding for a spiking neuron based on its spatio–temporal input pattern. IEEE Transactions on Cognitive and Developmental Systems, 12(3):427–438, 2020.
Tang et al. (2021) Guangzhi Tang, Neelesh Kumar, Raymond Yoo, and Konstantinos Michmizos. Deep reinforcement learning with population-coded spiking neural network for continuous control. In Conference on Robot Learning, pages 2016–2029. PMLR, 2021.
Timcheck et al. (2023) Jonathan Timcheck, Sumit Bam Shrestha, Daniel Ben Dayan Rubin, Adam Kupryjanow, Garrick Orchard, Lukasz Pindor, Timothy Shea, and Mike Davies. The intel neuromorphic dns challenge. Neuromorphic Computing and Engineering, 3(3):034005, 2023.
Wu et al. (2022) Dengyu Wu, ** Yi, and Xiaowei Huang. A little energy goes a long way: Build an energy-efficient, accurate spiking neural network from convolutional neural network. Frontiers in neuroscience, 16, 2022.
Wu et al. (2023) Dengyu Wu, Gaojie **, Han Yu, ** Yi, and Xiaowei Huang. Optimising event-driven spiking neural network with regularisation and cutoff. arXiv preprint arXiv:2301.09522, 2023.
Wu et al. (2018) Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Lu** Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience, 12:331, 2018.
Wu et al. (2019) Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Yuan ** Shi. Direct training for spiking neural networks: Faster, larger, better. In Proceedings of the AAAI conference on artificial intelligence, pages 1311–1318, 2019.
Yao et al. (2021) Man Yao, Huanhuan Gao, Guangshe Zhao, Dingheng Wang, Yihan Lin, Zhaoxu Yang, and Guoqi Li. Temporal-wise attention spiking neural networks for event streams classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10221–10230, 2021.
Yao et al. (2022) Xingting Yao, Fanrong Li, Zitao Mo, and Jian Cheng. Glif: A unified gated leaky integrate-and-fire neuron for spiking neural networks. Advances in Neural Information Processing Systems, 35:32160–32171, 2022.
Yin et al. (2021) Bojian Yin, Federico Corradi, and Sander M Bohté. Accurate and efficient time-domain classification with adaptive spiking recurrent neural networks. Nature Machine Intelligence, 3(10):905–913, 2021.
Zenke and Ganguli (2018) Friedemann Zenke and Surya Ganguli. Superspike: Supervised learning in multilayer spiking neural networks. Neural computation, 30(6):1514–1541, 2018.
Zheng et al. (2021) Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li. Going deeper with directly-trained larger spiking neural networks. In Proceedings of the AAAI conference on artificial intelligence, pages 11062–11070, 2021.
Zhou et al. (2023) Zhaokun Zhou, Yuesheng Zhu, Chao He, Yaowei Wang, Shuicheng Yan, Yonghong Tian, and Li Yuan. Spikformer: When spiking neural network meets transformer. In The Eleventh International Conference on Learning Representations, 2023.
