TRIP: Trainable Region-of-Interest Prediction for Hardware-Efficient Neuromorphic Processing on Event-based Vision

This work was partially funded by research and innovation projects REBECCA (KDT JU under grant agreement No. 101097224), NeuroKIT2E (KDT JU under grant agreement No. 101112268), and NimbleAI (Horizon EU under grant agreement 101070679).
Corresponding author. Email: [email protected]
Code: https://github.com/ERNIS-LAB/TRIP

Cina Arjmand¹, Yingfu Xu¹, Kevin Shidqi¹, Alexandra F. Dobrita¹, Kanishkan Vadivel¹, Paul Detterer¹, Manolis Sifalakis¹, Amirreza Yousefzadeh² and Guangzhi Tang³,†
¹ imec, Eindhoven, The Netherlands
² EEMCS, University of Twente, Enschede, The Netherlands
³ DACS, Maastricht University, Maastricht, The Netherlands
Abstract

Neuromorphic processors are well-suited for efficiently handling sparse events from event-based cameras. However, they face significant challenges in the growth of computing demand and hardware costs as the input resolution increases. This paper proposes the Trainable Region-of-Interest Prediction (TRIP), the first hardware-efficient hard attention framework for event-based vision processing on a neuromorphic processor. Our TRIP framework actively produces low-resolution Regions-of-Interest (ROIs) for efficient and accurate classification. The framework exploits sparse events’ inherent low information density to reduce the overhead of ROI prediction. We introduced extensive hardware-aware optimizations for TRIP and implemented the hardware-optimized algorithm on the SENECA neuromorphic processor. We utilized multiple event-based classification datasets for evaluation. Our approach achieves state-of-the-art accuracies on all datasets and produces reasonable ROIs with varying locations and sizes. On the DvsGesture dataset, our solution requires $46\times$ less computation than the state-of-the-art while achieving higher accuracy. Furthermore, TRIP enables more than $2\times$ latency and energy improvements on the SENECA neuromorphic processor compared to the conventional solution.

I Introduction

Low-power and low-latency event-based vision is uniquely suited for edge applications. Given the efficiency of sensing, developing equally efficient processing becomes crucial for optimizing the performance of edge solutions. Since the event-based camera inherently generates sparse data, exploiting this sparsity is essential for enhancing the processing efficiency. Neuromorphic computing offers event-driven solutions to process sparse data streams efficiently, making it a natural fit for event-based vision [1, 2]. However, with the growing resolution of event-based cameras, neuromorphic computing faces computing and hardware cost challenges [1]. These challenges are further amplified when employing Convolutional Neural Networks (CNNs), as the computational expenses and on-chip memory demands for processing CNNs on neuromorphic processors increase with input resolution [3].

Figure 1: Overview of TRIP performing event-based vision classification on the SENECA neuromorphic processor.

To address the challenges of high-resolution visual processing, one approach is the hard attention algorithm, which selectively focuses on regions of an input image for processing [4, 5]. Compared to uniformly downsampling the entire input, the hard attention mechanism actively chooses regions of interest (ROI) with more critical information, improving accuracy while limiting the computing and memory costs of network processing. However, hard attention algorithms require an additional neural network to predict the ROI accurately. This demands sophisticated training methods and introduces additional overheads for visual processing. Therefore, the benefits gained from processing a reduced-dimension ROI can be offset by the high costs of ROI prediction [6]. The trade-off becomes particularly pronounced as the complexity of the scene increases, potentially negating the efficiency gains in ROI processing [7].

Interestingly, the inherent sparsity of event-based vision reduces the information density of scenes [8], which can potentially mitigate the hard attention overhead on ROI prediction. This characteristic opens up opportunities for efficient event-based vision processing, especially when hard attention is integrated with neuromorphic processors. By reducing input dimensionality, the hard attention algorithm can significantly reduce the computational and memory demands of CNNs on neuromorphic processors. Moreover, the event-driven processing further diminishes the latency and energy overhead associated with hard attention when utilizing CNNs with sparse activation [9]. This synergy opens prospects for tailoring hard attention algorithms on the neuromorphic processor.

In this paper, we propose the Trainable Region-of-Interest Prediction (TRIP) framework for hardware-efficient event-based vision processing on the neuromorphic processor. Our TRIP framework performs efficient ROI prediction with low-resolution event streams and supports end-to-end training by employing differentiable truncated Gaussian kernels (tGK) for ROI generation. We introduced hardware-aware optimizations for TRIP to improve the algorithm’s hardware efficiency without sacrificing accuracy. We implemented the hardware-optimized TRIP algorithm on the SENECA neuromorphic processor [10] and evaluated our method on event-based classification datasets [11, 12]. Our method achieves state-of-the-art accuracies while reducing the computation cost by $46\times$ compared to the state-of-the-art efficient algorithm [13]. Compared to neuromorphic solutions on Intel’s Loihi and IBM’s TrueNorth neuromorphic processors [3, 11, 14], our TRIP-based solution significantly reduces the area and energy consumption while having higher accuracy.

II Related Works

II-A Hard Attention Visual Processing

Hard attention strategies for restricting computations by directing image processing towards relevant regions of input space have long been explored in computer vision. Early models analyze low-level image features to predict regions of high saliency based on variations in pixel intensity [15]. Later works increasingly emphasized the task of salient region prediction as an action selection policy [5], iteratively improving predictions over time. Reinforcement learning (RL) algorithms have been adopted in hard attention to learn the optimal policy for placing a sensor with limited bandwidth on a given input region [7, 6, 16]. While RL-based hard attention is effective, the training complexity poses major challenges for hardware-aware optimization.

Deep Recurrent Attention Writer (DRAW) uses recurrent units within a variational autoencoder to iteratively predict salient regions of input images [17]. Importantly, DRAW uses a differentiable mechanism for generating an ROI with Gaussian kernels, enabling end-to-end backpropagation training without using RL. Neuromorphic DRAW applied the differentiable cropping of DRAW to event-based classification tasks to improve accuracy by filtering out irrelevant events [18]. Our TRIP framework leverages DRAW’s Gaussian kernels to facilitate differentiable hard attention while introducing hardware-efficient algorithm designs for event-based vision processing on neuromorphic processors.

II-B Event-based CNN and SENECA Neuromorphic Processor

Event-based CNN, trained by specialized activation regularization methods, has high activation sparsity within each network layer [19]. SENECA is a multi-core embedded digital neuromorphic processor specialized in processing event-based CNNs [10]. It performs event-driven computation that exploits the sparsity in sensory inputs and network activations. Additionally, it executes data-flow processing across cores, increasing the parallelism of network processing and diminishing the memory cost for neural activations. Event-driven depth-first convolution is a unique scheduling method SENECA supports for event-based CNNs [2]. The method prioritizes the network’s layer dimension by consuming neural activation events right after their generation. Therefore, it maximizes the neuromorphic processor’s benefits on parallelism and latency. Our TRIP framework with event-based CNN maximizes the hardware efficiency of hard attention on SENECA by exploiting the hardware advantages.

III Method

III-A TRIP: Trainable Region-of-Interest Prediction

Figure 2: Processing pipeline of TRIP for the event-based gesture recognition task. The downsampled events are fed into the ROI prediction event-based CNN to predict the ROI parameters. The ROI generation module uses the parameters to create the ROI fed into the classification event-based CNN. $H_t$ is the output of the ReLU recurrent unit, and $P_t$ is the output for processing events in timebin $t$.

We propose the Trainable Region-of-Interest Prediction (TRIP) framework for efficient event-based classification. It uses hard attention within an event-driven neuromorphic processing pipeline. The framework efficiently classifies event streams using an actively generated ROI that is predicted from the input events. An ROI’s receptive field covers a small region of the event-based camera’s field of view. As shown in Figure 2, our TRIP framework consists of three sequential components: ROI prediction, ROI generation, and classification. The ROI prediction component consists of an event-based CNN that determines the location and receptive field of the ROI. It predicts the ROI parameters using a downsampled low-resolution input, reducing the processing overhead of ROI prediction. The ROI generation component generates the cropped ROI using the predicted parameters. It uses an $N \times N$ grid of differentiable truncated Gaussian kernels (tGK) to produce a fixed $N \times N$ output from a varying-size receptive field. This ensures a consistently low processing cost for classification. Moreover, we introduce dynamic average pooling (DAP) to replace tGK for efficient inference on the embedded neuromorphic processor. The classification component consists of an event-based CNN that performs classification on the ROI. The entire framework is differentiable, allowing it to be trained end-to-end. For efficient computing on SENECA, we increase the activation sparsity of the event-based networks during training.

III-B ROI Prediction

The ROI prediction component produces the ROI parameters based on the downsampled input events from max-pooling. The ROI prediction network outputs three scalar values. These values are decoded to determine the ROI location and receptive field as follows,

$$g_x = \frac{A}{2} \cdot (\tanh(\hat{g_x}) + 1) \tag{1}$$
$$g_y = \frac{B}{2} \cdot (\tanh(\hat{g_y}) + 1) \tag{2}$$
$$\delta = S \cdot (\mathrm{sigmoid}(\hat{\delta}) + 1) \tag{3}$$

where $\hat{g_x}, \hat{g_y}, \hat{\delta}$ are the raw scalar outputs from the ROI prediction network, $(g_x, g_y)$ is the center location of the overall receptive field formed by the $N \times N$ grid of tGK, $\delta$ is the distance between two adjacent tGK, $A$ and $B$ are the image width and height, and $S$ is a distance scaling factor. The $g_x$ and $g_y$ parameters are always initialized at the center of the image at the start of training and can move across the entire input space. The variable $\delta$ allows control over the size of the receptive field of the ROI.
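The decoding in Eqs. (1)-(3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the example values for $A$, $B$, and $S$ are assumptions chosen for the sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decode_roi_params(gx_hat, gy_hat, delta_hat, width, height, scale):
    """Decode raw network outputs into ROI parameters, following Eqs. (1)-(3).

    width, height, and scale play the roles of A, B, and S in the text.
    """
    gx = width / 2.0 * (math.tanh(gx_hat) + 1.0)    # Eq. (1): center x in [0, A]
    gy = height / 2.0 * (math.tanh(gy_hat) + 1.0)   # Eq. (2): center y in [0, B]
    delta = scale * (sigmoid(delta_hat) + 1.0)      # Eq. (3): spacing in [S, 2S]
    return gx, gy, delta
```

With all three raw outputs at zero, the decoded center is the image center $(A/2, B/2)$ and the spacing is $1.5\,S$, since $\tanh(0) = 0$ and $\mathrm{sigmoid}(0) = 0.5$.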

III-C ROI Generation

We employed tGK for ROI generation, an efficient variation of the method introduced in DRAW [17]. The ROI generation component generates the $N \times N \times 2$ fixed-resolution input ROI event streams for classification, in which $N$ is the width and height of the ROI and $2$ is the polarity channel. The component uses $N \times N$ differentiable tGK to compute the ROI during training. The 2D mean positions of the tGK are computed according to the predicted center location of the overall receptive field as follows,

$$\mu_x^i = g_x + (i - \frac{N}{2} - 0.5) \cdot \delta,\ i \in [0, N-1] \tag{4}$$

where $\mu_x^i$ is the mean x-axis position of the tGK on the $i^{th}$ column. The mean y-axis position of the tGK can be computed using the same equation with $g_y$. Eventually, each tGK has a two-dimensional mean position $(\mu_x^i, \mu_y^j)$, in which $i$ and $j$ are the column and row index. Here, we assume $N$ is an even number.

By knowing the mean positions of the tGK, we compute the weight of the tGK corresponding to each pixel location of the input events as follows,

$$F_x^i[n] = \begin{cases} \exp\left(-\frac{(n-\mu_x^i)^2}{2\sigma}\right) & \text{for } n \in [\mu_x^i - \frac{\theta}{2},\ \mu_x^i + \frac{\theta}{2}] \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

where $F_x^i[n]$ is the x-dimension weight component of tGK on the $i^{th}$ column corresponding to pixel locations at column $n$, $\sigma$ is the variance which is a pre-defined parameter, and $\theta$ is the size of the tGK with non-zero weights. The y-dimension weight component is computed in a similar manner.

Each ROI input event value to the classification network is computed by the corresponding tGK as follows,

$$v_{(x_i,y_j)} = \mathbf{F_x^i} \cdot \mathbf{I} \cdot \mathbf{F_y^j} \tag{6}$$

where $v_{(x_i,y_j)}$ is the value of the event at location $(x_i, y_j)$ of the $N \times N \times 2$ input to the classification network, $\mathbf{F_x^i}$ and $\mathbf{F_y^j}$ are the weights for the corresponding tGK, and $\mathbf{I}$ are the binned raw input events from one polarity. The two polarity channels of the classification inputs are computed using the same equation.

Compared to Gaussian kernels introduced by DRAW, our tGK significantly reduces the computation required for ROI generation while remaining differentiable. Specifically, our adoption of tGK reduces the computational complexity of ROI generation from $O(AB)$ to $O(\theta^2)$ by skipping the pixel locations with insignificant weights. Since $\theta$ is at least ten times smaller than $A$ and $B$ in practice, the tGK can be orders of magnitude more efficient than Gaussian kernels.
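The saving from truncation can be illustrated with a small Python sketch of Eqs. (5) and (6) for a single ROI pixel. It is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the exponent is taken to be negative (the usual decaying Gaussian form, with $\sigma$ used as the variance), and the input frame is indexed as `frame[row][col]`.

```python
import math


def tgk_weights(mu, sigma, theta, length):
    """Sparse 1-D truncated Gaussian weights as {pixel_index: weight}.

    Only pixels n with mu - theta/2 <= n <= mu + theta/2 receive a weight,
    mirroring the support in Eq. (5); all other pixels are implicitly zero.
    """
    lo = max(0, math.ceil(mu - theta / 2.0))
    hi = min(length - 1, math.floor(mu + theta / 2.0))
    # Assumption: decaying Gaussian (negative exponent), sigma is the variance.
    return {n: math.exp(-((n - mu) ** 2) / (2.0 * sigma))
            for n in range(lo, hi + 1)}


def roi_value(frame, fx, fy):
    """Eq. (6) for one ROI pixel, summing over the truncated window only.

    frame[row][col] holds the accumulated events of one polarity.
    """
    return sum(wx * wy * frame[row][col]
               for col, wx in fx.items()
               for row, wy in fy.items())
```

Because `fx` and `fy` each hold at most about $\theta + 1$ entries, the double loop touches on the order of $\theta^2$ pixels per ROI value instead of all $A \cdot B$ pixels of the frame.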

III-D Hardware-Efficient Dynamic Average Pooling

The tGK can be accelerated by customized application-specific integrated circuit (ASIC) designs. However, its efficiency is hard to achieve on the embedded CPU within our targeted neuromorphic processor. Though the number of computations is small, the overheads of locating the non-zero elements and performing weighted operations for tGK are substantial. Firstly, assigning each input event to multiple overlapping kernels requires a complex implementation to avoid iterating all kernels, introducing significant overhead to the instruction memory. Secondly, the distance between the kernel center and the event location must be computed for each assigned event, bringing additional overhead on computation. Hence, tGK on an embedded core with limited instruction memory and compute capability is not feasible.

To mitigate the problem, we introduce Dynamic Average Pooling (DAP) as a hardware-efficient alternative for ROI generation during inference on the embedded neuromorphic processor. The DAP replaces the Gaussian kernels with simple non-overlapping average poolings. The kernel size of the average pooling changes dynamically based on the size of the overall receptive field the ROI corresponds to. We compute the range of the overall receptive field using the ROI parameters from the ROI prediction component as follows,

$$x_{max} = g_x + (\frac{N}{2} - 0.5) \cdot \delta + \frac{\theta}{2} \tag{7}$$
$$x_{min} = g_x - (\frac{N}{2} + 0.5) \cdot \delta - \frac{\theta}{2} \tag{8}$$

and the dynamic kernel size of each average pooling in the DAP is computed as follows,

$$k_{DAP} = (x_{max} - x_{min}) / N \tag{9}$$

where $x_{max}$ and $x_{min}$ define the range of the receptive field on the x-axis of the raw input space. The range on the y-axis can be computed using the same equations with $g_y$ and the receptive field is square.

The embedded implementation of DAP is simple. A closed-form equation exists to compute the sole corresponding kernel of each input event, and the ROI generation is distance invariant. However, the ROI parameters in DAP are not differentiable. Therefore, we first train the networks with tGK, and then fine-tune the classification network with a fixed ROI prediction network and DAP.
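To make the DAP bookkeeping concrete, the following Python sketch evaluates Eqs. (7)-(9) as written and then assigns an event to its pooling window. The floor-based lookup in `dap_bin` is only an assumption about what a closed-form assignment could look like; the text does not give the exact expression, and all function names are hypothetical.

```python
import math


def dap_range(g, delta, theta, n):
    """Receptive-field range along one axis, following Eqs. (7) and (8)."""
    x_max = g + (n / 2.0 - 0.5) * delta + theta / 2.0
    x_min = g - (n / 2.0 + 0.5) * delta - theta / 2.0
    return x_min, x_max


def dap_kernel_size(x_min, x_max, n):
    """Eq. (9): width of each non-overlapping pooling window."""
    return (x_max - x_min) / n


def dap_bin(x, x_min, k_dap, n):
    """Hypothetical closed-form lookup of the single window an event falls in.

    Returns None when the event lies outside the receptive field.
    """
    idx = math.floor((x - x_min) / k_dap)
    return idx if 0 <= idx < n else None
```

Each event needs one subtraction, one division, and one floor per axis, with no per-kernel distance computation, which is the property that makes this style of pooling attractive on a small embedded core.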

III-E Hardware-Aware Event-based CNN

The event-driven neuromorphic processor exploits activation sparsities in neural networks by only processing non-zero activations. Therefore, the input to each layer for synaptic operations should be as sparse as possible. To maximize the efficiency of TRIP on the event-driven neuromorphic processor, we adopt event-based CNNs for ROI prediction and classification. Unlike a regular CNN, our event-based convolutional layer for the neuromorphic processor performs BatchNorm and MaxPool before the ReLU activation, outputting sparse events straight to the subsequent layer for synaptic integration. Moreover, we use the ReLU function in the vanilla RNN for sparse recurrent processing. Furthermore, we perform hardware-aware optimizations on the event-based CNNs. The optimizations comprise sparsity-aware and quantization-aware training, reducing the projected computation cost and memory requirement on the hardware.

To increase the activation sparsity of event-based CNNs in TRIP, we adopt the $L1$ regularization loss [19] on the activation values of the layers that have ReLU as the activation function. The loss encourages the network to reduce the activation values so as to have fewer non-zero activations and increase the sparsity. Additionally, we use quantization-aware training to reduce weight precision to 4 bits [2]. There is a shared power-of-two scaling factor $s$ for all the weights of the same layer. During on-chip computation, a weight value is obtained by multiplying the saved 4-bit integer with $2^s$. The quantized parameters reduce the on-chip memory required for network deployments and the computation cost of synaptic integration.
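The shared power-of-two scaling can be illustrated with a short Python sketch. It shows only the weight representation (a 4-bit signed integer times $2^s$), not quantization-aware training itself; how $s$ is chosen and how values are rounded are assumptions made for this illustration, and the function names are hypothetical.

```python
import math


def quantize_layer(weights, bits=4):
    """Quantize one layer's weights to signed integers with a shared 2**s scale.

    Returns (ints, s) such that each weight is approximated by ints[i] * 2**s.
    The choice of s and the rounding rule are illustrative assumptions.
    """
    q_max = 2 ** (bits - 1) - 1     # 7 for 4-bit signed integers
    q_min = -(2 ** (bits - 1))      # -8 for 4-bit signed integers
    w_abs = max(abs(w) for w in weights)
    if w_abs == 0.0:
        return [0] * len(weights), 0
    # Smallest exponent whose integer range still covers the largest weight.
    s = math.ceil(math.log2(w_abs / q_max))
    ints = [min(q_max, max(q_min, round(w / 2.0 ** s))) for w in weights]
    return ints, s


def dequantize(ints, s):
    """Recover weight values as integer * 2**s."""
    return [q * 2.0 ** s for q in ints]
```

For example, the weights `[0.5, -0.25, 0.125, 0.0]` map to the integers `[4, -2, 1, 0]` with `s = -3`. Because the scale is a power of two, the multiplication by $2^s$ can be realized as a bit shift on integer hardware.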

TABLE I: Performance comparisons on the DvsGesture dataset.

| Architecture | Input Resolution | Param | Effective MACs (Single Timebin) | Accuracy [%] (mean ± std; maximum) |
|---|---|---|---|---|
| LSTM [20] | 32×32 | 7.4M | 3.9M | 86.8 |
| AlexNet+LSTM [21] | 128×128 | 8.3M | 601.3M | 97.7 |
| CNN+EGRU [13] | 128×128 | 4.8M | 80.6M | 97.3 ± 0.4; 97.8 |
| ConvLIAF [22] | 32×32 | 0.22M | 113.3M | 97.6 |
| TRIP (Ours) | 16×16 + 12×12 | 0.46M | 1.75M | 97.6 ± 0.5; 98.6 |
Figure 3: Visualization of ROI’s receptive fields for different gestures in the DvsGesture dataset. The receptive fields include the pixels involved in the ROI generation. They are superimposed on top of the timebinned event streams as yellow rectangles.

IV Experiment and Results

We benchmarked the performance of TRIP on challenging event-based classification datasets and with the SENECA neuromorphic processor. Firstly, we experimented on the widely used DvsGesture dataset [11] to demonstrate the effectiveness of TRIP in terms of accuracy, model size, and algorithmic computing cost. Secondly, we used the Marshalling Signals gesture recognition dataset [12] to evaluate TRIP’s robustness towards samples at varying distances from the event-based camera. Our results demonstrate that the ROI prediction is dynamically adaptable to varying distances. Thirdly, we synthetically generated a noisy, high-resolution event-based dataset based on N-MNIST [23] with digits in varying sizes and locations. Using this dataset, we validated the effectiveness and overhead of TRIP compared to baselines with similar cost or accuracy. Finally, we implemented our hardware-optimized TRIP algorithm on the SENECA neuromorphic processor and measured the energy, latency, and effective area of the solution.

IV-A DvsGesture Dataset

Gesture recognition is an ideal task for evaluating approaches with hard attention, as a compact region of the input can provide sufficient information for classification. The DvsGesture dataset [11] enables us to compare our method with other state-of-the-art solutions on the task of gesture recognition with an event-based camera.

IV-A1 Dataset and Network Overview

The dataset is recorded using the DVS128 event-based camera with $128 \times 128$ resolution. It consists of 11 gesture classes in 1176 training sessions and 288 testing sessions. Each session includes a subject repeatedly performing the same gesture. We preprocessed each session using SpikingJelly [24] into an event sample of 32 timebins, in which the events from the same pixel location are accumulated together within each timebin. We performed data augmentation during training to randomly scale, rotate, and spatially shift training samples. We downsampled the input resolution to $16 \times 16$ for ROI prediction. The ROI prediction network comprises three convolutional layers, a ReLU recurrent layer, and an output layer. We used $12 \times 12$ tGKs to generate the ROI input for classification. The classification network comprises two convolutional layers, a fully-connected hidden layer, and an output layer.
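The preprocessing described above can be sketched in plain Python. This is not the SpikingJelly API; it is a hypothetical stand-alone illustration that assumes equal-duration timebins and events given as `(t, x, y, p)` tuples, followed by non-overlapping max-pooling (a factor of 8 takes $128 \times 128$ to $16 \times 16$).

```python
def bin_events(events, t_start, t_end, num_bins, width, height):
    """Accumulate events into per-timebin count frames.

    events: iterable of (t, x, y, p) with polarity p in {0, 1}.
    Returns frames[bin][p][y][x] holding event counts.
    Assumption: bins have equal duration over [t_start, t_end).
    """
    frames = [[[[0] * width for _ in range(height)] for _ in range(2)]
              for _ in range(num_bins)]
    span = t_end - t_start
    for t, x, y, p in events:
        if not (t_start <= t < t_end):
            continue  # ignore events outside the session window
        b = min(num_bins - 1, int((t - t_start) * num_bins / span))
        frames[b][p][y][x] += 1
    return frames


def max_pool(frame, factor):
    """Downsample one 2-D frame by non-overlapping max-pooling."""
    h, w = len(frame) // factor, len(frame[0]) // factor
    return [[max(frame[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor))
             for c in range(w)] for r in range(h)]
```

Calling `max_pool(frames[b][p], 8)` on a $128 \times 128$ frame yields the $16 \times 16$ input used for ROI prediction, while the full-resolution frames remain available for ROI generation.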

IV-A2 Results

We compared accuracies, number of parameters, and effective MAC operations with other state-of-the-art methods in Table I. The effective MAC counts the averaged non-zero multiply-accumulate operations within all components of TRIP for processing a single timebin of the event stream, reflecting the computing cost on event-driven neuromorphic processors. Our TRIP framework achieves state-of-the-art accuracy while reducing the effective MAC by $46\times$ compared to the lowest among the other state-of-the-art approaches. TRIP achieves tremendous computational efficiency gains through two key differentiators: firstly, by operating on considerably lower input resolutions compared to other CNN-based methods, and secondly, by utilizing less complex network architectures while processing a reduced input space with less irrelevant information.
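As a rough illustration of how an effective-MAC count depends on activation sparsity, the sketch below counts only the operations triggered by non-zero inputs. The counting rules are simplifying assumptions (stride 1, border effects ignored), not the exact accounting used for Table I, and the function names are hypothetical.

```python
def dense_effective_macs(num_nonzero_inputs, out_features):
    """Each non-zero input activation triggers one MAC per output neuron."""
    return num_nonzero_inputs * out_features


def conv_effective_macs(num_nonzero_inputs, kernel_size, out_channels):
    """Event-driven convolution: each non-zero input updates up to
    kernel_size**2 positions in every output channel (borders ignored)."""
    return num_nonzero_inputs * kernel_size * kernel_size * out_channels


def reduction_factor(baseline_macs, ours_macs):
    """Ratio between two effective-MAC budgets."""
    return baseline_macs / ours_macs
```

Using the figures from Table I, `reduction_factor(80.6e6, 1.75e6)` evaluates to roughly 46, which matches the reduction relative to CNN+EGRU [13] quoted above.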

We visualized the receptive fields used for generating ROIs for classification in Figure 3. The visualization helps interpret the decision process of TRIP and further explains the reason behind TRIP’s efficiency advantage. By visually inspecting the samples, we can see that the ROI prediction network learns to track the gestures and focus on salient regions of the input space. For example, in the “left hand clockwise” gesture, the ROI’s receptive field tracks the arm’s movement, making it easier for the classification network to make a decision.

IV-B Marshalling Signals Dataset

The Marshalling Signals dataset [12] is more recent, less explored, and more difficult than DvsGesture. The dataset presents gestures at multiple distances from the event-based camera. Therefore, it allows us to further test the ROI prediction, particularly its ability to adjust to varying sizes.

IV-B1 Dataset and Network Overview

The Marshalling Signals dataset [12] is recorded using the DAVIS 346 event-based camera with $346 \times 224$ resolution. It contains 10 gesture classes in 11,040 training samples and 930 testing samples. Each sample is one gesture presented in a 960 ms timebin. Each gesture is presented at 8 evenly spaced distances from the camera, ranging from 1.5 m to 4.5 m. We adopt the same network architectures as in the DvsGesture task, with a higher-dimension ReLU recurrent layer in the ROI prediction network. We downsampled the input resolution to $43 \times 28$ for ROI prediction and used $12 \times 12$ tGKs for ROI generation.

IV-B2 Results

We compared the performance of our model with the previous results in Table II. Since [12] uses regular CNN architectures, we used FLOPs as an efficiency metric, without considering the activation sparsities in our event-based CNNs. Our TRIP framework achieves better accuracy while reducing the FLOPs by $18\times$ compared to EfficientNet [25]. By visualizing ROI’s receptive fields for different distances in Figure 4, we show that the ROI prediction can adjust the ROI size for classification to include only the relevant region of the input space.

Figure 4: Visualization of ROI’s receptive fields for gestures performed at different distances in the Marshalling Signals dataset.
TABLE II: Performance comparisons on the Marshalling Signals dataset.

| Architecture | Param | FLOPs | Accuracy [%] |
|---|---|---|---|
| ResNet18 [12] | 11.7M | 1810M | 74.6 |
| EfficientNet-B1 [12] | 7.794M | 690M | 82.6 |
| TRIP (Ours) | 4.13M | 37.0M | 83.6 |

IV-C Synthetic Dataset based on N-MNIST

To study the effects of the reduced input resolutions and the hard attention overheads in a controlled setup, we synthetically generated a dataset based on the N-MNIST dataset [23]. The generated dataset enables us to test the performance of TRIP under different input resolutions and structured event noises.

IV-C1 Dataset Generation and Network Overview

We generated the synthetic N-MNIST dataset by placing randomly scaled event streams of $34 \times 34$ resolution N-MNIST digits at arbitrary locations on a $128 \times 128$ canvas. The scaling factor for each sample is randomly selected between 1 and 2. We add structured event noises by randomly selecting 8 other digits from the dataset, cropping a random $8 \times 8$ subsection of each digit, and placing the subsections in random locations on the canvas. Figure 5 shows some examples of the generated samples. The synthetic dataset has the same number of samples as the original N-MNIST dataset, including 60,000 training and 10,000 testing samples.
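The generation procedure can be sketched as follows. This is a hypothetical reconstruction for illustration only: the event format `(t, x, y, p)`, the coordinate-stretching form of scaling, and the function name `make_sample` are assumptions, and the actual generation code may differ.

```python
import random


def make_sample(digit_events, noise_digits, rng, canvas=128, digit_size=34,
                num_noise=8, patch=8):
    """Build one synthetic sample as a list of (t, x, y, p) events.

    digit_events: events of the target digit on a digit_size x digit_size grid.
    noise_digits: list of event lists from which noise patches are cropped.
    """
    scale = rng.uniform(1.0, 2.0)
    size = int(digit_size * scale)
    ox, oy = rng.randint(0, canvas - size), rng.randint(0, canvas - size)
    # Assumption: scaling stretches event coordinates (no interpolation).
    sample = [(t, ox + int(x * scale), oy + int(y * scale), p)
              for t, x, y, p in digit_events]
    for _ in range(num_noise):
        other = rng.choice(noise_digits)
        # Top-left corner of the crop inside the source digit.
        cx = rng.randint(0, digit_size - patch)
        cy = rng.randint(0, digit_size - patch)
        # Top-left corner of the patch on the canvas.
        nx = rng.randint(0, canvas - patch)
        ny = rng.randint(0, canvas - patch)
        sample += [(t, nx + x - cx, ny + y - cy, p)
                   for t, x, y, p in other
                   if cx <= x < cx + patch and cy <= y < cy + patch]
    return sample
```

Passing a seeded `random.Random` instance as `rng` keeps the generated samples reproducible across runs.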

We used the same network architectures in TRIP as the DvsGesture task but with only 2 convolutional layers for the ROI prediction network. We used $12 \times 12$ tGKs for ROI generation. The baseline networks have the same number of layers as TRIP, which comprises 4 convolutional layers, a ReLU recurrent layer, a fully-connected layer, and an output layer. We tested different input resolutions for the baselines and TRIP’s ROI prediction, including $16 \times 16$, $32 \times 32$, and $64 \times 64$. The baseline networks have varying layer dimensions based on the input resolution.

IV-C2 Results

We compared the performance of our model with the baseline models at different input resolutions in Table III. Compared with the baseline networks that use one level higher input resolution (16×16 → 32×32 and 32×32 → 64×64), TRIP achieves higher or similar accuracies with fewer FLOPs. This shows that the low input resolution TRIP requires to maintain high accuracy compensates for the hard attention overheads introduced by ROI prediction and generation. Moreover, the visualization results in Figure 5 show that the ROI prediction network can handle inputs with structured noise that shares similar features with the digits and is hard to differentiate at low resolution.

Figure 5: Example of synthetic N-MNIST samples (from left to right: digit 7, 3, and 0), showing ROI generated by network.
TABLE III: Performance comparisons on synthetic N-MNIST dataset.
Architecture Param FLOPs Accuracy [%] (mean ± std)
Baseline (16×16) 0.31M 6.0M 71.8 ± 2.3
Baseline (32×32) 0.67M 24.4M 93.0 ± 0.6
Baseline (64×64) 0.67M 57.4M 96.2 ± 0.9
TRIP (16×16) 0.30M 16.0M 95.4 ± 0.4
TRIP (32×32) 0.65M 28.0M 96.1 ± 0.3
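The "one level higher" comparison can be made explicit with a short calculation on the Table III values. The snippet below is only a reading aid; the dictionary layout and the `compare` helper are assumptions of this example.

```python
# (FLOPs in millions, mean accuracy in %) copied from Table III
table3 = {
    "Baseline (16x16)": (6.0, 71.8),
    "Baseline (32x32)": (24.4, 93.0),
    "Baseline (64x64)": (57.4, 96.2),
    "TRIP (16x16)": (16.0, 95.4),
    "TRIP (32x32)": (28.0, 96.1),
}

def compare(trip, baseline):
    """Return (FLOPs reduction factor, accuracy difference in percentage points)."""
    trip_flops, trip_acc = table3[trip]
    base_flops, base_acc = table3[baseline]
    return base_flops / trip_flops, trip_acc - base_acc

# Each TRIP variant is paired with the baseline one resolution level higher.
for trip, baseline in [("TRIP (16x16)", "Baseline (32x32)"),
                       ("TRIP (32x32)", "Baseline (64x64)")]:
    ratio, delta = compare(trip, baseline)
    print(f"{trip} vs {baseline}: {ratio:.2f}x fewer FLOPs, {delta:+.1f} points")
```

With these numbers, TRIP at 16×16 uses roughly 1.5× fewer FLOPs than the 32×32 baseline at higher accuracy, and TRIP at 32×32 uses roughly 2× fewer FLOPs than the 64×64 baseline at nearly the same accuracy.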

IV-D Neuromorphic Processor Deployment

TABLE IV: Comparison with state-of-the-art neuromorphic implementations on the DvsGesture dataset.
Single Timebin Multiple Timebins
Hardware Solutions Technology Core Area Latency E_inf Accuracy Latency E_inf Accuracy
[#] [mm²] [ms] [µJ] [%] [ms] [µJ] [%]
Loihi [26] Spiking CNN [3] Intel 14 nm >20 >8.20 11 89.6
Loihi [26] Spiking CNN [14] Intel 14 nm 59 24.19 22.0 2731 96.2
TrueNorth [27] Spiking CNN [11] Samsung 28 nm 3838 383.8 91.8 104.6 18702 94.6
SENECA [10] Event-based CNN GF FDX 22 nm 7 3.29 78.9 1069.2 97.3
SENECA [10] TRIP GF FDX 22 nm 9 4.23 2.7 35.86 91.1 25.8 430.32 98.3

To accurately assess the hardware efficiency of TRIP, we implemented our hardware-optimized TRIP algorithm on the SENECA neuromorphic processor [10]. To compare with state-of-the-art neuromorphic solutions on event-based vision, we used the DvsGesture dataset [11] to benchmark the performance of our solution in terms of accuracy, latency, energy consumption, and hardware’s effective area.

IV-D1 Hardware-optimized TRIP

The hardware-aware optimizations for TRIP include sparsity-aware training on event-based CNNs, quantization-aware training to obtain low-precision network parameters, and the use of DAP for ROI generation. The optimization process comprises three steps. First, we performed sparsity-aware training on the pre-trained networks in TRIP to reduce the number of activations in the event-based CNNs. Second, we conducted incremental quantization-aware training on the ROI prediction network, which iteratively quantizes and trains each layer with the straight-through gradient estimator: the training freezes the optimally quantized layers and trains the remaining layers. Third, we substituted the truncated Gaussian kernels with DAP and fine-tuned the classification network with incremental quantization-aware training. We quantized all network parameters to 4 bits. The hardware-aware optimizations have minimal influence on accuracy: the best model reaches 98.3% accuracy on the DvsGesture dataset, only a 0.3% reduction compared to the best model without hardware-aware optimizations.
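To illustrate the second step, the following minimal Python sketch shows 4-bit fake quantization with a straight-through estimator on a single linear neuron. The symmetric per-tensor scale, the squared-error loss, and the names `quantize` and `ste_step` are assumptions made for this example; they are not taken from the TRIP implementation.

```python
def quantize(weights, bits=4):
    """Symmetric uniform fake quantization: snap each weight to a signed integer grid."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit weights
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]

def ste_step(weights, x, target, lr=0.1, bits=4):
    """One SGD step on y = sum(q(w) * x) with loss (y - target)**2.

    The forward pass uses the quantized weights q(w). The backward pass
    treats dq/dw as 1 (straight-through), so the gradient computed for q(w)
    is applied directly to the latent full-precision weights.
    """
    wq = quantize(weights, bits)
    y = sum(w * xi for w, xi in zip(wq, x))
    grad_y = 2.0 * (y - target)
    return [w - lr * grad_y * xi for w, xi in zip(weights, x)]

latent = [0.7, -0.2]
latent = ste_step(latent, x=[1.0, 1.0], target=1.0)
print(latent, quantize(latent))
```

In an incremental scheme such as the one described above, a step like this would be applied to one layer at a time, and a layer would be frozen once its quantized weights are accepted.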

IV-D2 Hardware Implementation and Benchmarking

The hardware-optimized TRIP algorithm is implemented on 9 SENECA cores. The ROI prediction network is mapped onto 4 cores: 3 cores for the 3 convolutional layers (C1, C2, C3) and one core that fuses the ReLU recurrent layer and the output layer (C4). The ROI generation with DAP uses a single core (C5). The classification network is mapped onto 4 cores: 2 cores for the 2 convolutional layers (C6, C7), one core for the fully-connected layer (C8), and one core for the output layer (C9). Our convolutional layer implementation on SENECA adopts event-driven depth-first convolution [2] and fuses Convolution, BatchNorm, MaxPool, and ReLU on a single core. To improve latency, we parallelized the three components of TRIP on SENECA, as shown in Figure 6. As a result, classification is performed on the ROI prediction network’s output from the previous timebin’s inputs. We experimentally observed that this approach does not affect the test accuracy. To measure the improvement on SENECA, we created an event-based CNN baseline with a similar number of parameters to the two TRIP networks combined but with a higher input resolution (32×32).

Figure 6: SENECA cores processing pipeline for TRIP
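The effect of parallelizing the three components can be sketched as a one-timebin delay between ROI prediction and classification. The Python below is a functional model only, not SENECA code; the callables `predict_roi`, `generate_roi`, and `classify`, as well as the `initial_roi` argument, are hypothetical placeholders, and applying the delayed ROI to the current timebin is an assumption of this sketch.

```python
def run_sequential(timebins, predict_roi, generate_roi, classify):
    """Reference ordering: predict the ROI, then crop and classify, per timebin."""
    outputs = []
    for frame in timebins:
        roi = predict_roi(frame)
        outputs.append(classify(generate_roi(frame, roi)))
    return outputs

def run_pipelined(timebins, predict_roi, generate_roi, classify, initial_roi):
    """Pipelined ordering: classification uses the ROI predicted one timebin earlier."""
    outputs = []
    roi = initial_roi                       # no prediction exists before the first timebin
    for frame in timebins:
        # On hardware the next two lines would run concurrently on different cores.
        outputs.append(classify(generate_roi(frame, roi)))
        roi = predict_roi(frame)
    return outputs

# Toy stand-ins that make the one-timebin delay visible.
identity = lambda v: v
pair = lambda frame, roi: (frame, roi)
print(run_sequential([1, 2, 3], identity, pair, identity))
print(run_pipelined([1, 2, 3], identity, pair, identity, initial_roi=0))
```

The pipelined variant removes the dependency of classification on the current ROI prediction, so the components no longer have to wait for each other within a timebin.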

IV-D3 Results

We benchmarked TRIP on SENECA against state-of-the-art neuromorphic solutions [3, 11, 14] in Table IV. We present single- and multiple-timebin results to enable comparison with multiple solutions. Single-timebin results derive from inferring one timebin of the input event stream, while multiple-timebin results reflect maximum accuracy. Our solution outperforms the others in nearly every metric of the benchmark. Notably, we achieved 46× better energy efficiency and an 88× area improvement compared to the spiking CNN solution on TrueNorth. One Loihi-based solution shows better latency than ours. However, Loihi’s high parallelism results in large areas that are unscalable for high-resolution inputs. Additionally, TRIP halves the error rate and improves energy consumption relative to that solution. Moreover, TRIP improves latency and energy by more than 2× compared to our baseline on SENECA. Because the area is computed by multiplying the number of cores used by the area per core, TRIP has a higher area cost, as it uses two more SENECA cores than the baseline.
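The comparison with the SENECA baseline can be checked directly from the Table IV entries. In the sketch below, assigning the baseline row to the multiple-timebin columns is an assumption, since the flattened table does not mark empty cells; the dictionary keys and helper names are likewise illustrative.

```python
# Values copied from Table IV (multiple-timebin columns assumed for both rows).
baseline = {"cores": 7, "area_mm2": 3.29, "latency_ms": 78.9, "energy_uj": 1069.2}
trip = {"cores": 9, "area_mm2": 4.23, "latency_ms": 25.8, "energy_uj": 430.32}

def area_per_core(row):
    """Area scales with core count, so both rows should give the same value."""
    return row["area_mm2"] / row["cores"]

def improvement(metric):
    """Factor by which TRIP improves on the baseline for a lower-is-better metric."""
    return baseline[metric] / trip[metric]

print(f"area per core: {area_per_core(baseline):.2f} vs {area_per_core(trip):.2f} mm^2")
print(f"latency improvement: {improvement('latency_ms'):.2f}x")
print(f"energy improvement:  {improvement('energy_uj'):.2f}x")
```

Both rows give 0.47 mm² per core, which is consistent with the area being the core count times a fixed per-core area, and both ratios exceed 2×.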

To further analyze the behavior of TRIP on the SENECA neuromorphic processor, we visualized the details of each SENECA core in Figure 6, including memory costs, energy costs, and processing durations. The depth-first CNNs make it possible to parallelize the layers in a pipelined fashion within a single timebin, reducing the inference latency. Moreover, by parallelizing the three TRIP components, the ROI prediction and generation introduce much less latency overhead than sequential processing.

V Conclusion and Discussion

This paper presented TRIP, a hard attention framework for efficient event-based vision processing on neuromorphic processors. The method achieved state-of-the-art classification accuracy on multiple event-based datasets while exhibiting consistent efficiency improvements in both algorithmic analysis and actual hardware implementation. This demonstrates TRIP’s effectiveness in connecting hard attention with event-based vision and neuromorphic computing, offering a viable solution for efficient and low-cost high-resolution visual processing on neuromorphic processors.

A recent trend in sensing technologies involves the seamless integration of sensing and processing hardware to create ultra-efficient sensors with inherent in-sensor and near-sensor processing capabilities [28, 29]. Employing a similar concept, TRIP can trigger new hardware designs of event-based vision sensors. The ROI prediction network can be accelerated inside the sensory chip for near-sensor processing. The tGK and DAP for ROI generation naturally transform into hardware circuits next to the sensory array of the event-based camera. Such a specialized hardware design would increase efficiency and vastly decrease the data bandwidth required by the event stream, relieving the communication burden on downstream neuromorphic processors.

We showed here that TRIP performs efficient classification by actively focusing on regions of the input space. The framework naturally extends to event-based vision tasks with higher resolutions than those we have showcased. Emerging high-resolution event-based cameras, like the 1-megapixel Prophesee camera [30], exhibit promising capabilities for addressing complex applications while requiring substantial processing. Our proposed neuromorphic hard attention solution emerges as a compelling alternative to conventional CNN solutions when aiming for an end-to-end event-based vision system tailored for edge applications.

References

  • [1] J. Yik, S. H. Ahmed, Z. Ahmed, B. Anderson, A. G. Andreou, C. Bartolozzi, A. Basu, D. d. Blanken, P. Bogdan, S. Bohte et al., “Neurobench: Advancing neuromorphic computing through collaborative, fair and representative benchmarking,” arXiv preprint arXiv:2304.04640, 2023.
  • [2] Y. Xu, K. Shidqi, G.-J. van Schaik, R. Bilgic, A. Dobrita, S. Wang, R. Meijer, P. Nembhani, C. Arjmand, P. Martinello, A. Gebregiogis, S. Hamdioui, P. Detterer, S. Traferro, M. Konijnenburg, K. Vadivel, M. Sifalakis, G. Tang, and A. Yousefzadeh, “Optimizing event-based neural networks on digital neuromorphic architecture: A comprehensive design space exploration,” Frontiers in Neuroscience, vol. 18, 2024.
  • [3] B. Rueckauer, C. Bybee, R. Goettsche, Y. Singh, J. Mishra, and A. Wild, “Nxtf: An api and compiler for deep spiking neural networks on intel loihi,” ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 18, no. 3, pp. 1–22, 2022.
  • [4] H. Larochelle and G. E. Hinton, “Learning to combine foveal glimpses with a third-order boltzmann machine,” in Advances in Neural Information Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., vol. 23.   Curran Associates, Inc., 2010.
  • [5] V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual attention,” Advances in neural information processing systems, 2014. [Online]. Available: https://arxiv.org/pdf/1412.7755.pdf
  • [6] Y. Chai, “Patchwork: A patch-wise attention network for efficient object detection and segmentation in video streams,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3415–3424.
  • [7] G. Elsayed, S. Kornblith, and Q. V. Le, “Saccader: Improving accuracy of hard attention models for vision,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [8] G. Cohen, S. Afshar, G. Orchard, J. Tapson, R. Benosman, and A. van Schaik, “Spatial and temporal downsampling in event-based visual classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 5030–5044, 2018.
  • [9] G. Tang, A. Safa, K. Shidqi, P. Detterer, S. Traferro, M. Konijnenburg, M. Sifalakis, G.-J. van Schaik, and A. Yousefzadeh, “Open the box of digital neuromorphic processor: Towards effective algorithm-hardware co-design,” in 2023 IEEE International Symposium on Circuits and Systems (ISCAS), 2023, pp. 1–5.
  • [10] G. Tang, K. Vadivel, Y. Xu, R. Bilgic, K. Shidqi, P. Detterer, S. Traferro, M. Konijnenburg, M. Sifalakis, G.-J. van Schaik et al., “Seneca: building a fully digital neuromorphic processor, design trade-offs and challenges,” Frontiers in Neuroscience, vol. 17, 2023.
  • [11] A. Amir, B. Taba, D. Berg, T. Melano, J. McKinstry, C. Di Nolfo, T. Nayak, A. Andreopoulos, G. Garreau, M. Mendoza et al., “A low power, fully event-based gesture recognition system,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7243–7252.
  • [12] L. Müller, M. Sifalakis, S. Eissa, A. Yousefzadeh, P. Detterer, S. Stuijk, and F. Corradi, “Aircraft marshaling signals dataset of fmcw radar and event-based camera for sensor fusion,” in 2023 IEEE Radar Conference (RadarConf23).   IEEE, 2023, pp. 01–06.
  • [13] A. Subramoney, K. K. Nazeer, M. Schöne, C. Mayr, and D. Kappel, “Efficient recurrent architectures through activity sparsity and sparse back-propagation through time,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=lJdOlWg8td
  • [14] R. Massa, A. Marchisio, M. Martina, and M. Shafique, “An efficient spiking neural network for recognizing gestures with a dvs camera on the loihi neuromorphic processor,” in 2020 International Joint Conference on Neural Networks (IJCNN).   IEEE, 2020, pp. 1–9.
  • [15] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
  • [16] F. Kong and R. Henao, “Efficient classification of very large images with tiny objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2384–2394.
  • [17] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra, “Draw: A recurrent neural network for image generation,” in International conference on machine learning.   PMLR, 2015, pp. 1462–1471.
  • [18] M. Cannici, M. Ciccone, A. Romanoni, and M. Matteucci, “Attention mechanisms for object recognition with event-based cameras,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).   IEEE, 2019, pp. 1127–1136.
  • [19] Z. Zhu, A. Pourtaherian, L. Waeijen, E. Bondarev, and O. Moreira, “Star: Sparse thresholded activation under partial-regularization for activation sparsity exploration,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).   IEEE, 2023, pp. 4554–4563.
  • [20] W. He, Y. Wu, L. Deng, G. Li, H. Wang, Y. Tian, W. Ding, W. Wang, and Y. Xie, “Comparing snns and rnns on neuromorphic vision datasets: Similarities and differences,” Neural Networks, vol. 132, pp. 108–120, 2020.
  • [21] S. U. Innocenti, F. Becattini, F. Pernici, and A. Del Bimbo, “Temporal binary representation for event-based action recognition,” in 2020 25th International Conference on Pattern Recognition (ICPR).   IEEE, 2021, pp. 10 426–10 432.
  • [22] Z. Wu, H. Zhang, Y. Lin, G. Li, M. Wang, and Y. Tang, “Liaf-net: Leaky integrate and analog fire network for lightweight and efficient spatiotemporal information processing,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6249–6262, 2022.
  • [23] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor, “Converting static image datasets to spiking neuromorphic datasets using saccades,” Frontiers in neuroscience, vol. 9, p. 437, 2015.
  • [24] W. Fang, Y. Chen, J. Ding, Z. Yu, T. Masquelier, D. Chen, L. Huang, H. Zhou, G. Li, and Y. Tian, “Spikingjelly: An open-source machine learning infrastructure platform for spike-based intelligence,” Science Advances, vol. 9, no. 40, p. eadi1480, 2023.
  • [25] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning.   PMLR, 2019, pp. 6105–6114.
  • [26] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain et al., “Loihi: A neuromorphic manycore processor with on-chip learning,” Ieee Micro, vol. 38, no. 1, pp. 82–99, 2018.
  • [27] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam et al., “Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip,” IEEE transactions on computer-aided design of integrated circuits and systems, vol. 34, no. 10, pp. 1537–1557, 2015.
  • [28] F. Zhou and Y. Chai, “Near-sensor and in-sensor computing,” Nature Electronics, vol. 3, no. 11, pp. 664–671, 2020.
  • [29] Y. Zhou, J. Fu, Z. Chen, F. Zhuge, Y. Wang, J. Yan, S. Ma, L. Xu, H. Yuan, M. Chan et al., “Computational event-driven vision sensors for in-sensor spiking neural networks,” Nature Electronics, pp. 1–9, 2023.
  • [30] E. Perot, P. De Tournemire, D. Nitti, J. Masci, and A. Sironi, “Learning to detect objects with a 1 megapixel event camera,” Advances in Neural Information Processing Systems, vol. 33, pp. 16 639–16 652, 2020.

Appendix A Experiment Details

A-A DVS Gesture Dataset

The following section details the network architecture and training parameters used in the DVS Gesture experiments.

A-A1 Network Details

Table V shows the sequentially ordered layers from input to output of the ROI prediction network used with the DVS Gesture dataset. Similarly, Table VI shows the layer information of the classification network. Every batch normalization layer is followed by a ReLU activation function.

TABLE V: Layer overview of ROI prediction net, for TRIP network used in DVS Gesture results.
Layer                Input dimension  Channel dimensions  Kernel size  Padding (no. cells)  Number of neurons
Maxpooling           128×128          2                   8            0                    -
Convolutional        16×16            2, 32               3            1                    -
Maxpooling           16×16            32                  2            0                    -
Batch Normalization  8×8              32                  -            -                    -
Convolutional        8×8              32, 64              3            1                    -
Maxpooling           8×8              64                  2            0                    -
Batch Normalization  4×4              64                  -            -                    -
Convolutional        4×4              64, 128             3            1                    -
Maxpooling           4×4              128                 2            0                    -
Batch Normalization  2×2              128                 -            -                    -
ReLU RNN             512              -                   -            -                    256
Fully Connected      256              -                   -            -                    3
TABLE VI: Layer overview of classification net, for TRIP network used in DVS Gesture results.
Layer                Input dimension  Channel dimensions  Kernel size  Padding (no. cells)  Number of neurons
Convolutional        12×12            2, 32               3            1                    -
Maxpooling           12×12            32                  2            0                    -
Batch Normalization  6×6              32                  -            -                    -
Convolutional        6×6              32, 64              3            1                    -
Maxpooling           6×6              64                  2            0                    -
Batch Normalization  3×3              64                  -            -                    -
Fully Connected      576              -                   -            -                    256
Fully Connected      256              -                   -            -                    11
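The input sizes of the ReLU RNN in Table V (512) and of the first fully connected layer in Table VI (576) follow from the spatial sizes in the tables. The sketch below reproduces that arithmetic, assuming that pooling uses a stride equal to its kernel size and that the 3×3 convolutions with 1-cell padding preserve the spatial size; the function names are illustrative.

```python
def pooled(size, kernel):
    """Spatial size after non-overlapping max pooling (stride assumed equal to kernel)."""
    return size // kernel

def flattened_features(size, pool_kernels, out_channels):
    """Flattened feature count of a square feature map after the given pooling stages.

    The 3x3 convolutions with 1-cell padding keep the spatial size,
    so only the pooling layers shrink it.
    """
    for kernel in pool_kernels:
        size = pooled(size, kernel)
    return size * size * out_channels

# ROI prediction network (Table V): 8x8 pooling, then three conv + 2x2 pooling stages.
roi_features = flattened_features(128, [8, 2, 2, 2], out_channels=128)
# Classification network (Table VI): 12x12 ROI, two conv + 2x2 pooling stages.
cls_features = flattened_features(12, [2, 2], out_channels=64)
print(roi_features, cls_features)
```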

A-A2 Training details

Table VII lists the training hyperparameters used in the DVS Gesture experiments. The accuracy reported in the paper is the mean of the best accuracies obtained from five separate experiments with five random parameter initializations. The training dataset is augmented using the torchvision transforms package. The data augmentation randomly scales samples with a scaling factor between 0.6 and 1.0, applies a random perspective transformation with a distortion parameter of 0.5, and randomly rotates samples between 0 and 25 degrees. SpikingJelly preprocessing is used to split each sample by frame into 32 timebins.
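The conversion of an event stream into 32 timebins can be pictured with the following standalone sketch. It does not call SpikingJelly; it assumes `(x, y, t, p)` event tuples and equal-duration bins, whereas the text does not state whether the split is by time or by event count. The function name `events_to_timebins` is hypothetical.

```python
from collections import Counter

def events_to_timebins(events, n_bins=32):
    """Accumulate (x, y, t, p) events into n_bins equal-duration sparse frames.

    Each frame is a Counter mapping (p, y, x) to the number of events
    that fell into that pixel and polarity during the bin.
    """
    frames = [Counter() for _ in range(n_bins)]
    if not events:
        return frames
    t_min = min(e[2] for e in events)
    duration = max(max(e[2] for e in events) - t_min, 1)   # avoid division by zero
    for x, y, t, p in events:
        # The last timestamp would map to n_bins, so clamp it into the final bin.
        idx = min(n_bins - 1, int((t - t_min) * n_bins // duration))
        frames[idx][(p, y, x)] += 1
    return frames

# 65 events at one pixel with timestamps 0..64 spread over 32 bins.
frames = events_to_timebins([(0, 0, t, 1) for t in range(65)])
print(len(frames), sum(sum(f.values()) for f in frames))
```

Keeping each frame sparse mirrors the low event density that the framework relies on; a dense two-channel tensor can be built from the counters when a network input is needed.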

TABLE VII: Training parameters used in DVS Gesture results.
Parameter Value
Learning rate 0.0001
Training dataset size 1176
Test dataset size 288
Training batch size 32
Testing batch size 32
Number of epochs 1000
Optimizer Adam

A-B Marshalling Signals Dataset

On the Marshalling Signals dataset, we employ a residual connection that concatenates the RNN output to the input of the first FC layer in the classification network for improved learning stability and accuracy. The detailed parameters of the sequentially ordered layers in the ROI prediction network and in the classification network are listed in Tables VIII and IX, respectively. Every batch normalization layer is followed by a ReLU activation function.

A-B1 Network Details

TABLE VIII: Layer overview of ROI prediction net, for TRIP network used in Marshalling Signals results.
Layer                Input dimension  Channel dimensions  Kernel size  Padding (no. cells)  Number of neurons
Maxpooling           346×224          2                   8            0                    -
Convolutional        43×28            2, 32               3            1                    -
Maxpooling           43×28            32                  2            0                    -
Batch Normalization  21×14            32                  -            -                    -
Convolutional        21×14            32, 64              3            1                    -
Maxpooling           21×14            64                  2            0                    -
Batch Normalization  10×7             64                  -            -                    -
Convolutional        10×7             64, 128             3            1                    -
Maxpooling           10×7             128                 2            0                    -
Batch Normalization  5×3              128                 -            -                    -
ReLU RNN             1920             -                   -            -                    512
Fully Connected      512              -                   -            -                    3
TABLE IX: Layer overview of classification net, for TRIP network used in Marshalling Signals results.
Layer                Input dimension  Channel dimensions  Kernel size  Padding (no. cells)  Number of neurons
Convolutional        12×12            2, 32               3            1                    -
Maxpooling           12×12            32                  2            0                    -
Batch Normalization  6×6              32                  -            -                    -
Convolutional        6×6              32, 64              3            1                    -
Maxpooling           6×6              64                  2            0                    -
Batch Normalization  3×3              64                  -            -                    -
Fully Connected      576              -                   -            -                    256
Fully Connected      256              -                   -            -                    11

A-B2 Training details

The training hyperparameters used for the Marshalling Signals dataset are listed in Table X. The accuracy reported in the paper is the best accuracy obtained during a single experiment. The training dataset is augmented using the torchvision transforms package. The data augmentation randomly scales samples with a scaling factor between 0.6 and 1.0, applies a random perspective transformation with a distortion parameter of 0.5, and randomly rotates samples between 0 and 25 degrees.

TABLE X: Training parameters used in Marshalling Signals results.
Parameter Value
Learning rate 0.001
Training dataset size 11,040
Test dataset size 930
Training batch size 128
Testing batch size 128
Number of epochs 1000
Optimizer Adam

A-C Synthetic Dataset Based on N-MNIST

A-C1 Network Details

The layer-by-layer network details of the ROI prediction network and classification network used for the synthetic dataset based on N-MNIST are listed in Table XI and Table XII, respectively. Every convolutional layer is followed by a ReLU activation function. The maxpooling layers in the classification network use a stride equal to 1. The parameters shown in parentheses refer to the 32×32 input size TRIP: it uses a kernel size of 4 (instead of 8 in the 16×16 input size TRIP) in the initial downsampling maxpooling layer, and has an input size of 1600 (instead of 256 in the 16×16 input size TRIP) in the ReLU RNN. The remaining network parameters are the same for both the 32×32 and 16×16 input size TRIP.

TABLE XI: Layer overview of ROI prediction net, for TRIP network used in synthetic N-MNIST-based dataset results.
Layer                Input dimension  Channel dimensions  Kernel size  Padding (no. cells)  Number of neurons
Maxpooling           128×128          2                   8 (4)        0                    -
Convolutional        16×16            2, 32               5            1                    -
Maxpooling           14×14            32                  2            0                    -
Convolutional        7×7              32, 64              5            1                    -
Maxpooling           5×5              64                  2            0                    -
ReLU RNN             256 (1600)       -                   -            -                    256
Fully Connected      256              -                   -            -                    3
TABLE XII: Layer overview of classification net, for TRIP network used in synthetic N-MNIST-based dataset results.
Layer                Input dimension  Channel dimensions  Kernel size  Padding (no. cells)  Number of neurons
Convolutional        12×12            2, 32               5            1                    -
Maxpooling           10×10            32                  2            0                    -
Convolutional        9×9              32, 64              5            1                    -
Maxpooling           7×7              64                  2            0                    -
Fully Connected      2304             -                   -            -                    10
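Because these networks use 5×5 kernels with 1-cell padding and, in the classification network, stride-1 pooling, the spatial sizes in Tables XI and XII are less obvious than in the DVS Gesture case. The sketch below traces them with the standard output-size formula for the 16×16 variant; the `trace` helper and its tuple encoding are assumptions of this example.

```python
def conv_out(size, kernel, padding, stride=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel, stride):
    """Output size of max pooling (no padding) along one spatial dimension."""
    return (size - kernel) // stride + 1

def trace(size, layers):
    """Return the spatial size before each layer and after the last one.

    Each layer is ("conv", kernel, padding) or ("pool", kernel, stride).
    """
    sizes = [size]
    for kind, kernel, arg in layers:
        size = conv_out(size, kernel, arg) if kind == "conv" else pool_out(size, kernel, arg)
        sizes.append(size)
    return sizes

# ROI prediction network, 16x16 variant (Table XI): pooling stride equals kernel size.
roi = trace(128, [("pool", 8, 8), ("conv", 5, 1), ("pool", 2, 2),
                  ("conv", 5, 1), ("pool", 2, 2)])
# Classification network (Table XII): pooling uses stride 1, as stated in the text.
cls = trace(12, [("conv", 5, 1), ("pool", 2, 1), ("conv", 5, 1), ("pool", 2, 1)])
print(roi, roi[-1] ** 2 * 64)
print(cls, cls[-1] ** 2 * 64)
```

The final sizes give 2×2×64 = 256 inputs for the ReLU RNN and 6×6×64 = 2304 inputs for the fully connected layer, matching the tables.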

The network details of the 16×16 input baseline network used in the experiments with the synthetic dataset based on N-MNIST are listed in Table XIII. The details of the 32×32 baseline are listed in Table XIV, and the details of the 64×64 baseline are listed in Table XV. Every convolutional layer is followed by a ReLU activation function.

TABLE XIII: Layer overview of the 16×16 input resolution baseline network used in synthetic N-MNIST-based dataset results.
Layer                Input dimension  Channel dimensions  Kernel size  Padding (no. cells)  Number of neurons
Maxpooling           128×128          2                   8            0                    -
Convolutional        16×16            2, 16               5            2                    -
Maxpooling           16×16            16                  2            0                    -
Convolutional        8×8              16, 32              5            2                    -
Maxpooling           8×8              32                  2            0                    -
Convolutional        4×4              32, 64              5            2                    -
Maxpooling           4×4              64                  2            0                    -
Convolutional        2×2              64, 128             5            2                    -
Maxpooling           2×2              128                 2            0                    -
ReLU RNN             128              -                   -            -                    128
Fully Connected      128              -                   -            -                    64
Fully Connected      64               -                   -            -                    10
TABLE XIV: Layer overview of the 32×32 input resolution baseline network used in synthetic N-MNIST-based dataset results.
Layer                Input dimension  Channel dimensions  Kernel size  Padding (no. cells)  Number of neurons
Maxpooling           128×128          2                   4            0                    -
Convolutional        32×32            2, 16               5            2                    -
Maxpooling           32×32            16                  2            0                    -
Convolutional        16×16            16, 32              5            2                    -
Maxpooling           16×16            32                  2            0                    -
Convolutional        8×8              32, 64              5            2                    -
Maxpooling           8×8              64                  2            0                    -
Convolutional        4×4              64, 128             5            2                    -
Maxpooling           4×4              128                 2            0                    -
ReLU RNN             512              -                   -            -                    384
Fully Connected      384              -                   -            -                    128
Fully Connected      128              -                   -            -                    10
TABLE XV: Layer overview of the 64×64 input resolution baseline network used in synthetic N-MNIST-based dataset results.
Layer                Input dimension  Channel dimensions  Kernel size  Padding (no. cells)  Number of neurons
Maxpooling           128×128          2                   2            0                    -
Convolutional        64×64            2, 16               5            1                    -
Maxpooling           62×62            16                  2            0                    -
Convolutional        31×31            16, 32              5            1                    -
Maxpooling           29×29            32                  2            0                    -
Convolutional        15×15            32, 64              5            1                    -
Maxpooling           13×13            64                  2            0                    -
Convolutional        6×6              64, 128             5            1                    -
Maxpooling           4×4              128                 2            0                    -
ReLU RNN             512              -                   -            -                    384
Fully Connected      384              -                   -            -                    128
Fully Connected      128              -                   -            -                    10

A-C2 Training details

The training hyperparameters used in all of the experiments with the synthetic dataset based on N-MNIST are listed in Table XVI. The accuracies reported in the paper are the mean of the best accuracies obtained from five different experiments with five random parameter initializations.

TABLE XVI: Training parameters used in synthetic N-MNIST-based dataset results.
Parameter Value
Learning rate 0.0006
Training dataset size 50,000
Validation dataset size 10,000
Test dataset size 10,000
Training batch size 32
Testing batch size 64
Number of epochs 5
Optimizer Adam

Appendix B HW Benchmarking details

B-A Baseline Network

Table XVII lists the layer-by-layer network parameters of the baseline network implemented on SENECA. Every batch normalization layer is followed by a ReLU activation function.

TABLE XVII: Layer overview of baseline network used in the SENECA benchmarking results.
Layer                Input dimension  Channel dimensions  Kernel size  Padding (no. cells)  Number of neurons
Maxpooling           128×128          2                   4            0                    -
Convolutional        32×32            2, 32               3            1                    -
Maxpooling           32×32            32                  2            0                    -
Batch Normalization  16×16            32                  -            -                    -
Convolutional        16×16            32, 64              3            1                    -
Maxpooling           16×16            64                  2            0                    -
Batch Normalization  8×8              64                  -            -                    -
Convolutional        8×8              64, 128             3            1                    -
Maxpooling           8×8              128                 2            0                    -
Batch Normalization  4×4              128                 -            -                    -
Convolutional        4×4              128, 128            3            1                    -
Maxpooling           4×4              128                 2            0                    -
Batch Normalization  2×2              128                 -            -                    -
Convolutional        2×2              128, 128            3            1                    -
Maxpooling           2×2              128                 2            0                    -
Batch Normalization  1×1              128                 -            -                    -
ReLU RNN             128              -                   -            -                    256
Fully Connected      256              -                   -            -                    11

B-B Hardware Measurement and Comparison

All hardware-related measurements were performed in gate-level simulation using industry-standard ASIC simulation and power measurement tools (Cadence Xcelium and Cadence JOULES) for the GF 22 nm FDX technology node (typical corner, 0.8 V and 25 °C, no back-biasing). The power results are accurate to within 15% of signoff power and include the total power consumption of the chip, i.e., both dynamic and static power. The latency results are cycle-accurate at a design frequency of 500 MHz. As with the other compared results, we have not included I/O power consumption and latency in the reported results. In the reference comparison with other chips, the Loihi energy results include only dynamic power, whereas the TrueNorth energy result includes total power.
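Given the 500 MHz design frequency, the reported latencies and energies can be converted into clock cycles and average power. The numbers produced below are derived for illustration and are not reported in the paper; the TRIP latency and energy values are taken from Table IV, and the helper names are hypothetical.

```python
F_HZ = 500e6   # design frequency stated in the text

def cycles(latency_ms):
    """Number of clock cycles corresponding to a latency at the design frequency."""
    return round(latency_ms * 1e-3 * F_HZ)

def avg_power_mw(energy_uj, latency_ms):
    """Average power over one inference; microjoules per millisecond equals milliwatts."""
    return energy_uj / latency_ms

# TRIP on SENECA (Table IV): single-timebin and multiple-timebin results.
for label, latency_ms, energy_uj in [("single timebin", 2.7, 35.86),
                                     ("multiple timebins", 25.8, 430.32)]:
    print(f"{label}: {cycles(latency_ms):,} cycles, "
          f"{avg_power_mw(energy_uj, latency_ms):.1f} mW average")
```

This is a back-of-the-envelope view only: it averages over the whole inference and, like the reported results, excludes I/O.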

Appendix C Sample Visualizations

C-A DVS Gesture

Figure 7 shows one test sample per gesture class in the DVS Gesture dataset. The first five sequentially ordered timebins from each sample are shown starting from the left, and the ROI receptive field is visualized as a yellow square superimposed on the image.

Figure 7: Samples of DVS Gesture dataset with ROI receptive field superimposed as yellow square.

C-B Marshalling Signals

Example gestures from every class in the Marshalling Signals test dataset are visualized in Figures 8 and 9. The test dataset does not contain every possible combination of distance and gesture; every unique combination that occurs is shown in the figures. The distance labels indicate the distance in centimeters from the camera at which the gesture was recorded. The gestures with distance label “xxx” are samples from real-world output distribution data with unknown distance.

Figure 8: Samples of Marshalling Signals with ROI receptive field superimposed as yellow square.
Figure 9: Samples of Marshalling Signals with ROI receptive field superimposed as yellow square.

C-C Synthetic Dataset Based on N-MNIST

An example test sample from each digit class of the synthetic dataset based on N-MNIST is visualized in Figure 10, together with the ROI receptive field as a superimposed yellow square. The N-MNIST dataset is recorded in such a way that the digit disappears and reappears between timebins. The first timebins of a sample are empty, and the digit cannot be seen until it appears a few timebins later. Figure 10 shows that the ROI receptive field initially sits somewhere near the center and moves to the location of the digit as soon as the digit begins to appear. In the case of digit 6, a piece of structured noise appears before the digit, and the receptive field begins moving towards the noise. However, once the digit has appeared, the receptive field changes direction and moves towards the digit instead.

Figure 10: Samples of synthetic dataset based on N-MNIST with ROI receptive field superimposed as yellow square.