TRIP

Cina Arjmand¹, Yingfu Xu¹, Kevin Shidqi¹, Alexandra F. Dobrita¹, Kanishkan Vadivel¹,
Paul Detterer¹, Manolis Sifalakis¹, Amirreza Yousefzadeh² and Guangzhi Tang³,†
¹ imec, Eindhoven, The Netherlands
² EEMCS, University of Twente, Enschede, The Netherlands
³ DACS, Maastricht University, Maastricht, The Netherlands

Abstract

Neuromorphic processors are well-suited for efficiently handling sparse events from event-based cameras. However, they face significant challenges in the growth of computing demand and hardware costs as the input resolution increases. This paper proposes the Trainable Region-of-Interest Prediction (TRIP), the first hardware-efficient hard attention framework for event-based vision processing on a neuromorphic processor. Our TRIP framework actively produces low-resolution Regions of Interest (ROIs) for efficient and accurate classification. The framework exploits sparse events’ inherent low information density to reduce the overhead of ROI prediction. We introduced extensive hardware-aware optimizations for TRIP and implemented the hardware-optimized algorithm on the SENECA neuromorphic processor. We utilized multiple event-based classification datasets for evaluation. Our approach achieves state-of-the-art accuracies on all datasets and produces reasonable ROIs with varying locations and sizes. On the DvsGesture dataset, our solution requires $46\times$ less computation than the state-of-the-art while achieving higher accuracy. Furthermore, TRIP enables more than $2\times$ latency and energy improvements on the SENECA neuromorphic processor compared to the conventional solution.

I Introduction

Low-power and low-latency event-based vision is uniquely suited for edge applications. Given the efficiency of sensing, developing equally efficient processing becomes crucial for optimizing the performance of edge solutions. Since the event-based camera inherently generates sparse data, exploiting this sparsity is essential for enhancing the processing efficiency. Neuromorphic computing offers event-driven solutions to process sparse data streams efficiently, making it a natural fit for event-based vision [1, 2]. However, with the growing resolution of event-based cameras, neuromorphic computing faces computing and hardware cost challenges [1]. These challenges are further amplified when employing Convolutional Neural Networks (CNNs), as the computational expenses and on-chip memory demands for processing CNNs on neuromorphic processors increase with input resolution [3].

Figure 1: Overview of TRIP performing event-based vision classification on the SENECA neuromorphic processor.

To address the challenges of high-resolution visual processing, one approach is the hard attention algorithm, which selectively focuses on regions of an input image for processing [4, 5]. Compared to uniformly downsampling the entire input, the hard attention mechanism actively chooses regions of interest (ROI) with more critical information, improving accuracy while limiting the computing and memory costs of network processing. However, hard attention algorithms require an additional neural network to predict the ROI accurately. This demands sophisticated training methods and introduces additional overheads for visual processing. Therefore, the benefits gained from processing a reduced-dimension ROI can be offset by the high costs of ROI prediction [6]. The trade-off becomes particularly pronounced as the complexity of the scene increases, potentially negating the efficiency gains in ROI processing [7].

Interestingly, the inherent sparsity of event-based vision reduces the information density of scenes [8], which can potentially mitigate the hard attention overhead on ROI prediction. This characteristic opens up opportunities for efficient event-based vision processing, especially when hard attention is integrated with neuromorphic processors. By reducing input dimensionality, the hard attention algorithm can significantly reduce the computational and memory demands of CNNs on neuromorphic processors. Moreover, the event-driven processing further diminishes the latency and energy overhead associated with hard attention when utilizing CNNs with sparse activation [9]. This synergy opens prospects for tailoring hard attention algorithms on the neuromorphic processor.

In this paper, we propose the Trainable Region-of-Interest Prediction (TRIP) framework for hardware-efficient event-based vision processing on the neuromorphic processor. Our TRIP framework performs efficient ROI prediction with low-resolution event streams and supports end-to-end training by employing differentiable truncated Gaussian kernels (tGK) for ROI generation. We introduced hardware-aware optimizations for TRIP to improve the algorithm’s hardware efficiency without sacrificing accuracy. We implemented the hardware-optimized TRIP algorithm on the SENECA neuromorphic processor [10] and evaluated our method on event-based classification datasets [11, 12]. Our method achieves state-of-the-art accuracies while reducing the computation cost by $46\times$ compared to the state-of-the-art efficient algorithm [13]. Compared to neuromorphic solutions on Intel’s Loihi and IBM’s TrueNorth neuromorphic processors [3, 11, 14], our TRIP-based solution significantly reduces the area and energy consumption while having higher accuracy.

II Related Works

II-A Hard Attention Visual Processing

Hard attention strategies for restricting computations by directing image processing towards relevant regions of input space have long been explored in computer vision. Early models analyze low-level image features to predict regions of high saliency based on variations in pixel intensity [15]. Later works increasingly emphasized the task of salient region prediction as an action selection policy [5], iteratively improving predictions over time. Reinforcement learning (RL) algorithms have been adopted in hard attention to learn the optimal policy for placing a sensor with limited bandwidth on a given input region [7, 6, 16]. While RL-based hard attention is effective, the training complexity poses major challenges for hardware-aware optimization.

Deep Recurrent Attention Writer (DRAW) uses recurrent units within a variational autoencoder to iteratively predict salient regions of input images [17]. Importantly, DRAW uses a differentiable mechanism for generating an ROI with Gaussian kernels, enabling end-to-end backpropagation training without using RL. Neuromorphic DRAW applied the differentiable cropping of DRAW to event-based classification tasks to improve accuracy by filtering out irrelevant events [18]. Our TRIP framework leverages DRAW’s Gaussian kernels to facilitate differentiable hard attention while introducing hardware-efficient algorithm designs for event-based vision processing on neuromorphic processors.

II-B Event-based CNN and SENECA Neuromorphic Processor

Event-based CNN, trained by specialized activation regularization methods, has high activation sparsity within each network layer [19]. SENECA is a multi-core embedded digital neuromorphic processor specialized in processing event-based CNNs [10]. It performs event-driven computation that exploits the sparsity in sensory inputs and network activations. Additionally, it executes data-flow processing across cores, increasing the parallelism of network processing and diminishing the memory cost for neural activations. Event-driven depth-first convolution is a unique scheduling method SENECA supports for event-based CNNs [2]. The method prioritizes the network’s layer dimension by consuming neural activation events right after their generation. Therefore, it maximizes the neuromorphic processor’s benefits on parallelism and latency. Our TRIP framework with event-based CNN maximizes the hardware efficiency of hard attention on SENECA by exploiting the hardware advantages.

III Method

III-A TRIP: Trainable Region-of-Interest Prediction

We propose the Trainable Region-of-Interest Prediction (TRIP) framework for efficient event-based classification. It uses hard attention within an event-driven neuromorphic processing pipeline. The framework efficiently classifies event streams using an actively generated ROI that is predicted from the input events. An ROI’s receptive field covers a small region of the event-based camera’s field of view. As shown in Figure 2, our TRIP framework consists of three subsequent components: ROI prediction, ROI generation, and classification. The ROI prediction component consists of an event-based CNN that determines the location and receptive field of the ROI. It predicts the ROI parameters using a downsampled low-resolution input, reducing the processing overhead of ROI prediction. The ROI generation component generates the cropped ROI using the predicted parameters. It uses an $N\times N$ grid of differentiable truncated Gaussian kernels (tGK) to produce a fixed $N\times N$ output from a varying-size receptive field. This ensures consistently low processing cost of classification. Moreover, we introduce dynamic average pooling (DAP) to replace tGK for efficient inference on the embedded neuromorphic processor. The classification component consists of an event-based CNN that performs classification on the ROI. The entire framework is differentiable, allowing it to be trained end-to-end. For efficient computing on SENECA, we increase the activation sparsity of the event-based networks during training.

III-B ROI Prediction

The ROI prediction component produces the ROI parameters based on the downsampled input events from max-pooling. The ROI prediction network outputs three scalar values. These values are decoded to determine the ROI location and receptive field as follows,

$$g_{x}=\frac{A}{2}\cdot(\tanh(\hat{g_{x}})+1) \tag{1}$$

$$g_{y}=\frac{B}{2}\cdot(\tanh(\hat{g_{y}})+1) \tag{2}$$

$$\delta=S\cdot(\operatorname{sigmoid}(\hat{\delta})+1) \tag{3}$$

where $\hat{g_{x}},\hat{g_{y}},\hat{\delta}$ are the raw scalar outputs from the ROI prediction network, ( $g_{x},g_{y}$ ) is the center location of the overall receptive field formed by the $N\times N$ grid of tGK, $\delta$ is the distance between two adjacent tGK, $A$ and $B$ are the image width and height, and $S$ is a distance scaling factor. The $g_{x}$ and $g_{y}$ parameters are always initialized at the center of the image at the start of training and can move across the entire input space. The variable $\delta$ controls the size of the receptive field of the ROI.
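The decoding in Eqs. (1)-(3) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `decode_roi_params` and the example scaling factor `scale=4.0` are hypothetical and not taken from the paper, and the sigmoid is written in its equivalent tanh form for numerical stability.

```python
import math


def decode_roi_params(g_x_hat, g_y_hat, delta_hat, width, height, scale):
    """Decode raw network outputs into the ROI center and tGK spacing (Eqs. 1-3)."""
    g_x = width / 2.0 * (math.tanh(g_x_hat) + 1.0)    # lies in [0, width]
    g_y = height / 2.0 * (math.tanh(g_y_hat) + 1.0)   # lies in [0, height]
    # sigmoid(z) = 0.5 * (tanh(z / 2) + 1), which avoids overflow for large |z|
    sigmoid = 0.5 * (math.tanh(0.5 * delta_hat) + 1.0)
    delta = scale * (sigmoid + 1.0)                   # lies in [scale, 2 * scale]
    return g_x, g_y, delta


# Zero raw outputs place the ROI at the image center (hypothetical S = 4.0).
g_x, g_y, delta = decode_roi_params(0.0, 0.0, 0.0, width=128, height=128, scale=4.0)
print(g_x, g_y, delta)  # 64.0 64.0 6.0
```

Because $\tanh(0)=0$, raw outputs near zero correspond to an ROI centered on the image, which matches the stated initialization at the start of training.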

III-C ROI Generation

We employed tGK for ROI generation, an efficient variation of the method introduced in DRAW [17]. The ROI generation component generates the $N\times N\times 2$ fixed-resolution input ROI event streams for classification, in which $N$ is the width and height of the ROI and $2$ is the polarity channel. The component uses $N\times N$ differentiable tGK to compute the ROI during training. The 2D mean positions of the tGK are computed according to the predicted center location of the overall receptive field as follows,

$$\mu_{x}^{i}=g_{x}+(i-\frac{N}{2}-0.5)\cdot\delta,\ i\in[1,N] \tag{4}$$

where $\mu_{x}^{i}$ is the mean x-axis position of the tGK on the $i^{th}$ column. The mean y-axis position of the tGK can be computed using the same equation with $g_{y}$ . Eventually, each tGK has a two-dimensional mean position $(\mu_{x}^{i},\mu_{y}^{j})$ , in which $i$ and $j$ are the column and row index. Here, we assume $N$ is an even number.
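As a small worked example of Eq. (4), the sketch below lists the kernel means along one axis. It assumes a 1-based column index ($i=1,\dots,N$), which is the indexing under which the grid is centered on $g_{x}$ as the text describes; the function name `tgk_means` and the numeric values are illustrative only.

```python
def tgk_means(g, delta, n):
    """Mean positions of the n kernels along one axis (Eq. 4, 1-based index i)."""
    return [g + (i - n / 2 - 0.5) * delta for i in range(1, n + 1)]


means = tgk_means(g=64.0, delta=6.0, n=12)
print(means[0], means[-1])      # 31.0 97.0
print(sum(means) / len(means))  # 64.0, i.e. the grid is centered on g
```

The same function applied to $g_{y}$ gives the row positions, so the full grid is the Cartesian product of the two lists.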

By knowing the mean positions of the tGK, we compute the weight of the tGK corresponding to each pixel location of the input events as follows,

$$F_{x}^{i}[n]=\begin{cases}\exp\left(-\frac{(n-\mu_{x}^{i})^{2}}{2\sigma}\right)&\text{for }n\in[\mu_{x}^{i}-\frac{\theta}{2},\ \mu_{x}^{i}+\frac{\theta}{2}]\\ 0&\text{otherwise}\end{cases} \tag{5}$$

where $F_{x}^{i}[n]$ is the x-dimension weight component of tGK on the $i^{th}$ column corresponding to pixel locations at column $n$ , $\sigma$ is the variance which is a pre-defined parameter, and $\theta$ is the size of the tGK with non-zero weights. The y-dimension weight component is computed in a similar manner.

Each ROI input event value to the classification network is computed by the corresponding tGK as follows,

$$v_{(x_{i},y_{j})}=\mathbf{F_{x}^{i}}\cdot\mathbf{I}\cdot\mathbf{{F_{y}^{j}}} \tag{6}$$

where $v_{(x_{i},y_{j})}$ is the value of the event at location $(x_{i},y_{j})$ of the $N\times N\times 2$ input to the classification network, $\mathbf{F_{x}^{i}}$ and $\mathbf{{F_{y}^{j}}}$ are the weights for the corresponding tGK, and $\mathbf{I}$ is the binned raw input events from one polarity. The two polarity channels of the classification inputs are computed using the same equation.

Compared to the Gaussian kernels introduced by DRAW, our tGK significantly reduces the computation required for ROI generation while remaining differentiable. Specifically, our adoption of tGK reduces the computational complexity of ROI generation from $O(AB)$ to $O(\theta^{2})$ by skipping the pixel locations with insignificant weights. Since $\theta$ is at least ten times smaller than $A$ and $B$ in practice, the tGK can be orders of magnitude more efficient than Gaussian kernels.
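To make the truncation concrete, the following sketch evaluates one ROI value in the spirit of Eqs. (5) and (6) while visiting only the pixels inside the $\theta$-wide window. It is a plain-Python illustration rather than the training implementation: the helper names are hypothetical, the exponent is taken as negative so that weights decay away from the kernel mean, and the image is assumed to be indexed as `image[row][col]` for a single polarity.

```python
import math


def tgk_window(mu, sigma, theta, size):
    """(index, weight) pairs for the pixels inside the truncation window (Eq. 5)."""
    first = max(0, math.ceil(mu - theta / 2.0))
    last = min(size - 1, math.floor(mu + theta / 2.0))
    return [(n, math.exp(-((n - mu) ** 2) / (2.0 * sigma)))
            for n in range(first, last + 1)]


def roi_value(image, mu_x, mu_y, sigma, theta):
    """One ROI value (Eq. 6), touching at most (theta + 1)^2 pixels."""
    height, width = len(image), len(image[0])
    total = 0.0
    for row, w_y in tgk_window(mu_y, sigma, theta, height):
        for col, w_x in tgk_window(mu_x, sigma, theta, width):
            total += w_y * image[row][col] * w_x
    return total


# Toy 8x8 frame with a single event count at row 3, column 4.
image = [[0.0] * 8 for _ in range(8)]
image[3][4] = 1.0
print(roi_value(image, mu_x=4.0, mu_y=3.0, sigma=1.0, theta=4.0))  # 1.0
```

The two nested loops run over the window only, which is where the $O(\theta^{2})$ cost per kernel comes from; a full Gaussian kernel would iterate over all $A\times B$ pixels instead.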

III-D Hardware-Efficient Dynamic Average Pooling

The tGK can be accelerated by customized application-specific integrated circuit (ASIC) designs. However, its efficiency is hard to achieve on the embedded CPU within our targeted neuromorphic processor. Though the number of computations is small, the overheads of locating the non-zero elements and performing weighted operations for tGK are substantial. Firstly, assigning each input event to multiple overlapping kernels requires a complex implementation to avoid iterating over all kernels, introducing significant overhead to the instruction memory. Secondly, the distance between the kernel center and the event location must be computed for each assigned event, bringing additional computation overhead. Hence, tGK on an embedded core with limited instruction memory and compute capability is not feasible.

To mitigate the problem, we introduce Dynamic Average Pooling (DAP) as a hardware-efficient alternative for ROI generation during inference on the embedded neuromorphic processor. The DAP replaces the Gaussian kernels with simple non-overlapping average pooling windows. The kernel size of the average pooling changes dynamically based on the size of the overall receptive field the ROI corresponds to. We compute the range of the overall receptive field using the ROI parameters from the ROI prediction component as follows,

$$x_{max}=g_{x}+(\frac{N}{2}-0.5)\cdot\delta+\frac{\theta}{2} \tag{7}$$

$$x_{min}=g_{x}-(\frac{N}{2}-0.5)\cdot\delta-\frac{\theta}{2} \tag{8}$$

and the dynamic kernel size of each average pooling in the DAP is computed as follows,

$$k_{DAP}=(x_{max}-x_{min})/N \tag{9}$$

where $x_{max}$ and $x_{min}$ define the range of the receptive field on the x-axis of the raw input space. The range on the y-axis can be computed using the same equations with $g_{y}$ and the receptive field is square.

The embedded implementation of DAP is simple. A closed-form equation exists to compute the sole corresponding kernel of each input event, and the ROI generation is distance invariant. However, the ROI parameters in DAP are not differentiable. Therefore, we first train the networks with tGK, and then fine-tune the classification network with a fixed ROI prediction network and DAP.
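A minimal sketch of how each event could be assigned to its single DAP window is given below. The paper states that a closed-form mapping exists but does not spell it out, so the floor-division formula here is an assumption, as are the function names and the example range of 28 to 100 pixels; the sketch handles one polarity and takes the receptive-field range as an input.

```python
def dap_bin(x, x_min, x_max, n):
    """Index of the single DAP window containing x, or None outside the ROI range."""
    if not (x_min <= x < x_max):
        return None
    k_dap = (x_max - x_min) / n                     # Eq. 9
    return min(n - 1, int((x - x_min) // k_dap))    # assumed closed-form mapping


def dap_roi(events, x_range, y_range, n):
    """Average (x, y) events of one polarity into an n x n ROI."""
    k_x = (x_range[1] - x_range[0]) / n
    k_y = (y_range[1] - y_range[0]) / n
    roi = [[0.0] * n for _ in range(n)]
    for x, y in events:
        col = dap_bin(x, x_range[0], x_range[1], n)
        row = dap_bin(y, y_range[0], y_range[1], n)
        if col is not None and row is not None:
            roi[row][col] += 1.0 / (k_x * k_y)      # average over the window area
    return roi


roi = dap_roi([(30, 30), (31, 32), (99, 99)], (28.0, 100.0), (28.0, 100.0), n=12)
print(roi[0][0], roi[11][11])
```

Each event costs one subtraction, one division, and one accumulation per axis, with no per-kernel distance computation, which is the property that makes this form attractive on an embedded core.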

III-E Hardware-Aware Event-based CNN

The event-driven neuromorphic processor exploits activation sparsity in neural networks by only processing non-zero activations. Therefore, the input to each layer for synaptic operation should be as sparse as possible. To maximize the efficiency of TRIP on the event-driven neuromorphic processor, we adopt event-based CNNs for ROI prediction and classification. Unlike a regular CNN, our event-based convolutional layer for the neuromorphic processor performs BatchNorm and MaxPool before the ReLU activation, outputting sparse events straight to the subsequent layer for synaptic integration. Moreover, we use the ReLU function in the vanilla RNN for sparse recurrent processing. Furthermore, we perform hardware-aware optimizations on the event-based CNNs. The optimizations comprise sparsity-aware and quantization-aware training, reducing the projected computation cost and memory requirement on the hardware.

To increase the activation sparsity of event-based CNNs in TRIP, we adopt the $L1$ regularization loss [19] on the activation values of the layers that have ReLU as the activation function. The loss encourages the network to reduce the activation values so as to have fewer non-zero activations and increase the sparsity. Additionally, we use quantization-aware training to reduce weight precision to 4 bits [2]. There is a shared power-of-two scaling factor $s$ for all the weights of the same layer. During on-chip computation, a weight value is obtained by multiplying the saved 4-bit integer with $2^{s}$ . The quantized parameters reduce the on-chip memory required for network deployments and the computation cost of synaptic integration.
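The two ingredients described above can be illustrated with a short sketch: an $L1$ penalty on ReLU activations, and 4-bit weights that share one power-of-two scale $2^{s}$ per layer. The rule used here to pick $s$ (the smallest power of two that lets the largest weight fit), the signed range $[-8, 7]$, and all function names are assumptions for illustration; the actual quantization-aware training procedure is not reproduced.

```python
import math


def l1_activation_loss(activations, strength):
    """L1 penalty on (non-negative) ReLU activations; `strength` is a hypothetical weight."""
    return strength * sum(abs(a) for a in activations)


def quantize_layer(weights, bits=4):
    """Quantize one layer to signed integers sharing a power-of-two scale 2**s."""
    q_max = 2 ** (bits - 1) - 1                     # 7 for signed 4-bit
    q_min = -q_max - 1                              # -8 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    # Assumed rule: smallest s such that max_abs / 2**s still fits in q_max.
    s = math.ceil(math.log2(max_abs / q_max)) if max_abs > 0 else 0
    ints = [max(q_min, min(q_max, round(w / 2 ** s))) for w in weights]
    return ints, s


def dequantize(ints, s):
    """Recover weight values on chip as integer * 2**s."""
    return [q * 2.0 ** s for q in ints]


ints, s = quantize_layer([0.5, -0.25, 0.1, 0.875])
print(ints, s)              # [4, -2, 1, 7] -3
print(dequantize(ints, s))  # [0.5, -0.25, 0.125, 0.875]
```

Because the scale is a power of two, the multiplication by $2^{s}$ reduces to a bit shift in fixed-point hardware, which is one reason a shared power-of-two factor is convenient.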

TABLE I: Performance comparisons on the DvsGesture dataset.

Architecture	Input Resolution	Param	Effective MACs (Single Timebin)	Accuracy [%] (mean $\pm$ std)	Accuracy [%] (Maximum)
LSTM [20]	$32\times 32$	7.4M	3.9M	–	86.8
AlexNet+LSTM [21]	$128\times 128$	8.3M	601.3M	–	97.7
CNN+EGRU [13]	$128\times 128$	4.8M	80.6M	97.3 $\pm$ 0.4	97.8
ConvLIAF [22]	$32\times 32$	0.22M	113.3M	–	97.6
TRIP (Ours)	$16\times 16$ + $12\times 12$	0.46M	1.75M	97.6 $\pm$ 0.5	98.6

IV Experiment and Results

We benchmarked the performance of TRIP on challenging event-based classification datasets and with the SENECA neuromorphic processor. Firstly, we experimented on the widely used DvsGesture dataset [11] to demonstrate the effectiveness of TRIP in terms of accuracy, model size, and algorithmic computing cost. Secondly, we used the Marshalling Signals gesture recognition dataset [12] to evaluate TRIP’s robustness towards samples at varying distances from the event-based camera. Our results demonstrate that the ROI prediction is dynamically adaptable to varying distances. Thirdly, we synthetically generated a noisy, high-resolution event-based dataset based on N-MNIST [23] with digits in varying sizes and locations. Using this dataset, we validated the effectiveness and overhead of TRIP compared to baselines with similar cost or accuracy. Finally, we implemented our hardware-optimized TRIP algorithm on the SENECA neuromorphic processor and measured the energy, latency, and effective area of the solution.

IV-A DvsGesture Dataset

Gesture recognition is an ideal task for evaluating approaches with hard attention, as a compact region of the input can provide sufficient information for classification. The DvsGesture dataset [11] enables us to compare our method with other state-of-the-art solutions on the task of gesture recognition with an event-based camera.

IV-A1 Dataset and Network Overview

The dataset is recorded using the DVS128 event-based camera with $128\times 128$ resolution. It consists of 11 gesture classes in 1176 training sessions and 288 testing sessions. Each session includes a subject repeatedly performing the same gesture. We preprocessed each session using SpikingJelly [24] into an event sample of 32 timebins, in which the events from the same pixel location are accumulated together within each timebin. We performed data augmentation during training to randomly scale, rotate, and spatially shift training samples. We downsampled the input resolution to $16\times 16$ for ROI prediction. The ROI prediction network comprises three convolutional layers, a ReLU recurrent layer, and an output layer. We used $12\times 12$ tGKs to generate the ROI input for classification. The classification network comprises two convolutional layers, a fully-connected hidden layer, and an output layer.

IV-A2 Results

We compared accuracies, number of parameters, and effective MAC operations with other state-of-the-art methods in Table I. The effective MAC counts the averaged non-zero multiply-accumulate operations within all components of TRIP for processing a single timebin of the event stream, reflecting the computing cost on event-driven neuromorphic processors. Our TRIP framework achieves state-of-the-art accuracy while reducing the effective MAC by $46\times$ compared to the lowest among the other state-of-the-art approaches. TRIP achieves tremendous computational efficiency gains through two key differentiators: firstly, by operating on considerably lower input resolutions compared to other CNN-based methods, and secondly, by utilizing less complex network architectures while processing a reduced input space with less irrelevant information.
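As a rough illustration of the metric, effective MACs can be approximated by multiplying the number of non-zero input activations of each layer by the MACs that one activation event triggers. The sketch below uses hypothetical layer figures and a simplified per-event cost (kernel height × kernel width × output channels, ignoring borders and stride); only the two Table I values in the last lines come from the paper.

```python
def effective_macs(nonzero_inputs, macs_per_event):
    """Sum over layers of (non-zero input activations) x (MACs triggered per event)."""
    return sum(n * m for n, m in zip(nonzero_inputs, macs_per_event))


# Hypothetical two-layer example: 3x3 kernels with 16 and 32 output channels.
per_event = [3 * 3 * 16, 3 * 3 * 32]
print(effective_macs([100, 40], per_event))  # 25920

# Ratio of effective MACs reported in Table I: CNN+EGRU [13] vs. TRIP.
ratio = 80.6e6 / 1.75e6
print(round(ratio, 1))  # 46.1
```

Under this counting, both lower input resolution and higher activation sparsity reduce the first factor, which is consistent with the two sources of efficiency named above.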

We visualized the receptive fields used for generating ROIs for classification in Figure 3. The visualization helps interpret the decision process of TRIP and further explains the reason behind TRIP’s efficiency advantage. By visually inspecting the samples, we can see that the ROI prediction network learns to track the gestures intelligently and focus on salient regions of the input space. For example, in the “left hand clockwise” gesture, the ROI’s receptive field tracks the arm’s movement, making it easier for the classification network to make a decision.

IV-B Marshalling Signals Dataset

The Marshalling Signals dataset [12] is more recent, less explored, and more difficult than DvsGesture. The dataset presents gestures at multiple distances from the event-based camera. Therefore, it allows us to further test the ROI prediction, particularly its ability to adjust to varying sizes.

IV-B1 Dataset and Network Overview

The Marshalling Signals dataset [12] is recorded using the DAVIS 346 event-based camera with $346\times 224$ resolution. It contains 10 gesture classes in 11,040 training samples and 930 testing samples. Each sample is one gesture presented in a 960 ms timebin. Each gesture is presented at 8 evenly spaced distances from the camera ranging from 1.5 m to 4.5 m. We adopt the same network architectures as in the DvsGesture task, with a higher-dimensional ReLU recurrent layer in the ROI prediction network. We downsampled the input resolution to $43\times 28$ for ROI prediction and used $12\times 12$ tGKs for ROI generation.

IV-B2 Results

We compared the performance of our model with the previous results in Table II. Since [12] uses regular CNN architectures, we used FLOPs as an efficiency metric, without considering the activation sparsities in our event-based CNNs. Our TRIP framework achieves better accuracy while reducing the FLOPs by $18\times$ compared to EfficientNet [25]. By visualizing ROI’s receptive fields for different distances in Figure 4, we show that the ROI prediction can adjust the ROI size for classification to include only the relevant region of the input space.

TABLE II: Performance comparisons on the Marshalling Signals dataset.

Architecture	Param	FLOPs	Accuracy [%]
ResNet18 [12]	11.7M	1810M	74.6
EfficientNet-B1 [12]	7.794M	690M	82.6
TRIP (Ours)	4.13M	37.0M	83.6

IV-C Synthetic Dataset based on N-MNIST

To study the effects of the reduced input resolutions and the hard attention overheads in a controlled setup, we synthetically generated a dataset based on the N-MNIST dataset [23]. The generated dataset enables us to test the performance of TRIP under different input resolutions and structured event noises.

IV-C1 Dataset Generation and Network Overview

We generated the synthetic N-MNIST dataset by placing randomly scaled event streams of $34\times 34$ resolution N-MNIST digits at arbitrary locations on a $128\times 128$ canvas. The scaling factor for each sample is randomly selected between 1 and 2. We add structured event noise by randomly selecting 8 other digits from the dataset, cropping a random $8\times 8$ subsection of each digit, and placing the subsections at random locations on the canvas. Figure 5 shows some examples of the generated samples. The synthetic dataset has the same number of samples as the original N-MNIST dataset, including 60,000 training and 10,000 testing samples.
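The generation procedure can be sketched as follows. This is a simplified illustration, not the script used for the experiments: it operates on a single accumulated frame of one polarity rather than a full event stream, uses nearest-neighbour scaling (the paper does not state the scaling method), and all function names are hypothetical.

```python
import random


def scale_nearest(frame, factor):
    """Nearest-neighbour rescaling of a square frame (assumed scaling method)."""
    size = len(frame)
    out = max(1, round(size * factor))
    return [[frame[min(size - 1, int(r / factor))][min(size - 1, int(c / factor))]
             for c in range(out)] for r in range(out)]


def paste(canvas, patch, top, left):
    """Accumulate a patch onto the canvas at the given offset."""
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            canvas[top + r][left + c] += value


def make_sample(digit, noise_digits, rng, canvas_size=128, patch=8):
    """Place one scaled digit plus 8x8 crops of other digits on an empty canvas."""
    canvas = [[0] * canvas_size for _ in range(canvas_size)]
    scaled = scale_nearest(digit, rng.uniform(1.0, 2.0))
    limit = canvas_size - len(scaled)
    paste(canvas, scaled, rng.randint(0, limit), rng.randint(0, limit))
    for other in noise_digits:
        top = rng.randint(0, len(other) - patch)
        left = rng.randint(0, len(other) - patch)
        crop = [row[left:left + patch] for row in other[top:top + patch]]
        paste(canvas, crop,
              rng.randint(0, canvas_size - patch),
              rng.randint(0, canvas_size - patch))
    return canvas


rng = random.Random(0)
ones = [[1] * 34 for _ in range(34)]   # stand-in for a 34x34 N-MNIST frame
canvas = make_sample(ones, [ones] * 8, rng)
print(len(canvas), len(canvas[0]))     # 128 128
```

Since the noise crops come from real digits, they share local stroke features with the target, which is what makes the noise "structured" rather than random.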

We used the same network architectures in TRIP as the DvsGesture task but with only 2 convolutional layers for the ROI prediction network. We used $12\times 12$ tGKs for ROI generation. The baseline networks have the same number of layers as TRIP, which comprises 4 convolutional layers, a ReLU recurrent layer, a fully-connected layer, and an output layer. We tested different input resolutions for the baselines and TRIP’s ROI prediction, including $16\times 16$ , $32\times 32$ , and $64\times 64$ . The baseline networks have varying layer dimensions based on the input resolution.

IV-C2 Results

We compared the performance of our model with the baseline models on different input resolutions in Table III. Compared with the baseline networks using one level higher input resolution ( $16\times 16\rightarrow 32\times 32$ and $32\times 32\rightarrow 64\times 64$ ), TRIP achieves higher or similar accuracies with reductions in FLOPs. This shows that the low input resolution required by TRIP to maintain high accuracy compensates for the hard attention overheads introduced by the ROI prediction and generation. Moreover, the visualization results in Figure 5 show that the ROI prediction network can handle inputs with structured noise that shares similar features with the digits and is hard to differentiate at low resolution.

TABLE III: Performance comparisons on synthetic N-MNIST dataset.

Architecture	Param	FLOPs	Accuracy [%] (mean $\pm$ std)
Baseline (16x16)	0.31M	6.0M	71.8 $\pm$ 2.3
Baseline (32x32)	0.67M	24.4M	93.0 $\pm$ 0.6
Baseline (64x64)	0.67M	57.4M	96.2 $\pm$ 0.9
TRIP (16x16)	0.30M	16.0M	95.4 $\pm$ 0.4
TRIP (32x32)	0.65M	28.0M	96.1 $\pm$ 0.3

IV-D Neuromorphic Processor Deployment

TABLE IV: Comparison with state-of-the-art neuromorphic implementations on the DvsGesture dataset.

Hardware	Solution	Technology	Cores [#]	Area [mm²]	Latency [ms] (Single Timebin)	$E_{inf}$ [µJ] (Single Timebin)	Accuracy [%] (Single Timebin)	Latency [ms] (Multiple Timebins)	$E_{inf}$ [µJ] (Multiple Timebins)	Accuracy [%] (Multiple Timebins)
Loihi [26]	Spiking CNN [3]	Intel 14 nm	$>20$	$>8.20$	11	–	89.6	–	–	–
Loihi [26]	Spiking CNN [14]	Intel 14 nm	59	24.19	–	–	–	22.0	2731	96.2
TrueNorth [27]	Spiking CNN [11]	Samsung 28 nm	3838	383.8	–	–	91.8	104.6	18702	94.6
SENECA [10]	Event-based CNN	GF FDX 22 nm	7	3.29	–	–	–	78.9	1069.2	97.3
SENECA [10]	TRIP	GF FDX 22 nm	9	4.23	2.7	35.86	91.1	25.8	430.32	98.3

To accurately assess the hardware efficiency of TRIP, we implemented our hardware-optimized TRIP algorithm on the SENECA neuromorphic processor [10]. To compare with state-of-the-art neuromorphic solutions on event-based vision, we used the DvsGesture dataset [11] to benchmark the performance of our solution in terms of accuracy, latency, energy consumption, and hardware’s effective area.

IV-D1 Hardware-optimized TRIP

The hardware-aware optimizations for TRIP include sparsity-aware training on event-based CNNs, quantization-aware training to get network parameters in low precision, and utilizing DAP for ROI generation. The optimization process comprises three steps. First, we performed sparsity-aware training on pre-trained networks in TRIP to reduce the number of activations in event-based CNNs. Second, we conducted quantization-aware training on the ROI prediction network. The incremental quantization-aware training iteratively quantizes and trains each layer with the straight-through gradient estimator. The training freezes optimally quantized layers and trains the remaining layers. Third, we substituted the truncated Gaussian kernels with DAP and fine-tuned the classification network with incremental quantization-aware training. We quantized all network parameters to 4-bit. The hardware-aware optimizations have minimal influence on accuracy, achieving 98.3% accuracy on the best model for the DvsGesture dataset, only 0.3% reduction compared to the best model without hardware-aware optimizations.

IV-D2 Hardware Implementation and Benchmarking

The hardware-optimized TRIP algorithm is implemented on 9 SENECA cores. The ROI prediction network is mapped onto 4 cores, including 3 cores for 3 convolutional layers (C1, C2, C3) and one core fusing the ReLU recurrent layer and output layer (C4). The ROI generation with DAP uses a single core (C5). The classification network is mapped onto 4 cores, including 2 cores for 2 convolutional layers (C6, C7), one core for the fully-connected layer (C8), and one core for the output layer (C9). Our convolutional layer implementation on SENECA adopts event-driven depth-first convolution [2] and fuses Convolution, BatchNorm, MaxPool, and ReLU on a single core. To improve latency, we parallelized the three components of TRIP on SENECA as shown in Figure 6. Therefore, the classification is performed on the ROI prediction network’s output based on previous timebin inputs. We experimentally observed that this approach does not affect the test accuracy. To measure the improvement on SENECA, we created an event-based CNN baseline with a similar number of parameters to the two networks in TRIP combined but with a higher input resolution ( $32\times 32$ ).
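The one-timebin offset between ROI prediction and classification can be emulated in software with a simple loop. The sketch below is only a sequential model of that data dependency, not the SENECA implementation: the callables, the `initial_roi` argument (for example, the image center), and the toy stand-in functions are hypothetical.

```python
def run_pipeline(timebins, predict_roi, generate_roi, classify, initial_roi):
    """Classify each timebin using ROI parameters predicted from earlier timebins."""
    roi_params = initial_roi
    outputs = []
    for events in timebins:
        # Classification path: uses the ROI predicted before this timebin arrived.
        outputs.append(classify(generate_roi(events, roi_params)))
        # Prediction path: runs concurrently on hardware; its result is used next.
        roi_params = predict_roi(events)
    return outputs


# Toy stand-ins that make the delay visible: the "ROI" is the sum of the events.
outputs = run_pipeline(
    timebins=[[1], [2], [3]],
    predict_roi=lambda events: sum(events),
    generate_roi=lambda events, roi: (events, roi),
    classify=lambda roi_input: roi_input[1],
    initial_roi=0,
)
print(outputs)  # [0, 1, 2]
```

Because neither path waits for the other within a timebin, the latency per timebin is set by the slower of the two paths rather than by their sum.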

IV-D3 Results

We benchmarked TRIP on SENECA against state-of-the-art neuromorphic solutions [3, 11, 14] in Table IV. We present single- and multiple-timebin results to enable the comparison with multiple solutions. Single-timebin results derive from inferring one timebin of the input event stream, while multiple-timebin results reflect maximum accuracy. Our solution outperforms the others in nearly every metric of the benchmark. Notably, we achieved $46\times$ energy efficiency and $88\times$ area improvements compared to the spiking CNN solution on TrueNorth. One Loihi-based solution shows better latency than ours. However, Loihi’s high parallelism results in large areas that are unscalable for high-resolution inputs. Additionally, TRIP halves the error rate and improves energy consumption. Moreover, TRIP improves latency and energy by more than $2\times$ compared to our baseline on SENECA. Since the area is computed by multiplying the number of cores used by the area per core, TRIP has a higher area cost because it uses two more SENECA cores than the baseline.
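The SENECA-to-SENECA comparison can be checked directly from the two SENECA rows of Table IV (multiple-timebin columns). The short calculation below uses only those table values; the dictionary layout and variable names are illustrative.

```python
# Values copied from the two SENECA rows of Table IV (multiple timebins).
baseline = {"cores": 7, "area_mm2": 3.29, "latency_ms": 78.9, "energy_uj": 1069.2}
trip = {"cores": 9, "area_mm2": 4.23, "latency_ms": 25.8, "energy_uj": 430.32}

latency_gain = baseline["latency_ms"] / trip["latency_ms"]
energy_gain = baseline["energy_uj"] / trip["energy_uj"]
area_per_core = [row["area_mm2"] / row["cores"] for row in (baseline, trip)]

print(round(latency_gain, 2), round(energy_gain, 2))  # 3.06 2.48
print([round(a, 2) for a in area_per_core])           # [0.47, 0.47]
```

Both rows give the same area per core, so the area difference between the two solutions follows entirely from the two extra cores used by TRIP.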

To further analyze the behavior of TRIP on the SENECA neuromorphic processor, we visualized the details of each SENECA core in Figure 6, including memory costs, energy costs, and processing durations. The depth-first CNNs make it possible to parallelize the layers in a pipelined fashion within a single timebin, reducing the inference latency. Moreover, by parallelizing the three TRIP components, the ROI prediction and generation introduce much less latency overhead than sequential processing.

V Conclusion and Discussion

This paper presented TRIP, a hard attention framework for efficient event-based vision processing on the neuromorphic processor. The method achieved state-of-the-art classification accuracy in multiple event-based datasets while exhibiting consistent efficiency improvements in both algorithmic analysis and actual hardware implementation. This demonstrates TRIP’s effectiveness in connecting hard attention with event-based vision and neuromorphic computing, offering a viable solution for efficient and low-cost high-resolution visual processing on neuromorphic processors.

A recent trend in sensing technologies involves the seamless integration of sensing and processing hardware to create ultra-efficient sensors with inherent in-sensor and near-sensor processing capabilities [28, 29]. Employing a similar concept, TRIP can trigger new hardware designs of event-based vision sensors. The ROI prediction network can be accelerated inside the sensory chip for near-sensor processing. The tGK and DAP for ROI generation naturally transform into hardware circuits next to the sensory array of the event-based camera. The specialized hardware design will increase efficiency and vastly decrease the data bandwidth required by the event stream, relieving the communication burden of downstream neuromorphic processors.

We showed here that TRIP performs efficient classification by actively focusing on regions of the input space. The framework naturally extends to event-based vision tasks with higher resolutions than those we have showcased. Emerging high-resolution event-based cameras, like the 1-megapixel Prophesee camera [30], exhibit promising capabilities for addressing complex applications while requiring substantial processing. Our proposed neuromorphic hard attention solution emerges as a compelling alternative to conventional CNN solutions when aiming for an end-to-end event-based vision system tailored for edge applications.

References

[1] J. Yik, S. H. Ahmed, Z. Ahmed, B. Anderson, A. G. Andreou, C. Bartolozzi, A. Basu, D. d. Blanken, P. Bogdan, S. Bohte et al., “Neurobench: Advancing neuromorphic computing through collaborative, fair and representative benchmarking,” arXiv preprint arXiv:2304.04640, 2023.
[2] Y. Xu, K. Shidqi, G.-J. van Schaik, R. Bilgic, A. Dobrita, S. Wang, R. Meijer, P. Nembhani, C. Arjmand, P. Martinello, A. Gebregiogis, S. Hamdioui, P. Detterer, S. Traferro, M. Konijnenburg, K. Vadivel, M. Sifalakis, G. Tang, and A. Yousefzadeh, “Optimizing event-based neural networks on digital neuromorphic architecture: A comprehensive design space exploration,” Frontiers in Neuroscience, vol. 18, 2024.
[3] B. Rueckauer, C. Bybee, R. Goettsche, Y. Singh, J. Mishra, and A. Wild, “Nxtf: An api and compiler for deep spiking neural networks on intel loihi,” ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 18, no. 3, pp. 1–22, 2022.
[4] H. Larochelle and G. E. Hinton, “Learning to combine foveal glimpses with a third-order boltzmann machine,” in Advances in Neural Information Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., vol. 23. Curran Associates, Inc., 2010.
[5] V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual attention,” Advances in neural information processing systems, 2014. [Online]. Available: https://arxiv.longhoe.net/pdf/1412.7755.pdf
[6] Y. Chai, “Patchwork: A patch-wise attention network for efficient object detection and segmentation in video streams,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3415–3424.
[7] G. Elsayed, S. Kornblith, and Q. V. Le, “Saccader: Improving accuracy of hard attention models for vision,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[8] G. Cohen, S. Afshar, G. Orchard, J. Tapson, R. Benosman, and A. van Schaik, “Spatial and temporal downsampling in event-based visual classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 5030–5044, 2018.
[9] G. Tang, A. Safa, K. Shidqi, P. Detterer, S. Traferro, M. Konijnenburg, M. Sifalakis, G.-J. van Schaik, and A. Yousefzadeh, “Open the box of digital neuromorphic processor: Towards effective algorithm-hardware co-design,” in 2023 IEEE International Symposium on Circuits and Systems (ISCAS), 2023, pp. 1–5.
[10] G. Tang, K. Vadivel, Y. Xu, R. Bilgic, K. Shidqi, P. Detterer, S. Traferro, M. Konijnenburg, M. Sifalakis, G.-J. van Schaik et al., “Seneca: building a fully digital neuromorphic processor, design trade-offs and challenges,” Frontiers in Neuroscience, vol. 17, 2023.
[11] A. Amir, B. Taba, D. Berg, T. Melano, J. McKinstry, C. Di Nolfo, T. Nayak, A. Andreopoulos, G. Garreau, M. Mendoza et al., “A low power, fully event-based gesture recognition system,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7243–7252.
[12] L. Müller, M. Sifalakis, S. Eissa, A. Yousefzadeh, P. Detterer, S. Stuijk, and F. Corradi, “Aircraft marshaling signals dataset of fmcw radar and event-based camera for sensor fusion,” in 2023 IEEE Radar Conference (RadarConf23). IEEE, 2023, pp. 01–06.
[13] A. Subramoney, K. K. Nazeer, M. Schöne, C. Mayr, and D. Kappel, “Efficient recurrent architectures through activity sparsity and sparse back-propagation through time,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=lJdOlWg8td
[14] R. Massa, A. Marchisio, M. Martina, and M. Shafique, “An efficient spiking neural network for recognizing gestures with a dvs camera on the loihi neuromorphic processor,” in 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020, pp. 1–9.
[15] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
[16] F. Kong and R. Henao, “Efficient classification of very large images with tiny objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2384–2394.
[17] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra, “Draw: A recurrent neural network for image generation,” in International conference on machine learning. PMLR, 2015, pp. 1462–1471.
[18] M. Cannici, M. Ciccone, A. Romanoni, and M. Matteucci, “Attention mechanisms for object recognition with event-based cameras,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 1127–1136.
[19] Z. Zhu, A. Pourtaherian, L. Waeijen, E. Bondarev, and O. Moreira, “Star: Sparse thresholded activation under partial-regularization for activation sparsity exploration,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2023, pp. 4554–4563.
[20] W. He, Y. Wu, L. Deng, G. Li, H. Wang, Y. Tian, W. Ding, W. Wang, and Y. Xie, “Comparing snns and rnns on neuromorphic vision datasets: Similarities and differences,” Neural Networks, vol. 132, pp. 108–120, 2020.
[21] S. U. Innocenti, F. Becattini, F. Pernici, and A. Del Bimbo, “Temporal binary representation for event-based action recognition,” in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 10 426–10 432.
[22] Z. Wu, H. Zhang, Y. Lin, G. Li, M. Wang, and Y. Tang, “Liaf-net: Leaky integrate and analog fire network for lightweight and efficient spatiotemporal information processing,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6249–6262, 2022.
[23] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor, “Converting static image datasets to spiking neuromorphic datasets using saccades,” Frontiers in Neuroscience, vol. 9, p. 437, 2015.
[24] W. Fang, Y. Chen, J. Ding, Z. Yu, T. Masquelier, D. Chen, L. Huang, H. Zhou, G. Li, and Y. Tian, “Spikingjelly: An open-source machine learning infrastructure platform for spike-based intelligence,” Science Advances, vol. 9, no. 40, p. eadi1480, 2023.
[25] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning. PMLR, 2019, pp. 6105–6114.
[26] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain et al., “Loihi: A neuromorphic manycore processor with on-chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.
[27] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam et al., “Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, 2015.
[28] F. Zhou and Y. Chai, “Near-sensor and in-sensor computing,” Nature Electronics, vol. 3, no. 11, pp. 664–671, 2020.
[29] Y. Zhou, J. Fu, Z. Chen, F. Zhuge, Y. Wang, J. Yan, S. Ma, L. Xu, H. Yuan, M. Chan et al., “Computational event-driven vision sensors for in-sensor spiking neural networks,” Nature Electronics, pp. 1–9, 2023.
[30] E. Perot, P. De Tournemire, D. Nitti, J. Masci, and A. Sironi, “Learning to detect objects with a 1 megapixel event camera,” Advances in Neural Information Processing Systems, vol. 33, pp. 16 639–16 652, 2020.

Appendix A Experiment Details

A-A DVS Gesture Dataset

The following section details the network architecture and training parameters used in the DVS Gesture experiments.

A-A1 Network Details

Table V shows the sequentially ordered layers from input to output of the ROI prediction network used with the DVS Gesture dataset. Similarly, Table VI shows the layer information of the classification network. Every batch normalization layer is followed by a ReLU activation function.

TABLE V: Layer overview of ROI prediction net, for TRIP network used in DVS Gesture results.

Layer	Input dimension	Channel dimensions	Kernel size	Padding (no. cells)	Number of neurons
Maxpooling	128 $\times$ 128	2	8	0	–
Convolutional	16 $\times$ 16	2, 32	3	1	–
Maxpooling	16 $\times$ 16	32	2	0	–
Batch Normalization	8 $\times$ 8	32	–	–	–
Convolutional	8 $\times$ 8	32, 64	3	1	–
Maxpooling	8 $\times$ 8	64	2	0	–
Batch Normalization	4 $\times$ 4	64	–	–	–
Convolutional	4 $\times$ 4	64, 128	3	1	–
Maxpooling	4 $\times$ 4	128	2	0	–
Batch Normalization	2 $\times$ 2	128	–	–	–
ReLU RNN	512	–	–	–	256
Fully Connected	256	–	–	–	3

TABLE VI: Layer overview of classification net, for TRIP network used in DVS Gesture results.

Layer	Input dimension	Channel dimensions	Kernel size	Padding (no. cells)	Number of neurons
Convolutional	12 $\times$ 12	2, 32	3	1	–
Maxpooling	12 $\times$ 12	32	2	0	–
Batch Normalization	6 $\times$ 6	32	–	–	–
Convolutional	6 $\times$ 6	32, 64	3	1	–
Maxpooling	6 $\times$ 6	64	2	0	–
Batch Normalization	3 $\times$ 3	64	–	–	–
Fully Connected	576	–	–	–	256
Fully Connected	256	–	–	–	11
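
The layer dimensions in Tables V and VI can be cross-checked with a short shape trace. The sketch below is illustrative only: the helper names are hypothetical, and it assumes stride-1 convolutions, pooling with a stride equal to the kernel size, and floor rounding, none of which is stated explicitly for these tables.

```python
def out_size(size, kernel, padding=0, stride=1):
    """Output length along one axis, using floor rounding."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(height, width, channels, layers):
    """Trace (height, width, channels) through a list of layer specs.

    Specs are ("conv", out_channels, kernel, padding) with stride 1, or
    ("pool", kernel) with stride equal to the kernel size (an assumption).
    """
    for spec in layers:
        if spec[0] == "conv":
            _, channels, kernel, padding = spec
            height = out_size(height, kernel, padding)
            width = out_size(width, kernel, padding)
        elif spec[0] == "pool":
            _, kernel = spec
            height = out_size(height, kernel, stride=kernel)
            width = out_size(width, kernel, stride=kernel)
        else:
            raise ValueError(f"unknown layer kind: {spec[0]}")
    return channels * height * width


# Table V (ROI prediction network); batch normalization and ReLU keep shapes unchanged.
ROI_NET = [("pool", 8), ("conv", 32, 3, 1), ("pool", 2),
           ("conv", 64, 3, 1), ("pool", 2), ("conv", 128, 3, 1), ("pool", 2)]
# Table VI (classification network), which starts from a 12 x 12 ROI.
CLS_NET = [("conv", 32, 3, 1), ("pool", 2), ("conv", 64, 3, 1), ("pool", 2)]

print(flattened_features(128, 128, 2, ROI_NET))  # size fed to the ReLU RNN
print(flattened_features(12, 12, 2, CLS_NET))    # size fed to the first FC layer
```

Under these assumptions the two calls give 512 and 576, matching the ReLU RNN input and the first fully connected input in Tables V and VI. Applying the same `ROI_NET` specification to a $346\times 224$ input gives 1920, in line with Table VIII.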

A-A2 Training details

Table VII lists the training hyperparameters used in the DVS Gesture experiments. The accuracy reported in the paper is the mean of the best accuracies obtained from five separate experiments with five random parameter initializations. The training dataset is augmented using the torchvision transforms package. The data augmentation randomly scales samples with a scaling factor between 0.6 and 1.0, applies a random perspective transformation with a distortion parameter of 0.5, and randomly rotates samples by an angle between 0 and 25 degrees. SpikingJelly preprocessing is used to split samples by frame into 32 timebins.

TABLE VII: Training parameters used in DVS Gesture results.

Parameter	Value
Learning rate	0.0001
Training dataset size	1176
Test dataset size	288
Training batch size	32
Testing batch size	32
Number of epochs	1000
Optimizer	Adam
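
To make the timebin preprocessing concrete, the following sketch accumulates a time-sorted event list into a fixed number of two-channel frames. It is a hypothetical stand-in and not the SpikingJelly API: the function name is invented for illustration, and the equal-event-count splitting rule is an assumption, since the text does not state whether samples are split by event count or by time.

```python
def split_events_into_frames(events, num_bins, height, width):
    """Accumulate time-sorted (x, y, polarity) events into `num_bins` frames.

    Assumption: every bin receives an (almost) equal number of events.
    Returns a nested list indexed as frames[bin][polarity][y][x] holding
    per-pixel event counts, with polarity in {0, 1}.
    """
    frames = [[[[0] * width for _ in range(height)] for _ in range(2)]
              for _ in range(num_bins)]
    total = len(events)
    for index, (x, y, polarity) in enumerate(events):
        bin_index = index * num_bins // total  # always in [0, num_bins)
        frames[bin_index][polarity][y][x] += 1
    return frames


# Tiny made-up example: four events on a 2 x 2 sensor, split into two timebins.
toy_events = [(0, 0, 1), (1, 0, 0), (1, 1, 1), (0, 1, 0)]
toy_frames = split_events_into_frames(toy_events, num_bins=2, height=2, width=2)
```

For the DVS Gesture setup described above, the corresponding call would use `num_bins=32` and `height=width=128`, producing one $2\times 128\times 128$ frame per timebin.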

A-B Marshalling Signals Dataset

On the Marshalling Signals dataset, we employ a residual connection that concatenates the RNN output to the input of the first FC layer in the classification network for improved learning stability and accuracy. The detailed parameters of the sequentially ordered layers in the ROI prediction network and in the classification network are listed in Tables VIII and IX, respectively. Every batch normalization layer is followed by a ReLU activation function.

A-B1 Network Details

TABLE VIII: Layer overview of ROI prediction net, for TRIP network used in Marshalling Signals results.

Layer	Input dimension	Channel dimensions	Kernel size	Padding (no. cells)	Number of neurons
Maxpooling	346 $\times$ 224	2	8	0	–
Convolutional	43 $\times$ 28	2, 32	3	1	–
Maxpooling	43 $\times$ 28	32	2	0	–
Batch Normalization	21 $\times$ 14	32	–	–	–
Convolutional	21 $\times$ 14	32, 64	3	1	–
Maxpooling	21 $\times$ 14	64	2	0	–
Batch Normalization	10 $\times$ 7	64	–	–	–
Convolutional	10 $\times$ 7	64, 128	3	1	–
Maxpooling	10 $\times$ 7	128	2	0	–
Batch Normalization	5 $\times$ 3	128	–	–	–
ReLU RNN	1920	–	–	–	512
Fully Connected	512	–	–	–	3

TABLE IX: Layer overview of classification net, for TRIP network used in Marshalling Signals results.

Layer	Input dimension	Channel dimensions	Kernel size	Padding (no. cells)	Number of neurons
Convolutional	12 $\times$ 12	2, 32	3	1	–
Maxpooling	12 $\times$ 12	32	2	0	–
Batch Normalization	6 $\times$ 6	32	–	–	–
Convolutional	6 $\times$ 6	32, 64	3	1	–
Maxpooling	6 $\times$ 6	64	2	0	–
Batch Normalization	3 $\times$ 3	64	–	–	–
Fully Connected	576	–	–	–	256
Fully Connected	256	–	–	–	11

A-B2 Training details

The training hyperparameters used for the Marshalling Signals dataset are listed in Table X. The accuracy reported in the paper is the best accuracy obtained during a single experiment. The training dataset is augmented using the torchvision transforms package. The data augmentation randomly scales samples with a scaling factor between 0.6 and 1.0, applies a random perspective transformation with a distortion parameter of 0.5, and randomly rotates samples by an angle between 0 and 25 degrees.

TABLE X: Training parameters used in Marshalling Signals results.

Parameter	Value
Learning rate	0.001
Training dataset size	11,040
Test dataset size	930
Training batch size	128
Testing batch size	128
Number of epochs	1000
Optimizer	Adam

A-C Synthetic Dataset Based on N-MNIST

A-C1 Network Details

The layer-by-layer network details of the ROI prediction network and classification network used for the synthetic dataset based on N-MNIST are listed in Table XI and Table XII, respectively. Every convolutional layer is followed by a ReLU activation function. The maxpooling layers in the classification network use a stride of 1. The values shown in parentheses refer to the $32\times 32$ input size TRIP: it uses a kernel size of 4 (instead of 8 in the $16\times 16$ input size TRIP) in the initial downsampling maxpooling layer, and has an input size of 1600 (instead of 256 in the $16\times 16$ input size TRIP) in the ReLU RNN. The remaining network parameters are the same for both the $32\times 32$ and $16\times 16$ input size TRIP.

TABLE XI: Layer overview of ROI prediction net, for TRIP network used in synthetic N-MNIST-based dataset results.

Layer	Input dimension	Channel dimensions	Kernel size	Padding (no. cells)	Number of neurons
Maxpooling	128 $\times$ 128	2	8 (4)	0	–
Convolutional	16 $\times$ 16	2, 32	5	1	–
Maxpooling	14 $\times$ 14	32	2	0	–
Convolutional	7 $\times$ 7	32, 64	5	1	–
Maxpooling	5 $\times$ 5	64	2	0	–
ReLU RNN	256 (1600)	–	–	–	256
Fully Connected	256	–	–	–	3

TABLE XII: Layer overview of classification net, for TRIP network used in synthetic N-MNIST-based dataset results.

Layer	Input dimension	Channel dimensions	Kernel size	Padding (no. cells)	Number of neurons
Convolutional	12 $\times$ 12	2, 32	5	1	–
Maxpooling	10 $\times$ 10	32	2	0	–
Convolutional	9 $\times$ 9	32, 64	5	1	–
Maxpooling	7 $\times$ 7	64	2	0	–
Fully Connected	2304	–	–	–	10
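
As a worked check of Table XII, the input dimensions follow from the standard output-size relation for a window of size $k$, padding $p$, and stride $s$. Floor rounding and stride-1 convolutions are assumptions here; only the stride-1 maxpooling is stated in the text.

$$
n_{\text{out}} = \left\lfloor \frac{n_{\text{in}} + 2p - k}{s} \right\rfloor + 1
$$

$$
12 \xrightarrow{\text{conv},\ k=5,\ p=1} 10 \xrightarrow{\text{pool},\ k=2,\ s=1} 9 \xrightarrow{\text{conv},\ k=5,\ p=1} 7 \xrightarrow{\text{pool},\ k=2,\ s=1} 6, \qquad 64 \cdot 6 \cdot 6 = 2304
$$

The final product matches the input size of the fully connected layer in Table XII.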

The network details of the $16\times 16$ input baseline network used in the experiments with the synthetic dataset based on N-MNIST are listed in Table XIII. The details of the $32\times 32$ baseline are listed in Table XIV, and the details of the $64\times 64$ baseline are listed in Table XV. Every convolutional layer is followed by a ReLU activation function.

TABLE XIII: Layer overview of $16\times 16$ input resolution baseline network used in synthetic N-MNIST-based dataset results.

Layer	Input dimension	Channel dimensions	Kernel size	Padding (no. cells)	Number of neurons
Maxpooling	128 $\times$ 128	2	8	0	–
Convolutional	16 $\times$ 16	2, 16	5	2	–
Maxpooling	16 $\times$ 16	16	2	0	–
Convolutional	8 $\times$ 8	16, 32	5	2	–
Maxpooling	8 $\times$ 8	32	2	0	–
Convolutional	4 $\times$ 4	32, 64	5	2	–
Maxpooling	4 $\times$ 4	64	2	0	–
Convolutional	2 $\times$ 2	64, 128	5	2	–
Maxpooling	2 $\times$ 2	128	2	0	–
ReLU RNN	128	–	–	–	128
Fully Connected	128	–	–	–	64
Fully Connected	64	–	–	–	10

TABLE XIV: Layer overview of $32\times 32$ input resolution baseline network used in synthetic N-MNIST-based dataset results.

Layer	Input dimension	Channel dimensions	Kernel size	Padding (no. cells)	Number of neurons
Maxpooling	128 $\times$ 128	2	4	0	–
Convolutional	32 $\times$ 32	2, 16	5	2	–
Maxpooling	32 $\times$ 32	16	2	0	–
Convolutional	16 $\times$ 16	16, 32	5	2	–
Maxpooling	16 $\times$ 16	32	2	0	–
Convolutional	8 $\times$ 8	32, 64	5	2	–
Maxpooling	8 $\times$ 8	64	2	0	–
Convolutional	4 $\times$ 4	64, 128	5	2	–
Maxpooling	4 $\times$ 4	128	2	0	–
ReLU RNN	512	–	–	–	384
Fully Connected	384	–	–	–	128
Fully Connected	128	–	–	–	10

TABLE XV: Layer overview of $64\times 64$ input resolution baseline network used in synthetic N-MNIST-based dataset results.

Layer	Input dimension	Channel dimensions	Kernel size	Padding (no. cells)	Number of neurons
Maxpooling	128 $\times$ 128	2	2	0	–
Convolutional	64 $\times$ 64	2, 16	5	1	–
Maxpooling	62 $\times$ 62	16	2	0	–
Convolutional	31 $\times$ 31	16, 32	5	1	–
Maxpooling	29 $\times$ 29	32	2	0	–
Convolutional	15 $\times$ 15	32, 64	5	1	–
Maxpooling	13 $\times$ 13	64	2	0	–
Convolutional	6 $\times$ 6	64, 128	5	1	–
Maxpooling	4 $\times$ 4	128	2	0	–
ReLU RNN	512	–	–	–	384
Fully Connected	384	–	–	–	128
Fully Connected	128	–	–	–	10

A-C2 Training details

The training hyperparameters used in all of the experiments with the synthetic dataset based on N-MNIST are listed in Table XVI. The accuracies reported in the paper are the mean of the best accuracies obtained from five different experiments with five random parameter initializations.

TABLE XVI: Training parameters used in synthetic N-MNIST-based dataset results.

Parameter	Value
Learning rate	0.0006
Training dataset size	50,000
Validation dataset size	10,000
Test dataset size	10,000
Training batch size	32
Testing batch size	64
Number of epochs	5
Optimizer	Adam
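
The reporting rule described above (mean of the best accuracy per run) can be written down in a few lines. This is a minimal sketch: the function name is hypothetical, the numbers are made up for illustration, and it assumes that "best" refers to the highest test accuracy observed over the epochs of a run.

```python
from statistics import mean


def mean_of_best(runs):
    """Average the best accuracy of each run.

    `runs` holds one list of per-epoch accuracies per random initialization.
    """
    return mean(max(epoch_accuracies) for epoch_accuracies in runs)


# Made-up accuracies for two runs; the paper uses five runs.
example_runs = [[0.5, 0.9, 0.8], [0.7, 0.6]]
reported = mean_of_best(example_runs)
```

Note that this differs from the single-experiment best accuracy reported for the Marshalling Signals dataset.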

Appendix B HW Benchmarking details

B-A Baseline Network

Table XVII lists the layer-by-layer network parameters of the baseline network implemented on SENECA. Every batch normalization layer is followed by a ReLU activation function.

TABLE XVII: Layer overview of baseline network used in the SENECA benchmarking results.

Layer	Input dimension	Channel dimensions	Kernel size	Padding (no. cells)	Number of neurons
Maxpooling	128 $\times$ 128	2	4	0	–
Convolutional	32 $\times$ 32	2, 32	3	1	–
Maxpooling	32 $\times$ 32	32	2	0	–
Batch Normalization	16 $\times$ 16	32	–	–	–
Convolutional	16 $\times$ 16	32, 64	3	1	–
Maxpooling	16 $\times$ 16	64	2	0	–
Batch Normalization	8 $\times$ 8	64	–	–	–
Convolutional	8 $\times$ 8	64, 128	3	1	–
Maxpooling	8 $\times$ 8	128	2	0	–
Batch Normalization	4 $\times$ 4	128	–	–	–
Convolutional	4 $\times$ 4	128, 128	3	1	–
Maxpooling	4 $\times$ 4	128	2	0	–
Batch Normalization	2 $\times$ 2	128	–	–	–
Convolutional	2 $\times$ 2	128, 128	3	1	–
Maxpooling	2 $\times$ 2	128	2	0	–
Batch Normalization	1 $\times$ 1	128	–	–	–
ReLU RNN	128	–	–	–	256
Fully Connected	256	–	–	–	11

B-B Hardware Measurement and Comparison

All hardware-related measurements were performed in gate-level simulation using industry-standard ASIC simulation and power measurement tools (Cadence Xcelium and Cadence JOULES) for the GF-$22$nm FDX technology node (typical corner, $0.8$ V and $25\,^{\circ}$C, no back-biasing). The power results are accurate within 15% of signoff power and include the total power consumption of the chip, i.e., both dynamic and static power. The latency results are cycle-accurate with a design frequency of 500 MHz. As with the other compared results, we have not included the I/O power consumption and latency in the reported results. In the reference comparison with other chips, the Loihi energy results include only dynamic power, while the TrueNorth energy result includes the total power.
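
The relation between the simulated quantities can be sketched as follows. This is not the authors' measurement flow: the function names and the input values are hypothetical, and the sketch assumes that energy per inference is the average total power multiplied by the cycle-accurate latency at the stated 500 MHz design frequency.

```python
CLOCK_HZ = 500e6  # design frequency stated in the text


def latency_seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count from simulation into latency in seconds."""
    return cycles / clock_hz


def energy_joules(average_power_watts, cycles, clock_hz=CLOCK_HZ):
    """Energy per inference, assuming constant average (dynamic + static) power."""
    return average_power_watts * latency_seconds(cycles, clock_hz)


# Made-up example: 500,000 cycles at 10 mW average total power.
example_latency = latency_seconds(500_000)
example_energy = energy_joules(0.010, 500_000)
```

With these illustrative inputs the latency is 1 ms and the energy is 10 µJ; I/O power and latency are excluded, as in the reported results.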

Appendix C Sample Visualizations

C-A DVS Gesture

Figure 7 shows one test sample per gesture class in the DVS Gesture dataset. The first five sequentially ordered timebins of each sample are shown starting from the left, and the ROI receptive field is visualized as a yellow square superimposed on the image.

C-B Marshalling Signals

Example gestures from every class in the Marshalling Signals test dataset are visualized in Figures 8 and 9. The test dataset does not contain every possible combination of distance and gesture; every unique combination that occurs is shown in the figures. The distance labels indicate the distance, in centimeters, from the camera at which the gesture was recorded. The gestures with distance label “xxx” are samples from real-world output distribution data with unknown distance.

C-C Synthetic Dataset Based on N-MNIST

An example test sample from each digit class of the synthetic dataset based on N-MNIST is visualized in Figure 10, together with the ROI receptive field as a superimposed yellow square. The N-MNIST dataset is recorded in such a way that the digit disappears and reappears between timebins. The first timebins of a sample are empty, and the digit cannot be seen until it appears a few timebins later. Figure 10 shows that the ROI receptive field initially positions itself somewhere near the center and, as soon as the digit begins to appear, moves to the location of the digit. In the case of digit 6, a piece of structured noise appears before the digit, and the receptive field begins moving towards the noise. However, once the digit has appeared, the receptive field changes direction and moves towards the digit instead.