Fast Algorithms for Spiking Neural Network Simulation with FPGAs

Björn A. Lindqvist and Artur Podobas
KTH Royal Institute of Technology
Stockholm, Sweden
Abstract

Using OpenCL-based high-level synthesis, we create a number of spiking neural network (SNN) simulators for the Potjans-Diesmann cortical microcircuit for a high-end Field-Programmable Gate Array (FPGA). Our best simulators simulate the circuit 25% faster than real-time, require less than 21 nJ per synaptic event, and are bottle-necked by the device’s on-chip memory. Speed-wise they compare favorably to the state-of-the-art GPU-based simulators and their energy usage is lower than any other published result. This result is the first for simulating the circuit on a single hardware accelerator. We also extensively analyze the techniques and algorithms we implement our simulators with, many of which can be realized on other types of hardware. Thus, this article is of interest to any researcher or practitioner interested in efficient SNN simulation, whether they target FPGAs or not.
Keywords: cortical microcircuit, fpga, hls, hpc, opencl, simulation, spiking neural networks

1 Introduction

natkey1 outline what they see as the four biggest challenges for human brain simulation; scale, complexity, speed, and integration. Scale refers to the enourmous size of the brain – billions of neurons and trillions synapses – which is difficult to simulate at acceptable speeds even for the fastest supercomputers. There are two reasons for this difficulty. First, the conceptual mismatch between how brains and computers operate; the former can be viewed as collections of billions of processing nodes, communicating with sparse events called spikes (ghosh2009third), while the latter is composed of imperative programs, vulnerable to the von Neumann bottleneck (efnusheva2017survey), and executed on general-purpose devices (GPUs or GPUs). Power in-efficiency is the second reason. General-purpose devices are jacks of all trades and can execute a wide variety of workloads, ranging from word processing to scientific simulations. Their generality comes at a high price and executing a single instruction consumes up to three orders of magnitude more energy than the computation itself (jouppi2018motivation). This causes brain simulations to consume far more energy than the roughly 20 Watts a human brain draws (versace2010brain). Finally, the  2004 end of Dennard’s scaling (bohr200730), and the impending termination of Moore’s law (theis2017end), forces researchers to reconsider how to compute efficiently in a post-Moore future. Among the more promising options in a post-Moore world is the use of reconfigurable systems such as Field-Programmable Gate Arrays (FPGAs) (kuon2008fpga).

FPGAs, along with their siblings, Coarse-Grained Reconfigurable Systems, CGRAs (podobas2020survey), belong to the reconfigurable family of computing devices. Unlike traditional general-purpose processors (CPUs and GPUs), their underlying compute fabric is composed of a large number of reconfigurable blocks of different types. The most common blocks are look-up tables (LUTs, often several hundred thousands), digital signal processing (DSPs – capable of tens of TFLOP/s), or on-chip random-access memory (BRAM – tens-to-hundreds of MB) (langhammer2021stratix; murphy2017xilinx). These resources allow designers to create custom hardware that sacrifice generality for better performance and lower energy consumption than general-purpose devices, as well as transcending the latters’ limitations (i.e., the mentioned von Neumann-bottleneck). Importantly, these devices can, and often already are  (sano2023essper; meyer2023multi; boku2022cygnus), live side-by-side in HPC nodes and can be reconfigured before runtime (vipin2018fpga): when the user is running a brain simulation, the FPGA will be configured as an efficient brain simulator, while for a different application a different accelerator will be used. In short, FPGA facilitates the use of special hardware accelerators during application runtime but is general enough so that different accelerators can be configured between applications. This makes them an attractive choice for neuroscience simulation since they can be reconfigured for different neuron, synapse, and axon models, which dedicated neuromorphic ASIC hardware, whose silicon is immutable, cannot be.

Historically, FPGAs have been designed using low-level hardware description languages (HDLs) such as VHDL and Verilog (perry2002vhdl). These languages have a steep learning curve and require specialist knowledge to be comfortable with. However, with the increased maturity of High-Level Synthesis (HLS) tools in the last two decades (nane2015survey), there has been a resurgence of interest in using FPGAs for HPC. HLS allow designers to describe hardware in relatively high-level languages such as C and C++ whose learning curves are shallower  (podobas2017evaluating; zohouri2016evaluating). For example, FPGAs have been used to accelerate Computational Fluid Dynamics (karp2021high; faj2023scalable), Quantum circuit simulations (podobas2023q2logic; aminian2008fpga), Molecular Dynamics (sanaullah2018unlocking), N-Body systems (del2018scalable; menzel2021strong; huthmann2019scaling), and much more, demonstrating advantages over alternative solutions.

In this work, we use of HLS to design simulators for the Potjans-Diesmann cortical microcircuit (potjans2014). While there is ample prior work on FPGA-based neuromorphic systems (see LABEL:sec:sota for related work), our system is (to the best of our knowledge) the most energy-efficient simulator of the Potjans-Diesmann circuit in existence (25 nJ/event), while reaching a faster-than-realtime (\approx 1.2x) simulation speed on a single FPGA. We use the Intel’s OpenCL SDK for FPGA HLS toolchain (czajkowski2012opencl) to design our simulators, but our designs are modular enough to easily be ported to other HLS-based systems (e.g., Vivado (o2014xilinx) or OneAPI), and other FPGAs. Our contributions are:

  • The first simulators of the previously mentioned circuit on a single FPGA, running faster than real-time.

  • The most energy-efficient simulators for the circuit when measured by energy per synaptic event.

  • The presentation and analysis of the algorithms, thought processess, trade-offs, and lessons learned, while designing these simulators.

  • An empirically motivated analysis on what hardware features are required to simulate the circuit even faster than what we are capable of.

The rest of this article is structured as follows. In section 2, we discuss SNNs in general and the microcircuit in particular. We explain how FPGAs work and we briefly introduce the HLS design methodology. In LABEL:sec:ssn-sim we discuss SNN simulation and present the algorithms and ideas underlying our simulators. The utility of many of the ideas have already been demonstrated in other parts of Computer Science, but not in connection with SNN simulation. Hence, we believe they deserve a thorough treatment. We evaluate many different variants and parametrizations of our simulators in LABEL:sec:results. Finally, in LABEL:sec:disc we put our results in perspective and compare them with the state-of-the-art.

2 Material and Methods

We begin with an overview of spiking neural networks (SNNs) before discussing the Potjans-Diesmann cortical microcircuit, an SNN for simulating a small part (microcircuit) of the mammalian brain. In section 2.3 we explain how FPGAs works and what makes them different from from conventional hardware. In sections sections 2.4, 2.5 and LABEL:sec:ocl-fpga we introduce HLS and how we use OpenCL for HLS.

2.1 Spiking Neural Networks

An SNN is an artificial neural network (ANN) that transfers signals in time-dependent bursts, i.e. spikes. Unlike other ANNs, SNNs are designed with biological plausibility in mind, making them useful for neuroscience. SNNs are usually modelled as directed (multi-)graphs, where vertices represent neurons and edges synaptic connections between neurons. Neurons have a membrane potential that varies over time. When the potential exceeds a threshold the neuron discharges – spikes – and sends current via its synapses to its neighbours which they receive after a synapse-specific delay. The amount of current as well as the transfer time is synapse-specific (han2020). The neuron’s statefulness and the non-differentiable, discontinuous signal transfer function are two fundamental aspects distinguishing SNNs from other ANNs. While these basic operating principles are enough to describe most SNNs, SNNs vary in neuron model and other parameters. In this work, we use the basic leaky integrate-and-fire (LIF) neuron model, defined by

RCdudt=urestu(t)+RI(t).𝑅𝐶d𝑢d𝑡subscript𝑢rest𝑢𝑡𝑅𝐼𝑡RC\frac{\mathrm{d}u}{\mathrm{d}t}=u_{\mathrm{rest}}-u(t)+RI(t).italic_R italic_C divide start_ARG roman_d italic_u end_ARG start_ARG roman_d italic_t end_ARG = italic_u start_POSTSUBSCRIPT roman_rest end_POSTSUBSCRIPT - italic_u ( italic_t ) + italic_R italic_I ( italic_t ) . (1)

The equation describes the membrane potential over time. The variables R𝑅Ritalic_R and C𝐶Citalic_C are the resistance and capacitance of the membrane, u(t)𝑢𝑡u(t)italic_u ( italic_t ) its potential at time t𝑡titalic_t, I(t)𝐼𝑡I(t)italic_I ( italic_t ) the amount of current it receives at time t𝑡titalic_t from its neighbours, and urestsubscript𝑢restu_{\mathrm{rest}}italic_u start_POSTSUBSCRIPT roman_rest end_POSTSUBSCRIPT its resting potential. With τm=RCsubscript𝜏𝑚𝑅𝐶\tau_{m}=RCitalic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = italic_R italic_C, and u(0)=urest=0𝑢0subscript𝑢rest0u(0)=u_{\mathrm{rest}}=0italic_u ( 0 ) = italic_u start_POSTSUBSCRIPT roman_rest end_POSTSUBSCRIPT = 0 the solution to the equation is

u(t)=RI(t)RI(t)etτm.𝑢𝑡𝑅𝐼𝑡𝑅𝐼𝑡superscript𝑒𝑡subscript𝜏𝑚u(t)=RI(t)-RI(t)e^{-\frac{t}{\tau_{m}}}.italic_u ( italic_t ) = italic_R italic_I ( italic_t ) - italic_R italic_I ( italic_t ) italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_t end_ARG start_ARG italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_ARG end_POSTSUPERSCRIPT . (2)

Solving the equation using forward Euler produces a recurrent, discrete representation of the neuron’s potential over time. With Δt0Δ𝑡0\Delta t\to 0roman_Δ italic_t → 0,

τmu(t+Δt)u(t)Δt=u(t)+RI(t)subscript𝜏𝑚𝑢𝑡Δ𝑡𝑢𝑡Δ𝑡𝑢𝑡𝑅𝐼𝑡\displaystyle\tau_{m}\frac{u(t+\Delta t)-u(t)}{\Delta t}=-u(t)+RI(t)italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT divide start_ARG italic_u ( italic_t + roman_Δ italic_t ) - italic_u ( italic_t ) end_ARG start_ARG roman_Δ italic_t end_ARG = - italic_u ( italic_t ) + italic_R italic_I ( italic_t ) (3)
u(t+Δt)=u(t)+Δtτm(u(t)+RI(t))absent𝑢𝑡Δ𝑡𝑢𝑡Δ𝑡subscript𝜏𝑚𝑢𝑡𝑅𝐼𝑡\displaystyle\implies u(t+\Delta t)=u(t)+\frac{\Delta t}{\tau_{m}}\left(-u(t)+% RI(t)\right)⟹ italic_u ( italic_t + roman_Δ italic_t ) = italic_u ( italic_t ) + divide start_ARG roman_Δ italic_t end_ARG start_ARG italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_ARG ( - italic_u ( italic_t ) + italic_R italic_I ( italic_t ) ) (4)
u(t+Δt)=(1Δtτm)u(t)+ΔtRτmI(t).absent𝑢𝑡Δ𝑡1Δ𝑡subscript𝜏𝑚𝑢𝑡Δ𝑡𝑅subscript𝜏𝑚𝐼𝑡\displaystyle\implies u(t+\Delta t)=\left(1-\frac{\Delta t}{\tau_{m}}\right)u(% t)+\frac{\Delta tR}{\tau_{m}}I(t).⟹ italic_u ( italic_t + roman_Δ italic_t ) = ( 1 - divide start_ARG roman_Δ italic_t end_ARG start_ARG italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_ARG ) italic_u ( italic_t ) + divide start_ARG roman_Δ italic_t italic_R end_ARG start_ARG italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_ARG italic_I ( italic_t ) . (5)

The solution forms the basis of step-wise simulation of LIF neurons.

After a neuron spikes it enters its refractory state. Its potential becomes fixed at uresetsubscript𝑢resetu_{\mathrm{reset}}italic_u start_POSTSUBSCRIPT roman_reset end_POSTSUBSCRIPT and it ceases to respond to stimuli for a duration controlled by the τrefsubscript𝜏ref\tau_{\mathrm{ref}}italic_τ start_POSTSUBSCRIPT roman_ref end_POSTSUBSCRIPT parameter. Usually, the refractory period is in the order of milliseconds and for simplicity one sets ureset=urest=0subscript𝑢resetsubscript𝑢rest0u_{\mathrm{reset}}=u_{\mathrm{rest}}=0italic_u start_POSTSUBSCRIPT roman_reset end_POSTSUBSCRIPT = italic_u start_POSTSUBSCRIPT roman_rest end_POSTSUBSCRIPT = 0. With r(t)𝑟𝑡r(t)italic_r ( italic_t ) denoting how long the neuron will say refractory at time t𝑡titalic_t, uthrsubscript𝑢thru_{\mathrm{thr}}italic_u start_POSTSUBSCRIPT roman_thr end_POSTSUBSCRIPT the neuron’s spiking threshold, and ΔtΔ𝑡\Delta troman_Δ italic_t arbitrarily set to one, we can incorporate refractoriness into the LIF model:

r(t+1)𝑟𝑡1\displaystyle r(t+1)italic_r ( italic_t + 1 ) ={τrefifu(t+1)uthrr(t)1elseifr(t)>00otherwiseabsentcasessubscript𝜏refif𝑢𝑡1subscript𝑢thr𝑟𝑡1elseif𝑟𝑡00otherwise\displaystyle=\begin{cases}\tau_{\mathrm{ref}}&\mathrm{if}\,u(t+1)\geq u_{% \mathrm{thr}}\\ r(t)-1&\mathrm{elseif}\,r(t)>0\\ 0&\mathrm{otherwise}\\ \end{cases}= { start_ROW start_CELL italic_τ start_POSTSUBSCRIPT roman_ref end_POSTSUBSCRIPT end_CELL start_CELL roman_if italic_u ( italic_t + 1 ) ≥ italic_u start_POSTSUBSCRIPT roman_thr end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_r ( italic_t ) - 1 end_CELL start_CELL roman_elseif italic_r ( italic_t ) > 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL roman_otherwise end_CELL end_ROW (6)
u(t+1)𝑢𝑡1\displaystyle u(t+1)italic_u ( italic_t + 1 ) ={(11τm)u(t)+RτmI(t)ifr(t)=00otherwiseabsentcases11subscript𝜏𝑚𝑢𝑡𝑅subscript𝜏𝑚𝐼𝑡if𝑟𝑡00otherwise\displaystyle=\begin{cases}\left(1-\frac{1}{\tau_{m}}\right)u(t)+\frac{R}{\tau% _{m}}I(t)&\mathrm{if}\quad r(t)=0\\ 0&\mathrm{otherwise}\end{cases}= { start_ROW start_CELL ( 1 - divide start_ARG 1 end_ARG start_ARG italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_ARG ) italic_u ( italic_t ) + divide start_ARG italic_R end_ARG start_ARG italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_ARG italic_I ( italic_t ) end_CELL start_CELL roman_if italic_r ( italic_t ) = 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL roman_otherwise end_CELL end_ROW (7)

2.1.1 Network topology

Refer to caption
Figure 1: A fully-connected neural network with two hidden layers, two input neurons, and three output neurons

A major difference between conventional ANNs and SNNs is that the former often are layered; all neurons in one layer only receives inputs from neurons in the previous layer and only sends outputs to neurons in the following layer. If all neurons in a layer send output only to all neurons in the following layer the layer is said to be fully-connected and its signal-transfer can be represented as a matrix-vector multiplication. SNNs can similarily be layered and it has been found to be a good approach for classification (zheng2021). But it is not suitable for simulation as the neurons in real brains are not organized into fully-connected layers. Instead, their topology is “chaotic” and full of recurrent connections, self-loops (autapses), and multiple edges (multapses). This has far-reaching consequences for what data structures are appropriate for SNNs. An adjacency matrix, for example, is not enough to represent their topological richness.

2.2 Potjans-Diesmann’s Microcircuit

Name Value Description (Unit)
ΔtΔ𝑡\Delta troman_Δ italic_t 0.1 Time step duration (ms)
Cmsubscript𝐶𝑚C_{m}italic_C start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT 250 Membrane capacity (pF)
τmsubscript𝜏𝑚\tau_{m}italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT 10 Membrane time constant (ms)
τrefsubscript𝜏ref\tau_{\mathrm{ref}}italic_τ start_POSTSUBSCRIPT roman_ref end_POSTSUBSCRIPT 2 Refractory period (ms)
τsynsubscript𝜏syn\tau_{\mathrm{syn}}italic_τ start_POSTSUBSCRIPT roman_syn end_POSTSUBSCRIPT 0.5 Postsyn. current time constant (ms)
urestsubscript𝑢restu_{\mathrm{rest}}italic_u start_POSTSUBSCRIPT roman_rest end_POSTSUBSCRIPT -65 Resting and reset potential (mV)
uthrsubscript𝑢thru_{\mathrm{thr}}italic_u start_POSTSUBSCRIPT roman_thr end_POSTSUBSCRIPT -50 Spiking threshold (mV)
vthsubscript𝑣thv_{\mathrm{th}}italic_v start_POSTSUBSCRIPT roman_th end_POSTSUBSCRIPT 8 Thl. neurons’ mean spiking rate (Hz)
ωextsubscript𝜔ext\omega_{\mathrm{ext}}italic_ω start_POSTSUBSCRIPT roman_ext end_POSTSUBSCRIPT 0.15 Thl. spikes amplitude (mV)
Table 1: Microcircuit’s general and simulation parameters
i𝑖iitalic_i Pop. Nisubscript𝑁𝑖N_{i}italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT K𝐾Kitalic_K uinitsubscript𝑢initu_{\mathrm{init}}italic_u start_POSTSUBSCRIPT roman_init end_POSTSUBSCRIPT ωisubscript𝜔𝑖\omega_{i}italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT δisubscript𝛿𝑖\delta_{i}italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT
1 L23/exh 20 683 1 600 𝒩(68.28,5.36)𝒩68.285.36\mathcal{N}(-68.28,5.36)caligraphic_N ( - 68.28 , 5.36 ) 𝒩(0.15,0.015)𝒩0.150.015\mathcal{N}(0.15,0.015)caligraphic_N ( 0.15 , 0.015 ) 𝒩(1.5,0.75)𝒩1.50.75\mathcal{N}(1.5,0.75)caligraphic_N ( 1.5 , 0.75 )
2 L23/inh 5 834 1 500 𝒩(63.16,4.57)𝒩63.164.57\mathcal{N}(-63.16,4.57)caligraphic_N ( - 63.16 , 4.57 ) 𝒩(0.6,0.06)𝒩0.60.06\mathcal{N}(-0.6,0.06)caligraphic_N ( - 0.6 , 0.06 ) 𝒩(0.75,0.325)𝒩0.750.325\mathcal{N}(0.75,0.325)caligraphic_N ( 0.75 , 0.325 )
3 L4/exh 21 915 2 100 𝒩(63.33,4.74)𝒩63.334.74\mathcal{N}(-63.33,4.74)caligraphic_N ( - 63.33 , 4.74 ) 𝒩(0.15,0.015)𝒩0.150.015\mathcal{N}(0.15,0.015)caligraphic_N ( 0.15 , 0.015 ) 𝒩(1.5,0.75)𝒩1.50.75\mathcal{N}(1.5,0.75)caligraphic_N ( 1.5 , 0.75 )
4 L4/inh 5 479 1 900 𝒩(63.45,4.94)𝒩63.454.94\mathcal{N}(-63.45,4.94)caligraphic_N ( - 63.45 , 4.94 ) 𝒩(0.6,0.06)𝒩0.60.06\mathcal{N}(-0.6,0.06)caligraphic_N ( - 0.6 , 0.06 ) 𝒩(0.75,0.325)𝒩0.750.325\mathcal{N}(0.75,0.325)caligraphic_N ( 0.75 , 0.325 )
5 L5/exh 4 850 2 000 𝒩(63.11,4.94)𝒩63.114.94\mathcal{N}(-63.11,4.94)caligraphic_N ( - 63.11 , 4.94 ) 𝒩(0.15,0.015)𝒩0.150.015\mathcal{N}(0.15,0.015)caligraphic_N ( 0.15 , 0.015 ) 𝒩(1.5,0.75)𝒩1.50.75\mathcal{N}(1.5,0.75)caligraphic_N ( 1.5 , 0.75 )
6 L5/inh 1 065 1 900 𝒩(61.66,4.55)𝒩61.664.55\mathcal{N}(-61.66,4.55)caligraphic_N ( - 61.66 , 4.55 ) 𝒩(0.6,0.06)𝒩0.60.06\mathcal{N}(-0.6,0.06)caligraphic_N ( - 0.6 , 0.06 ) 𝒩(0.75,0.325)𝒩0.750.325\mathcal{N}(0.75,0.325)caligraphic_N ( 0.75 , 0.325 )
7 L6/exh 14 395 2 900 𝒩(66.72,5.46)𝒩66.725.46\mathcal{N}(-66.72,5.46)caligraphic_N ( - 66.72 , 5.46 ) 𝒩(0.15,0.015)𝒩0.150.015\mathcal{N}(0.15,0.015)caligraphic_N ( 0.15 , 0.015 ) 𝒩(1.5,0.75)𝒩1.50.75\mathcal{N}(1.5,0.75)caligraphic_N ( 1.5 , 0.75 )
8 L6/inh 2 948 2 100 𝒩(61.43,4.48)𝒩61.434.48\mathcal{N}(-61.43,4.48)caligraphic_N ( - 61.43 , 4.48 ) 𝒩(0.6,0.06)𝒩0.60.06\mathcal{N}(-0.6,0.06)caligraphic_N ( - 0.6 , 0.06 ) 𝒩(0.75,0.325)𝒩0.750.325\mathcal{N}(0.75,0.325)caligraphic_N ( 0.75 , 0.325 )
Table 2: Population-specific parameters. The first four columns denote the index, i𝑖iitalic_i, the name, the size, Nisubscript𝑁𝑖N_{i}italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, and the number of thalamic connections, Kisubscript𝐾𝑖K_{i}italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, of the eight populations. The last three columns denote the Gaussians from which the neurons’ initial potential (mV), the neurons’ synapses amplitudes (mV), and delays (ms) of excitatory postsynaptic potential are sampled from. However, synapse amplitudes from population L23/exh to L3/exh are sampled from 𝒩(0.3,0.03)𝒩0.30.03\mathcal{N}(0.3,0.03)caligraphic_N ( 0.3 , 0.03 ) and not 𝒩(0.15,0.015)𝒩0.150.015\mathcal{N}(0.15,0.015)caligraphic_N ( 0.15 , 0.015 ).
L23/exc L23/inh L4/exc L4/inh L5/exc L5/inh L6/exc L6/inh
L23/exc 0.1009 0.1689 0.0437 0.0818 0.0323 0.0000 0.0076 0.0000
L23/inh 0.1346 0.1371 0.0316 0.0515 0.0755 0.0000 0.0042 0.0000
L4/exc 0.0077 0.0059 0.0497 0.1350 0.0067 0.0003 0.0453 0.0000
L4/inh 0.0691 0.0029 0.0794 0.1597 0.0033 0.0000 0.1057 0.0000
L5/exc 0.1004 0.0622 0.0505 0.0057 0.0831 0.3726 0.0204 0.0000
L5/inh 0.0548 0.0269 0.0257 0.0022 0.0600 0.3158 0.0086 0.0000
L6/exc 0.0156 0.0066 0.0211 0.0166 0.0572 0.0197 0.0396 0.2252
L6/inh 0.0364 0.0010 0.0034 0.0005 0.0277 0.0080 0.0658 0.1443
Table 3: Probability that a random neuron in the population specified by the rows is connected to a random neuron in the population specified by the columns.

In 2014 potjans2014 compiled the results of a dozen empirical studies to create a full-scale spiking neural network microcircuit of one cortical column in the mammalian early sensory cortex. The microcircuit covers 1 mm2superscriptmm2\mathrm{mm}^{2}roman_mm start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT of the cerebral cortex and consists of 77,169 LIF neurons grouped into four neocortical layers;111These layers are not analoguous to layers in conventional ANNs. L23, L4, L5, and L6.222Layer L2 and L3 merged and L1 omitted. Each layer is subdivided into one excitatory population that increases neural activity and one inhibitory population that decreases it. Around 300 million synapses connect the neurons.

The microcircuit is a balanced random network so that neural activity is balanced by excitatory neurons that inreases activity and inhibitory ones that dampen it (brunel2000). Connectivity and features are sampled from parametric probability distributions, rather than set explicitly. Table 2 and table 3 specify these distributions’ parameters. For example, the initial potential of every inhibitory neuron in L23 is set by sampling a Gaussian with mean -63.16 mV and standard deviation 4.57 mV and the expected number of synapses from population L23/inh to population L5/exc is 5,8344,8500.07552583448500.075525,834\cdot 4,850\cdot 0.0755\approx 25 , 834 ⋅ 4 , 850 ⋅ 0.0755 ≈ 2 million. Synapses are sampled with replacement – multiple synapses can connect the same neuron pair. While the neurons are arranged in in terms of neocortical layers, the layers’ connection probabilities show that the network’s topology is not layered; neurons in most populations can connect to neurons in any of the other populations.

In addition to the synapses within the cortical column, the circuit receives spikes from external neurons – thalamic input. Column K𝐾Kitalic_K in table 2 specifies the number of thalamic neurons a given population’s neurons receive spikes from, and parameter vthsubscript𝑣thv_{\mathrm{th}}italic_v start_POSTSUBSCRIPT roman_th end_POSTSUBSCRIPT in table 1 how frequently thalamic neurons spike.333potjans2014 set the parameter to 15 Hz, but we use the NEST model as a baseline, where it is set to 8 Hz. The expected number of thalamic spikes received per second by all neurons in a population is vthKsubscript𝑣th𝐾v_{\mathrm{th}}Kitalic_v start_POSTSUBSCRIPT roman_th end_POSTSUBSCRIPT italic_K. For example, neurons in population L23/exc receive about 81,600=12,800formulae-sequence81600128008\cdot 1,600=12,8008 ⋅ 1 , 600 = 12 , 800 thalamic spikes per second. The amplitude of all thalamic synapses is fixed at ωexh=0.15subscript𝜔exh0.15\omega_{\mathrm{exh}}=0.15italic_ω start_POSTSUBSCRIPT roman_exh end_POSTSUBSCRIPT = 0.15 mV. As thalamic spikes can be computationally expensive to simulate, potjans2014 suggest approximating them with constant direct current injected at a rate of vthKωexhτsynsubscript𝑣th𝐾subscript𝜔exhsubscript𝜏synv_{\mathrm{th}}K\omega_{\mathrm{exh}}\tau_{\mathrm{syn}}italic_v start_POSTSUBSCRIPT roman_th end_POSTSUBSCRIPT italic_K italic_ω start_POSTSUBSCRIPT roman_exh end_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT roman_syn end_POSTSUBSCRIPT mV per second. In our simulator we model thalamic spikes, however.

The synaptic parameters are also scaled. The synapse amplitude by wfsubscript𝑤𝑓w_{f}italic_w start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT which is a function of the membrane time constant, τmsubscript𝜏𝑚\tau_{m}italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, membrane capacity, Cmsubscript𝐶𝑚C_{m}italic_C start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, and the postsynaptic time constant, τsynsubscript𝜏syn\tau_{\mathrm{syn}}italic_τ start_POSTSUBSCRIPT roman_syn end_POSTSUBSCRIPT, that maps postsynaptic potential to postsynaptic current:444See hanuschkin2010 for the derivation.

d𝑑\displaystyle ditalic_d =τsynτmabsentsubscript𝜏synsubscript𝜏𝑚\displaystyle=\tau_{\mathrm{syn}}-\tau_{m}= italic_τ start_POSTSUBSCRIPT roman_syn end_POSTSUBSCRIPT - italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT (8)
p𝑝\displaystyle pitalic_p =τsynτmabsentsubscript𝜏synsubscript𝜏𝑚\displaystyle=\tau_{\mathrm{syn}}\tau_{m}= italic_τ start_POSTSUBSCRIPT roman_syn end_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT (9)
q𝑞\displaystyle qitalic_q =τm/τsynabsentsubscript𝜏𝑚subscript𝜏syn\displaystyle=\tau_{m}/\tau_{\mathrm{syn}}= italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT / italic_τ start_POSTSUBSCRIPT roman_syn end_POSTSUBSCRIPT (10)
wfsubscript𝑤𝑓\displaystyle w_{f}italic_w start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT =Cmdp(qτm/dqτsyn/d)585absentsubscript𝐶𝑚𝑑𝑝superscript𝑞subscript𝜏𝑚𝑑superscript𝑞subscript𝜏syn𝑑585\displaystyle=\frac{C_{m}d}{p(q^{\tau_{m}/d}-q^{\tau_{\mathrm{syn}}/d})}% \approx 585= divide start_ARG italic_C start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_d end_ARG start_ARG italic_p ( italic_q start_POSTSUPERSCRIPT italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT / italic_d end_POSTSUPERSCRIPT - italic_q start_POSTSUPERSCRIPT italic_τ start_POSTSUBSCRIPT roman_syn end_POSTSUBSCRIPT / italic_d end_POSTSUPERSCRIPT ) end_ARG ≈ 585 (11)

The constants p22subscript𝑝22p_{22}italic_p start_POSTSUBSCRIPT 22 end_POSTSUBSCRIPT and p11subscript𝑝11p_{11}italic_p start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT define the membranes’ and presynaptic currents’ decay rate:

p11subscript𝑝11\displaystyle p_{11}italic_p start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT =exp(Δt/τsyn)0.82absentΔ𝑡subscript𝜏syn0.82\displaystyle=\exp(-\Delta t/\tau_{\mathrm{syn}})\approx 0.82= roman_exp ( - roman_Δ italic_t / italic_τ start_POSTSUBSCRIPT roman_syn end_POSTSUBSCRIPT ) ≈ 0.82 (12)
p22subscript𝑝22\displaystyle p_{22}italic_p start_POSTSUBSCRIPT 22 end_POSTSUBSCRIPT =exp(Δt/τm)0.99absentΔ𝑡subscript𝜏𝑚0.99\displaystyle=\exp(-\Delta t/\tau_{m})\approx 0.99= roman_exp ( - roman_Δ italic_t / italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) ≈ 0.99 (13)

The injection of the presynaptic current is scaled by p21subscript𝑝21p_{21}italic_p start_POSTSUBSCRIPT 21 end_POSTSUBSCRIPT:

β𝛽\displaystyle\betaitalic_β =τsynτm/(τmτsyn)absentsubscript𝜏synsubscript𝜏𝑚subscript𝜏𝑚subscript𝜏syn\displaystyle=\tau_{\mathrm{syn}}\tau_{m}/(\tau_{m}-\tau_{\mathrm{syn}})= italic_τ start_POSTSUBSCRIPT roman_syn end_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT / ( italic_τ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT - italic_τ start_POSTSUBSCRIPT roman_syn end_POSTSUBSCRIPT ) (14)
γ𝛾\displaystyle\gammaitalic_γ =β/Cmabsent𝛽subscript𝐶𝑚\displaystyle=\beta/C_{m}= italic_β / italic_C start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT (15)
p21subscript𝑝21\displaystyle p_{21}italic_p start_POSTSUBSCRIPT 21 end_POSTSUBSCRIPT =p11γ(exp(Δt/β)1)0.00036absentsubscript𝑝11𝛾Δ𝑡𝛽10.00036\displaystyle=p_{11}\gamma(\exp(\Delta t/\beta)-1)\approx 0.00036= italic_p start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT italic_γ ( roman_exp ( roman_Δ italic_t / italic_β ) - 1 ) ≈ 0.00036 (16)

The subscripts match those found in the source code for the NEST simulator (plesser2015) and have no deeper meaning here.555https://github.com/nest/nest-simulator/blob/aa9907a5d5a7916eeeeca00c2e6584202702eb2a/models/iaf_psc_exp.cpp Taken together, this gives us the following discrete recurrences for the step-wise update of membrane potential, utsubscript𝑢𝑡u_{t}italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT:

ut+1={uresetifrt>0p22ut+Itp21otherwise,subscript𝑢𝑡1casessubscript𝑢resetifsubscript𝑟𝑡0subscript𝑝22subscript𝑢𝑡subscript𝐼𝑡subscript𝑝21otherwiseu_{t+1}=\begin{cases}u_{\mathrm{reset}}&\text{if}\quad r_{t}>0\\ p_{22}u_{t}+I_{t}p_{21}&\text{otherwise}\\ \end{cases},italic_u start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = { start_ROW start_CELL italic_u start_POSTSUBSCRIPT roman_reset end_POSTSUBSCRIPT end_CELL start_CELL if italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT > 0 end_CELL end_ROW start_ROW start_CELL italic_p start_POSTSUBSCRIPT 22 end_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT 21 end_POSTSUBSCRIPT end_CELL start_CELL otherwise end_CELL end_ROW , (17)

presynaptic current, Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT:

It+1=p11It+Ttwfωexh,subscript𝐼𝑡1subscript𝑝11subscript𝐼𝑡subscript𝑇𝑡subscript𝑤𝑓subscript𝜔exhI_{t+1}=p_{11}I_{t}+T_{t}w_{f}\omega_{\mathrm{exh}},italic_I start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = italic_p start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT italic_ω start_POSTSUBSCRIPT roman_exh end_POSTSUBSCRIPT , (18)

and refractoriness, rtsubscript𝑟𝑡r_{t}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT:

rt+1={τrefifut+1uthrrt1elseifrt>00otherwise.subscript𝑟𝑡1casessubscript𝜏refifsubscript𝑢𝑡1subscript𝑢thrsubscript𝑟𝑡1elseifsubscript𝑟𝑡00otherwiser_{t+1}=\begin{cases}\tau_{\mathrm{ref}}&\text{if}\quad u_{t+1}\geq u_{\mathrm% {thr}}\\ r_{t}-1&\text{elseif}\quad r_{t}>0\\ 0&\text{otherwise}\end{cases}.italic_r start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = { start_ROW start_CELL italic_τ start_POSTSUBSCRIPT roman_ref end_POSTSUBSCRIPT end_CELL start_CELL if italic_u start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ≥ italic_u start_POSTSUBSCRIPT roman_thr end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - 1 end_CELL start_CELL elseif italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT > 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL otherwise end_CELL end_ROW . (19)

The variable Ttsubscript𝑇𝑡T_{t}italic_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT denotes the number of thalamic spikes received by the presynapse at time t𝑡titalic_t and is modelled as a Poisson distributed random variable with mean vthKΔtsubscript𝑣th𝐾Δ𝑡v_{\mathrm{th}}K\Delta titalic_v start_POSTSUBSCRIPT roman_th end_POSTSUBSCRIPT italic_K roman_Δ italic_t.

Having presented the theoretical foundations for LIF SNNs and the specifics of the Potjans-Diesmann microcircuit, we now present the technologies implement our simulator with. We return to SNN simulation in LABEL:sec:ssn-sim where we both delve deep into simulation methods and present our simulators.

2.3 Field-Programmable Gate Arrays

An FPGA is a type of reprogrammable integrated circuit. First marketed by Altera and Xilinx in the 1980’s, FPGAs have found uses in many niches of the electronics industry; in avionics, in telecommunications, and in VLSI design because of their unique blend of performance, flexibility, and development costs (trimberger2015). FPGAs are not as performant as Application-Specific Integrated Circuits (ASICs), but ASICs are extremely expensive to develop which makes them cost-prohibitive unless the number of units produced runs in the tens of millions. Furthermore, ASICs are not reprogrammable and thus suitable only for the specific tasks they were developed for. Central Processing Units (CPUs), on the other hand, are flexible and inexpensive, but have low performance. Graphics Processing Units (GPUs) sit between CPUs and ASICs on the flexibility-performance spectrum. While GPUs can run any computation, they generally only excel at highly regular, numerically intensive computations. In particular, they do not handle divergent control flow well.

FPGAs trade performance for flexibility in a different manner than CPUs, GPUs, and ASICs. They consist of configurable blocks organized in a grid-like fabric and each block can be configured to compute a small and specific function. Like the logical and of two one-bit signals or comparing two eight-bit numbers. The number of configurable blocks varies enormously from one FPGA to another; high-end FPGAs can contain hundreds of thousands or even millions of blocks. For performance-sensitive workloads, FPGAs’ advantage is the lack instruction processing overhead. CPUs and GPUs load programs from memory and decode and process their instructions one after the other, while FPGAs merely pass data through a pre-configured, fixed-function circuit, similar to how ASICs operate.666It is of course true that superscalar and pipelined processors can handle many instructions in parallel. But the point stands; the overhead caused by the need to decode and dispatch instructions is large. However, the reprogrammability of FPGAs comes at a great cost and their raw performance cannot rival that of ASICs or GPUs unless algorithms are specifically designed for them. They also run at lower clock frequencies than comparable CPUs from the same generation. For example, our simulators run at around 600 MHz,777However, langhammer2023 ran a softcore microprocessor at over 770 MHz on the same FPGA which is close to the theoretical limit. while Intel Core i9 processors operate at over 3 GHz. To stay competitive, FPGA implementations need to carry out much more work per clock cycle than equivalent CPU implementations. These potential drawbacks aside, researchers have deployed FPGAs for performance sensitive workloads, with promising results (nguyen2022).

Bitstreams are files that configure FPGA blocks and specialized tools write them to the FPGA’s fabric. Bitstreams are to FPGAs what machine code are to microprocessors. Tools that take descriptions of the digital circuit the FPGA should implement and produce bitstreams are called synthesizers. The descriptions are usually written in hardware description languages (HDLs), but many tools support higher-level languages as well.

Refer to caption
Figure 2: A LUT for computing any two-variable boolean function implemented as one 4:1 multiplexer and four memory bits. The bit values determines which function the LUT computes.

A configurable block is a logic block that functions as a small combinational circuit that computes the boolean function it was configured for. Depending on configuration, the same logic block can serve as an and-gate, or-gate, xor-gate, etc. Memory embedded adjacent to the block stores its configuration and the block’s output is retrieved from this memory. Because the blocks look up their results from memory they are called look-up-tables (LUTs). Figure 2 shows a LUT for a two-input and-gate built with a multiplexer and four memory bits. Flip-flops (FFs) are the FPGA’s basic memory blocks and work as small and fast RAMs. Each FF stores one bit, but multiple FFs can be combined into registers to store larger datums.

In addition to configuring the FPGA’s logic and memory blocks, bitstreams define how the blocks should be connected through the FPGA’s interconnect network. This is called routing and is one of the synthesizer’s most important tasks. Routing is critical to the design as the interconnect occupies most of the FPGA’s fabric (betz1999). Good routing should minimize the total length of the interconnect, the number of blocks required, and the length of the longest path connecting two blocks as the operating frequency of the design is bounded by this length. Routing is a challenging problem in computer science and finding high-quality routing of large designs is both difficult and time-consuming.

Due to their reconfigurability, digital logic implemented in LUTs is slower and requires more components than equivalent logic implemented in non-reconfigurable ASIC gates. Memories built using FFs have lots of overhead because FFs only store one bit. Therefore, modern FPGAs come with non-reconfigurable blocks for arithmetic and storage. One can view these blocks as small ASICs embedded in the FPGA that the designer connects via the configurable blocks. For example, our Agilex 7 FPGA have five different types of blocks; 487 200 Adaptive Logic Modules (ALMs) which functions as LUTs, 1 948 800 flip-flops, 7 110 M20K RAM blocks, two 18.432 Mb eSRAM blocks, and 4 510 DSPs. We describe these blocks in the following sections.

2.3.1 Adaptive Logic Module

Refer to caption
Figure 3: Simplified schematic of the Agilex 7 ALM. The multiplexers’ (trapezoids) control signals (not shown) and the contents of the LUT configures the ALM. The ALM can serve as – among other things – a four-bit adder, a four-bit memory, or as combinational logic of six inputs depending on configuration.

The Adaptive Logic Module (ALM) in figure 3 is the basic building block of Intel’s family of FPGAs. Consisting of one fractured eight-input-LUT, four FFs, and two full-adders, it is a versatile multi-purpose block and can – depending on its configuration – compute two four-input boolean functions, one six-input boolean function, or perform four-bit addition with carry. It can also serve as a four-bit memory.

2.3.2 Digital Signal Processing

The Digital Signal Processing (DSP) block contains functions for multiplication, addition, subtraction, and accumulation. The block functions as the FPGA’s arithmetic logic unit (ALU). Agilex 7 has two types of DPSs; one for integer arithmetic and one for IEEE 754 floating-point arithmetic in single- and half-precision modes. Pipeline registers organized into three stages are contained within the DSP. Data can be routed through one or more of the stages or bypassed completely to achieve a given latency. This can be useful if one operand is available multiple cycles before the other.

2.3.3 Memory Hierarchy

In processor-based hardware, registers, caches, and main memory is organized into a fixed hierarchy. Applications must be structured around the memory hierarchy to optimally use it. Typically, data transfer between the hierarchy’s levels is implicit and not directly under the programmer’s control. On an FPGA the designer constructs the memory hierarchy from the available memory blocks and has fine-grained control over exactly where data is stored. This allows the memory system to be tailored to the application rather than the other way around.

The FPGA’s memory is on-chip if it is embedded in the FPGA fabric itself and off-chip otherwise. On-chip memory is much faster and smaller than off-chip memory. The Agilex 7 has three types of on-chip memory; Memory Logic Array Blocks (MLABs), M20Ks, and embedded SRAM (eSRAM). Scalars and shallow FIFOs are often stored in MLABs, small arrays and caches in M20Ks, and larger buffers in eSRAM. MLABs are not memory blocks per se, but instead a technique for using ALMs as memory. The Agilex 7 can combine ten unused ALMs into one 640-bit register. As many FPGA designs do not use all logic blocks, repurposing them can be very useful. MLABs have very low laencies since they are close to the logic that use them. Xilinx FPGAs implement the same concept under the name distributed RAM or LUTRAM. On-chip M20Ks are also known as block RAM (BRAM) and are much larger than MLABs. The Agilex 7 contains around seven thousand M20K each of which can hold 20 kbit of data, as their name suggets. In total, about 139 Mbit. The board also has two 36 Mbit embedded SRAM blocks. These have high bandwidth and high random transaction rates and compliment the other on-chip memory blocks. Unfortunately, the Intel OpenCL FPGA compiler cannot infer eSRAM so they are unused in this work. On the Agilex 7 four 8 GB DRAM sticks constitute the board’s off-chip memory. Accessing off-chip memory takes much longer than accessing on-chip memory, with access times measured in the hundreds of cycles.

In addition to controlling the types of memory used, the designer can also configure individual memory blocks. Multipum**, for example, can sometimes double memory throughput at the expense of significantly lowering the design’s operating frequency. Banking can increase expected memory access concurrency, but also increase stalls due to conflicts so the technique is best reserved for evenly distributed data.

2.4 High-Level Synthesis

Designers traditionally design circuits in Hardware Description Languages (HDLs) such as VHDL and (System) Verilog. With these the designer specifies the behaviour of all the circuit’s logic gates and flip-flops. The result is a Register-Transfer Level (RTL) design, so called because it models the register-to-register transfer the circuit’s signals. A synthesizer takes the RTL design and produces a low-level representation (netlist) of it which can be lowered further to create an ASIC or transformed into an FPGA bitstream. The synthesizer’s job is far from straight-forward and it must – among other things – place every component of the circuit on a two-dimensional grid and ensure that it can operate with the desired clock frequency.

While HDLs offer a great deal of control over the resulting circuit, they are low-level and lack support for many high-level programming constructs so using them can be tedious and error-prone. It often makes sense to work in a higher-level language instead. That workflow is called High-Level Synthesis (HLS) and is supported by many tools. For example, Intel’s FPGA software generates synthesizable Verilog code from designs coded in the imperative C-like language OpenCL.

The main worry of using HLS is that the hardware will not be as efficient as if an HDL had been used. As the designer works in an imperative, high-level language their view of the hardware may be obscured and not easy to visualize. Furthermore, the machine-generated HDL generated by HLS tools is often opaque and near impossible to understand. These fears may be unfounded, however, as many studies have found that HLS does not degrade performance and sometimes even improve it (lahti2018).

2.5 OpenCL

1__kernel void mul_sd(
2 __global float *A,
3 __global float *B,
4 __global float *C,
5 int N) {
6 for (uint i = 0; i < N; i++)
7 C[i] = A[i] * B[i];
8}
9__kernel void mul_nd(
10 __global float *A,
11 __global float *B,
12 __global float *C,
13 int N) {
14 uint i = get_global_id(0);
15 if (i < N)
16 C[i] = A[i] * B[i];
17}\end{lstlisting}
18\caption{Single and multiple work item OpenCL kernels for element-wise vector product}
19\label{lst:kernels}
20\end{figure}
21
22\noindent In 2009 the Kronos Group published the first version of the
23OpenCL (Open Computing Language) standard \citep{munshi2009}. The goal
24was to replace all vendor-specific languages and application
25programming interfaces (APIs) with a common and portable standard for
26writing high-performance code across all kinds of
27devices. An OpenCL application (unlike one written in a competing
28technology such as CUDA) can run \textit{unmodified} on any device
29with a compliant OpenCL implementation. Today OpenCL now runs on
30millions of CPUs, GPUs, DSPs, and other computing devices.
31
32The OpenCL standard consists of three parts; a specification for a
33programming language, a host API, and a device API. The host API runs
34on the users main computing device (e.g., PC) and controls their
35accelerator (e.g., GPU), whose functionality is accesible through the
36device API. Thus, OpenCL prescribes both how the programmer
37should program the accelerator and how to manage it from an external
38device.
39
40The OpenCL language closely resembles C. It supports functions, loops,
41multiple variable scopes, aggregate data types, and many other
42programming constructs familiar to C programmers. A big difference
43from C is that OpenCL comes with three explicit pointer spaces;
44global, local, and private. These grant the programmer fine-grained
45control over data storage. What constitutes global, local, and
46private memory is device-specific. Generally the largest and
47slowest memory space is global, while the smallest and fastest is
48private. Global memory may be sized in gigabytes, while private memory
49may only be a few kilobytes. OpenCL has native support for SIMD
50types to make it easier to write algorithms for parallel
51hardware. Unlike C, OpenCL specifies the bit width of most builtin
52types -- an \verb!int! is always 32 bits and a \verb!long! 64 bits.
53
54OpenCL was designed with massively parallel architectures in mind
55(e.g., GPUs) and has builtin support for concurrency in the form of
56\textit{work items}. A work item is a small indivisible unit of work
57with, ideally, no dependencies to other work items. The OpenCL runtime
58can schedule independent work items to maximally exploit the
59hardwares potential. Consider the kernels in listing
60\ref{lst:kernels} for computing the element-wise vector
61product. Launching the first kernel -- \verb!mul_sd! -- causes OpenCL
62to instantiate one kernel on one core which runs $N$ iterations of the
63for loop. Launching the second kernel, however, causes OpenCL to
64instantiate up to $N$ kernels, each of which runs on an available core
65and computes one iteration of the for loop. Since the second kernel
66runs more computations in parallel, it likely runs much
67faster. Sometimes the algorithms work cannot feasibly be separated
68into completely independent work items. For such situations, work
69items that must be synchronized can be organized into \textit{work
70 groups} which share local memory. However, synchronization is
71impossible between work groups.
72
73\subsection{OpenCL on Intel FPGAs}
74\label{sec:ocl-fpga}
75\begin{figure}
76 \begin{lstlisting}[style=opencl]
77for (uint i = 0; i < N; i++) {
78 B[i] = h(g(f(A[i])));
79}\end{lstlisting}
80 \caption{A loop-nest that could benefit from pipeline
81 parallelism. The functions \texttt{f}, \texttt{g}, and \texttt{h} are
82 assumed to be short and inlineable.}
83 \label{lst:para}
84\end{figure}
85\begin{figure}
86 \begin{lstlisting}[style=opencl]
87channel uint jobs
88 __attribute__ ((depth(512)));
89__kernel void consumer() {
90 while (true) {
91 uint job =
92 read_channel_intel(jobs);
93 if (!msg)
94 break;
95 process_job(job);
96 }
97}
98__kernel void producer(uint N)
99 for (uint i = 0; i < N; i++) {
100 uint job = create_job(i);
101 send_channel_intel(job);
102 }
103 send_channel_intel(0);
104}\end{lstlisting}
105 \caption{Producer-consumer kernels communicating via a channel}
106 \label{lst:chans}
107\end{figure}
108
109\noindent As FPGAs work very differently from processor-based
110hardware, Intels OpenCL implementation for FPGAs differs in important
111ways from other OpenCL implementations. One major difference concerns
112parallelism. CPUs and GPUs have multiple cores to run multiple
113computations in parallel. Thus, the workload of an OpenCL program
114organized into multiple work items can be mapped onto multiple
115cores. FPGAs do not have any cores in the usual sense and it is
116difficult for Intels OpenCL FPGA compiler to synthesize code
117structured around work items into efficient FPGA bitstreams. Instead,
118Intel recommends designers to structure OpenCL code as single-work
119item kernels, and to derive most of the parallelism from executing
120multiple loop iterations concurrently; a type of concurrency known as
121‘‘pipeline parallelism’’. For example, assume the three functions
122\verb!f!, \verb!g!, and \verb!h! in listing \ref{lst:para} are short,
123side-effect free, and represent computations that can be performed in
124less than one clock cycle. The FPGA compiler can create a three-stage
125pipeline for the loop, where combinational circuits for \verb!f!,
126\verb!g!, and \verb!h! constitute the pipelines three stages. This
127pipeline can process three iterations of the loop in parallel; while
128the \verb!h!-circuit processes data for the first, the
129\verb!g!-circuit processes data for the second, and the
130\verb!f!-circuit for the third iteration, and so on. While the latency
131of every iteration is three cycles, the throughput (initiation
132interval) is only one cycle and the latency for the loop as a whole is
133$N + 2$ since it takes 2 cycles to fill the pipeline. Though, this
134assumes that the latency of every function is fixed and
135predictable. If one function executes operations with variable
136latencies, such as memory accesses, the pipeline may need to
137stall. Moreover, pipeline parallelism cannot increase the throughput
138of the loop beyond one cycle -- for that one has to use other
139techniques.
140
141The FPGA compiler supports many compiler directives
142(‘‘pragmas’’) that can help it better optimize difficult
143loops or loops containing invariants it can not
144prove on its own. Two important directives are \verb!#pragma ivdep!
145and \verb!#pragma unroll N!. The first tells the compiler that the
146loop contains no loop-carried dependencies and therefore it
147can reorder the loop iterations as it pleases. The second tells it to
148duplicate the loop body $N$ times. This means that the resulting
149circuit will use $N$ times as many gates, but, potentially, run $N$
150times faster. Other compilers ignore these directives.
151
152The FPGA compiler extends OpenCL with syntax for declaring channels
153for inter-kernel communication. Channels resemble the pipes feature in
154OpenCL 2.0 which the FPGA compiler does not support. Channels are
155implemented in on-chip memory as first in first out (FIFO) buffers of
156the desired depths and are very fast. \Cref{lst:chans} shows a simple
157producer-consumer example, where one kernel calls \verb!send_channel_intel!
158to put jobs on the FIFO and the other calls \verb!read_channel_intel!
159to remove them. Both functions block if the FIFO is full or empty.
160
161%% \Cref{fig:opencl} shows an entry point (externally callable function)
162%% to an OpenCL program -- a \textit{kernel} -- that computes the
163%% element-wise product of two vectors. The keyword \verb!__kernel!
164%% denotes that the function \verb!mul_ab! is an entry point and
165%% \verb!__global! that the associated pointers reside in the global
166%% memory space. What constitutes global memory is device dependent, but
167%% is generally the devices largest and slowest memory. It can be
168%% contrasted with the local memory (denoted by the keyword
169%% \verb!__local!) which is both considerably smaller and faster to
170%% access than the global memory.
171
172
173%% The kernels \verb!#pragma! lines are compiler-specific directives
174%% that hints to the compiler how to synthesize the code. This is useful
175%% if the code contains invariants the compiler cannot prove or if the
176%% code is difficult for the compiler to optimize. The first directive
177%% \verb!#pragma ivdep! tells the compiler that there is no loop-carried
178%% dependencies in the loop nest, allowing the compiler to reorder the
179%% loop iterations or execute them in parallel. The second directive
180%% \verb!#pragma unroll 4! instructs the compiler to unroll the loop four
181%% times. This means that the resulting circuit will use four times as
182%% much hardware, but will potentially run four times as fast. Note that
183%% these directives have no effect when the code is compiled with a
184%% compiler other than with Intels FPGA SDK for OpenCL compiler.
185
186%% As OpenCL is tailored to take advantage of massively parallel
187%% architectures it supports dividing work loads into work groups and
188%% work items.
189
190%% \subsection{Spiking Neural Network}
191
192%% The spiking neural network (SNN) we simulate is the 1 $\mathrm{mm}^3$
193%% model of the sensory cortex developed by \cite{potjans2014cell}. It
194%% consists of 77,169 neurons grouped into four layers, each of which is
195%% subgrouped into one excitatory and one inhibitory population. About
196%% 300 million synapses connect the neurons randomly with
197%% population-specific probability densities. The neurons membrane
198%% voltage, $V$, is updated according to the leaky integrate-and-fire
199%% (LIF) differential equation;
200%% \begin{equation}
201%% \tau_mU’(t) = -U(t) + R_mI_{\in}(t).
202%% \end{equation}
203%% where $\tau_m$ and $R_m$ represents the time and resistance of the
204%% neurons membrane, and $V_{rest}$ the neurons resting voltage. These
205%% are all the same for all neurons in the model.
206
207%% When the membranes voltage crosses the threshold $V_{thresh}$ the
208%% neuron emits a spike
209
210
211\section{SNN Simulation}
212\label{sec:ssn-sim}
213
214\begin{figure}
215 \begin{lstlisting}[style=pseudocode]
216for t in range(n_tics):
217 for i in range(n_neurons):
218 N[i] = update_neuron(N[i])
219 if spikes(N[i]):
220 Q = enqueue(Q, i, t)
221 for i in range(n_neurons):
222 c = collect(Q, i, t)
223 N[i] = update_psc(Q, c)
224 Q = dequeue(Q, i, t)\end{lstlisting}
225 \caption{Synchronous SNN simulation}
226 \label{lst:basic}
227\end{figure}
228
229\noindent Having reviewed SNNs, the Potjans-Diesmann microcircuit and
230our implementation tools, we now discuss methods for simulating SNNs
231efficiently. We explore SNN simulation in general before introducing
232the ideas and algorithms we use to optimize our simulators. The end
233result is a taxonomy of twelve simulator families, grouped by their
234algorithms and implementation styles. We illustrate most of our points
235with Pythonesque pseudo-code, extended with two keywords;
236\verb!parfor! and \verb!atomic!. The former for loops whose iterations
237are independent and therefore can be executed in parallel. The latter
238for operations that must execute as one indivisible unit. Text in
239\verb!{brackets}! explains what the pseudo-code should do at that
240point.
241
242Broadly speaking, SNN simulation can be categorized on whether it is
243synchronous (time-step**) or asynchronous
244(event-driven). Synchronous simulation updates the state of every
245simulated object at every tick of a clock, regardless of whether it is
246necessary or not \citep{brette2007}. Asynchronous simulation only
247updates simulated objects whey they receive external stimuli, i.e.,
248events. In an asynchronous SNN simulation, spike emission and
249reception constitute the events, since the state of a neuron at a time
250between two events can be calculated easily (and generally is
251unimportant). Hybrid strategies, with asynchronous updates for some
252parts of the SNN and synchronous updates for other parts, are
253possible.
254
255For SNN simulation, asynchronicity offers precision advantages. The
256neurons membrane potential only has to be recomputed when it receives
257spikes. If this happens only rarely the simulator can afford to use
258more sophisticated methods than (repeated) forward Euler to solve the
259LIF equation (\cref{eqn:lif}). Also, spikes can be sent and received
260at any time and do not have to be confined to a discrete grid. For
261example, a synchronous simulator with a 0.1 ms time steps may not be
262able to represent spikes sent at times that are not multiples of 0.1
263ms. Asynchronicity may also have scalability advantages as the
264neurons states do not have to be synchronized to a global
265clock. However, asynchronicity entails irregular computation and
266irregular memory accesses -- traits that are extremely undesirable on
267modern hardware. Furthermore, dense SNNs have many more synapses than
268neurons which results in cascading effects. One spiking neuron ‘‘wakes
269up’’ thousands of its neighbours, causing them to spike and in turn
270wake up thousands of their neighbours. On the whole, the event
271processing overhead may dominate over whatever computational savings
272not being bound by a global clock brings. For example,
273\cite{pimpini2022} presents a sophisticated asynchronous CPU-based SNN
274simulator that supports speculative execution so that future neuron
275states can be computed in advance and then rolled back if received
276spikes (‘‘stragglers’’) invalidates their predicted state. While the
277authors measured accuracy improvements, the performance was
278lackluster. For these reasons, most high-performance SNN simulators,
279including ours, are synchronous.
280
281Synchronous simulation splits the simulation task into two phases; one
282for updating neurons and one for transferring spikes. Listing
283\ref{lst:basic} shows the basic algorithm. It uses two data
284structures; an indexable data structure, \verb!N!, to stores the state
285of all neurons and a queue, \verb!Q!, to keeps track of spikes in
286flight. For every neuron the algorithm calls \verb!update_neuron! to
287update its membrane potential, presynaptic current, and refractoriness
288in accordance with \cref{eqn:mem,eqn:psc,eqn:ref}. If \verb!spikes!
289indicates that the neuron spikes, it enqueues the neurons index $i$
290and the current time step $t$ in \verb!Q!. In the next phase (line 6
291to 9) the algorithm calls \verb!collect! on the queue to aggregate
292current destined to the $i$:th neuron at time step $t$. The call to
293\verb!update_psc! adds the aggregated current to the neurons
294presynaptic current. Finally, the algorithm removes the current it
295just handled from the queue. In this listing we include the
296\verb!for t in range(n_tics)! loop which shows that the algorithm
297repeats the two phases \verb!n_tics! times. However, for brevitys
298sake, we omit this outer loop in the following listings.
299
300The work required for updating the neurons state for one tick is in
301the order of $O(N)$, where $N$ is the number of neurons --
302i.e. linear. However, for transferring spikes it is $O(f\Delta
303tpN^2)$, where $p$ is the probability that two randomly chosen neurons
304are connected by one or more synapses, $f$ the average spiking rate,
305and $\Delta t$ the tick duration. So the transfer time is proportional
306to the networks density and quadratic in $N$ -- i.e. for large $N$ it
307dominates. Furthermore, updating the membranes and presynapses
308require a handful of multiplications per neuron -- cheap on modern
309hardware -- whereas spike transfer requires expensive reads and writes
310to and from non-contiguous memory. Hence, we will focus on the
311transfer phase which is where synchronous simulators spend most of
312their time in the remainder of this section.
313
314\subsection{Pushing and Pulling}
315\label{sec:push-pull}
316\begin{figure}
317 \begin{lstlisting}[style=pseudocode]
318parfor i in range(n_neurons):
319 spikes = False
320 if R[i] == 0:
321 x = p22*U[i] + p21*I[i]
322 spikes = x >= U_thresh
323 if spikes:
324 x = 0
325 R[i] = t_ref_tics
326 U[i] = x
327 else:
328 U[i] = 0
329 R[i] -= 1
330 if spikes:
331 parfor j, d, w in syns_from(i):
332 atomic W[t + d, j] += w
333 I[i] = p11*I[i] + T[t, i]*wpsn
334 I[i] += W[t, i]\end{lstlisting}
335 \caption{One time step of push-based spike transfer}
336 \label{lst:push}
337\end{figure}
338
339\begin{figure}
340 \begin{lstlisting}[style=pseudocode]
341parfor i in range(n_neurons):
342 A[t, i] = False
343 if R[i] == 0:
344 x = p22*U[i] + p21*I[i]
345 if x >= U_thresh:
346 A[t, i] = True
347 x = 0
348 R[i] = t_ref_tics
349 else:
350 x = 0
351 R[i] -= 1
352 U[i] = x
353 s = 0
354 for j, d, w in syns_to(i):
355 if A[t - d, j]:
356 s += w
357 I[i] = p11*I[i] + T[t, i]*wpsn + s\end{lstlisting}
358 \caption{One time step of pull-based spike transmission}
359 \label{lst:pull}
360\end{figure}
361
362\noindent Researchers have identified ‘‘pushing’’ and ‘‘pulling’’ as
363two general strategies for designing algorithms for graph problems
364(\cite{besta2017}; \cite{grossman2018}; \cite{ahangari2023}). A
365push strategy transfers signals \textit{from a node to its neighbours}
366by writing to the neighbours incoming signal buffers. In contrast, a
367pull strategy transfers signals \textit{to a node from its neighbours}
368by swee** through the nodes neighbours and checking whether they
369have any signals to be delivered to the node. I.e., the receiver node
370has to ‘‘go and ask’’ its neighbours whether they have signals for
371them. SNN simulation is a graph problem involving node-to-node
372transfer of signals and it too can be characterized in terms of
373pushing and pulling. Listing \ref{lst:push} and \ref{lst:pull}
374illustrate the two strategies. In both listings, the values of the
375scalars \verb!p11!, \verb!p21!, and \verb!p22! come from
376\cref{eqn:p11p22,eqn:p21}. The arrays \verb!U!, \verb!I!, and \verb!R!
377contain the neurons membrane potentials, presynaptic potentials, and
378refractory counters. When the neurons membrane potential exceeds
379\verb!U_tresh! the neuron spikes and becomes refractory for
380\verb!t_ref_tics! tics. The expression \verb!T[t, i]! denote the
381number of thalamic spikes received by the $i$:th neuron at the $t$:th
382time step.
383
384The strategies differ in how they transfer spikes. The push strategy
385transfers them when \verb!spikes! indicates that a neuron spikes. It
386calls \verb!syns_from! to fetch an iterator of all synapses
387\textit{originating} from the $i$:th neuron. The three-tuples $(j, d,
388w)$ it retrieves represent the synapses; $j$ the index of the
389destination neuron, $d$ the delay in time steps, and $w$ the
390current. For every three-tuple it writes to an element of \verb!W!, a
391two-dimensional array buffering current to be delivered. The element
392\verb!W[t, i]! is the amount of presynaptic current the $i$:th neuron
393receives at the $t$:th time step. The \verb!syns_to! call in listing
394\ref{lst:pull} works exactly like the \verb!syns_from! call, except it
395returns all synapses \textit{terminating} at the $i$:th neuron and $j$
396in the three-tuples $(j, d, w)$ denotes the \textit{originating}
397neuron. Array elements \verb!A[t, i]! indicate whether the $i$:th
398neuron spikes at time $t$ or not. The size of the longest synapse
399delay is in practice bounded and small so both \verb!A! and \verb!W!
400can be implemented as statically sized wrap-around arrays -- a
401technique we cover in \cref{sec:sizing}. Note that every neuron can be
402handled independently so we use the \verb!parfor! construct for both
403algorithms outer loops.
404
405Pushing of synaptic current to their neighbours happens on line 16 and
40617 of the push strategys listing. Neurons spike infreqently, but when
407they do, the algorithm must retrieve all their synapses and write
408their current to the array \verb!W!. This part of the algorithm is
409performance-critical. First, the algorithm accesses all the neurons
410synapses which can be expensive, even if they are stored in contiguous
411memory. Second, the algorithm gathers and scatters data from and to
412uncorrelated indices of \verb!W!. These are costly operations since
413the accesses to \verb!W! cannot be coalesced or cached.\footnote{The
414memory addresses are too far apart for any coalescing or caching
415scheme to be efficient.} To make matters worse, at one time step
416multiple neurons can write to the same indices of \verb!W!. I.e.,
417\verb!W[t + d, j] += w! has to be executed atomically to avoid data
418races. As the model is densely connected, races are not uncommon and
419should be accounted for. Partition-awareness, as suggested by
420\cite{besta2017}, is not an option because of the density and
421randomness of the connections, making most of them remote and not
422local.
423
424\begin{figure}
425 \begin{lstlisting}[style=pseudocode]
426for i in range(n_neurons):
427 {Update neuron state as before}
428 if spiked:
429 Q = enqueue(Q, i)
430for i in contents(Q):
431 parfor j, d, w in syns_from(i):
432 W[t + d, j] += w
433Q = clear(Q)\end{lstlisting}
434\caption{Deferred push-based spike transfer}
435\label{lst:push2}
436\end{figure}
437
438The pull-based algorithm instead has the receiver neurons responsible
439or ‘‘pulling in’’ current. Spiking merely sets an element in \verb!A!
440to true and does not trigger current transfer. On subsequent time
441steps, neurons connected to that neuron checks if it spiked at time
442$t-d$ and, if so, adds its synaptic current. The upside of this
443algorithm is that it is synchronization-free; neurons do not write to
444shared memory. The major drawback is that every neuron at every time
445step must check all its incoming synapses to see from which of them it
446receives current. Additionally, the algorithm reads from scattered
447memory on line 18. The large amount of data it reads probably makes
448it inefficient, unless the number of computational units is
449large and memory reads are substantially cheaper than writes. Neither
450of which is true for our FPGA. Consequently, we focus on push-based
451spike transfer in this work.
452
453Listing \ref{lst:push2} shows a variant of push-based spike transfer
454that first collects spiking neurons and then transfers their synapses
455current in a dedicated phase. Collection can either be done with a
456marking array, at the expensive of wasting memory, or with a queue (as
457in the listing), at the expense of making loop iterations
458dependent. The choice depends on the target platform. Though,
459splitting the update and spike transfer into two phases reduces the
460number of cache conflicts which is advantagenous. Lines
46113 to 15 of the basic push algorithm can flush prefetched and cached
462parts of the \verb!U!, \verb!I!, and \verb!R! arrays.
463
464\subsection{Buffer Sizing and Wrap**}
465\label{sec:sizing}
466
467To reduce the size of the \verb!W! array we use ‘‘wrap-around
468indexing’’. The indices of the rows in \verb!W! written to in one time
469step lies within the interval $t + 1$ to $t + d_\mathrm{max} - 1$,
470where $d_\mathrm{max} - 1$ is the networks largest synaptic delay and
471is small. Moreover, at time step $t$ rows $0$ to $t - 1$ will not be
472read again so that space can be reused. We do that by setting the
473number of rows in \verb!W! to $d_\mathrm{max}$ and use modular
474arithmetic to index rows. The expressions for accessing \verb!W! on
475line 15 and 17 in listing \ref{lst:push} become
476\verb!W[(t + d) % D_MAX, j]! and \verb!W[t % D_MAX, i]!,
477respectively. We choose a large enough value for $d_\mathrm{max}$ by
478evaluating the cumulative distribution function of the Gaussians we
479sample the slower excitatory synapses delays from, $\mathcal{N}(1.5,
4800.75)$:
481\begin{equation}
482 P(\mathcal{N}(1.5, 0.75) \leq 6.4) \approx 0.999999999968:
483\end{equation}
484This shows that with $\Delta t = 0.1$, 64 rows is more than enough
485since the probability of sampling a synaptic delay longer than 6.4 ms
486is astronomically low. It is also a power of two so we use masking to
487realize the modular arithmetic. With this scheme we also have to clear
488\verb!W[t, i]! after reading it to avoid double reads.
489
490Even with only 64 rows and assuming four-byte floats, the \verb!W!
491array still consumes $64\cdot4\cdot 77169 \approx 20$ megabytes which
492is more than we can fit in on-chip memory. We could store the array in
493off-chip memory -- which is plentiful -- or use half-precision two-byte
494floats instead. Neither solution is satisfactory. As we argued in
495\cref{sec:push-pull}, we need fast reads and writes to uncorrelated
496addresses of \verb!W! which off-chip memory doesnt give us. We also
497prefer not to lower the numeric the precision as that makes the
498simulation less accurate. A third option is to store all spiking
499neurons in a queue and only activate a subset of their synapses at a
500given time step.
501
502\subsection{Just-In-Time Spike Transfer}
503\begin{figure}
504 \begin{lstlisting}[style=pseudocode]
505parfor i in range(n_neurons):
506 {Update U[i], R[i] as before}
507 spiked[i] = {True if neuron spiked}
508 I[i] = p11*I[i] + T[t, i]*wpsn
509for rt in range(D_MAX):
510 delay = (t - rt) % D_MAX
511 if delay < D_MAX - 1:
512 for n in enqueued_at(Q, rt):
513 syns = syns_from(n, delay)
514 parfor j, d, w in syns:
515 I[j] += w
516 else:
517 Q = dequeue(Q, rt)
518rt = t % D_MAX
519for i in range(n_neurons):
520 if spiked[i]:
521 Q = enqueue(Q, rt, i)\end{lstlisting}
522 \caption{Three-phase just-in-time spike transfer}
523\label{lst:jit}
524\end{figure}
525
526\noindent By activating all the neurons synapses at once, the push
527algorithm works harder than necessary. Obviously, all synapses must
528\textit{eventually} be activated, but \textit{right now} only synapses
529with a delay of one time step must be activated since the current they
530transfer will be read at the next time step. This observation leads us
531to a ‘‘lazy’’ just-in-time algorithm which keeps all spiking neurons
532in a queue for a fixed number of time steps. At every time step it
533activates those synapses whose current is read at the next time
534step. Thus, a neuron that spikes at time $t$ will have its
535one-tick-delay synapses activated at time $t + 1$, its two-tick-delay
536synapses at time $t + 2$, and so on. Listing \ref{lst:jit} sketches a
537three-phase algorithm built on this idea. The first phase (line 1 to
5384) updates \verb!U! and \verb!R! as before and marks spiking neurons
539in \verb!spiked!. The third phase (lines 14 to 17) enqueues the marked
540neurons with a ‘‘relative timestamp’’, \verb!rt!. The second phase
541(lines 5 to 13) scans the queue and calls \verb!enqueued_at! to fetch
542previously enqueued neurons. The \verb!syns_from! function works as
543before, but now has a second parameter to select synapses with the
544given delay. Suppose that the simulator simulates time step 100 and
545that \verb!rt! is 10 in one iteration of the loop. Since $26 \equiv
546(100 - 10) \mod d_{\mathrm{max}}$ ($d_{\mathrm{max}}=64$) all synapses
547with 26 time steps of delay of queued neurons with a relative
548timestamp of 10 will be activated. These neurons ought to have been
549stored at time step $t = 100 - 26 = 76$ which is indeed the case since
550$10 \equiv 74 \mod d_\mathrm{max}$. After the neurons have stayed in
551the queue for \verb!D_MAX - 1! time steps, the \verb!dequeue! call
552evicts them. Though with a sensible FIFO implementation eviction is a
553no-op so we omit it in future pseudo-code.
554
555How much memory do the just-in-time algorithm save? Through
556experimentation we found that the average number of spiking neurons
557per time step is 23, meaning that the expected number of
558elements in \verb!Q! is $23\cdot d_{\mathrm{max}}=1472$. We generously
559round it up to 4096 and, as neuron indices take four-bytes to store,
560the total size of the queue is 16 kilobytes, which comfortably fits in
561on-chip memory.\footnote{As in \cref{sec:sizing}, we could use the
562cumulative distribution function to show that 4096 elements is enough
563to make the risk of overflow virtually zero.}
564
565To improve the just-in-time algorithm we use ‘‘lanes’’. Essentially,
566we duplicate the \verb!W! array so that synapses of multiple spiking
567neurons can be activated simultaneously. The algorithm writes the
568synaptic current of the first neuron it handles in the first lane, of
569the second neuron in the second lane, and so on. Synaptic current of
570neurons in different lanes does not interfere. The update phase has to
571be adjusted accordingly and must sum the incoming current from all
572lanes. As each lane consumes about 320 kb of memory we can fit at most
57316 lanes.
574
575\subsection{Horizon-Based Spike Transfer}
576\begin{figure}
577 \begin{lstlisting}[style=pseudocode]
578parfor i in range(n_neurons):
579 I[i] += W[t % H, i]
580 W[t % H, i] = 0
581 {update U[i] and R[i] as before}
582 I[i] = p11*I[i] + wspn*T[t, i]
583for i in range(D_MAX / H):
584 rt = (t - H*i - 1) % D_MAX
585 d_from = H*i + 1
586 d_to = d_from + H
587 for n in enqueued_at(Q, rt):
588 syns = syns_from(
589 n, d_from, d_to)
590 parfor j, d, w in syns:
591 W[(d + t) % H, j] += w
592rt = t % D_MAX
593for i in range(n_neurons):
594 if spiked[i]:
595 Q = enqueue(Q, rt, i)\end{lstlisting}
596 \caption{Three-phase spike transfer with configurable horizon}
597 \label{lst:horizon2}
598\end{figure}
599
600\noindent The just-in-time algorithm requires a fair amount of
601bookkee**. It activates the spiking neurons synapses
602$d_{\mathrm{max}}$ times instead of just once, and the loop on line 10
603and 11 iterates many fewer times than the corresponding loop on line
60414 and 15 of the basic push algorithm. This loop is an important
605source of parallelism and running it many times with fewer iterations
606is much worse for performance than running it fewer times with many
607iterations. Our solution is to reintroduce a smaller version of the
608spike buffer whose number of rows, $h$ is a factor of $d_\mathrm{max}$
609so that when a neuron spikes we write to the buffer $d_\mathrm{max} /
610h$ times. Listing \ref{lst:horizon2} shows the concept. The
611\verb!syns_from! function now retrieves synapses of the neuron whose
612delay is within the range $d_\mathrm{from}$ to $d_\mathrm{to} -
6131$. Suppose $d_\mathrm{max}=64$, $t=100$ and $h=16$. The loop on lines
6145 to 13 iterates four times, the relative timestamps assumes the
615values 35, 19, 3, and 51, and the half-open intervals the values [1,
616 17), [17, 33), [33, 49), and [49, 65). Thus, the inner loop on line
617 10 to 14 activates all synapses with the given delays of the
618 neurons stored at the given relative timestamps. With this
619 scheme we trade-off on-chip memory for better concurrency.
620
621Note that we add the current from the spike buffer to the presynaptic
622potential on line 2, before we update the membranes potential and
623we add one to \verb!d_from! on line 7. This is necessary since the
624algorithm transfers spikes one time step later than the basic
625push algorithm.
626
627\subsection{Storing Synapses}
628\label{sec:syn-store}
629\begin{figure}
630 \begin{lstlisting}[style=pseudocode]
631def syns_from(n, d_from, d_to):
632 start = X[n, d_from]
633 end = X[n, d_to]
634 for i in range(start, end):
635 yield S[i]\end{lstlisting}
636\caption{Indexed access to synapse data}
637\label{lst:index}
638\end{figure}
639
640\noindent Previous sections pseudo-codes imply that it is very
641important that synapses can be queried by their sender neuron
642quickly. In particular, synapses should be stored so that one
643\verb!syns_from! call only accesses memory in one contiguous chunk. We
644fullfill these goals by storing the synapses as an array sorted on
645sender neuron, receiver neuron, and delay that we query with a
646prebuilt index keyed on sender neuron and delay. This means that
647finding all synapses for a particular neuron or neuron-delay
648combination requires only two index lookups; one for the first synapse
649and one for the last synapse. Moreover, as the index implicitly stores
650the sender neuron, we only need eight bytes to represent a synapse;
651four for the single-precision weight (32 bits), three for the
652destination neurons id (17 bits), and one for the synaptic delay (6
653bits).\footnote{With just-in-time transfer the delay is also stored
654implicitly.} Listing \ref{lst:index} shows how \verb!syns_from!
655performs index look-ups. The two-dimensional array \verb!X!
656represents the index and \verb!S! the synapse array so that
657\verb!X[n, d_from]! contains the index in \verb!S! where the first
658synapse of neuron \verb!n! with delay \verb!d_from! is
659stored. Ideally, the synapses should also be prefetched.
660
661\begin{figure}
662 \begin{lstlisting}[style=pseudocode]
663parfor c in range(N_CLS):
664 for i in contents(Q):
665 syns = syns_from(i, c)
666 for j, d, w in syns:
667 W[t + d, j] += w
668Q = clear(Q)\end{lstlisting}
669\caption{Spike transfer with partitioned synapses}
670\label{lst:parts}
671\end{figure}
672
673\subsection{Disjoint Synapses}
674\label{sec:disjoint}
675
676A headache for push-based spike transfer is the data race caused by
677multiple synapses delivering current received by the same neuron at
678the same time step. This is why we cant use \verb!parfor! on line 5
679of listing \ref{lst:push2}, for example. We cannot completely solve
680this problem, but we can alleviate it by partitioning the synapses
681into disjoint \textit{classes}, so that synapses of different classes
682never trigger writes to the same memory addresses at the same
683time. Then every class can be handled by a separate thread. The
684pseudocode in listing \ref{lst:parts} modifies the transfer phase of
685the deferred push algorithm from \ref{lst:push2} to exploit of this
686idea.\footnote{We can of course us the same technique to improve all
687other algorithms we have discussed.} The constant \verb!N_CLS!
688denotes the number of synapse classes and the extra argument to
689\verb!syns_from! which synapse class to query. Many ways of
690partitioning the synapses into disjoint classes are possible. For
691example, by delay so that one thread writes one-time step synapses,
692the next thread two-time steps synapses, and so on. Another by
693destination neuron so that one thread handles synapses going to
694neurons whose index is between 0 and 199, another those
695between 200 and 399, and so on.
696
697\begin{figure}
698 \begin{lstlisting}[style=pseudocode]
699o0 = X[i][0]
700o1 = X[i][D_MAX]
701parfor c in range(N_CLS):
702 for o in range(o0 + c, o1, N_CLS):
703 j, d, w = S[o]
704 W[(t + d) % D_MAX, j] += w\end{lstlisting}
705 \caption{Spike transfer over interleaved synaptic storage.}
706 \label{lst:parts2}
707\end{figure}
708
709We choose to partition by the destination neurons congruence
710class. The method has low overhead and evenly distributes the synapses
711over the classes since the least significant bits of the neuron index
712is almost random. To retain the contiguous storage we interleave the
713synapses. That is, if a neurons synapses are found at indexes $o_0$
714to $o_1 - 1$, then all synapses of class $c$ are stored at indices
715$o_0 + n_ci + c$, where $n_c$ is the number of classes and $i$ is a
716non-negative integer. Interleaving synergizes with banked memory
717common on many GPUs. On our FPGA, it means that we can have
718conflict-free dedicated memory ports for every synapse class. The
719method causes some memory waste however. For example, if there are
720four classes and all destination neuron indices of all synapses of
721some neuron happen to be congruent with $2 \mod 4$, then all indices
722other than $o_0 + 4i + 2$ will be vacant. I.e., 75% of the space will
723go to waste. In general, the memory consumption for storing a neurons
724synapses grows from $n_ss$, where $n_s$ is its number of synapses and
725$s$ the synapse size to $n_cls$, where $l$ is the number of synapses
726in the neurons largest class. In practice, the memory waste
727is manageable; in the order of 5-30% depending on the number of
728classes and horizon. The more the classes and the shorter the horizon
729the more uneven the classes become and the more waste. We do not mark
730vacancies and instead fill them with synapses carrying no current and
731terminating at idempotent neurons. This way, the code in listing
732\ref{lst:parts2} does not need to check whether the index is vacant.
733
734\subsection{More on Parallelism}
735\label{sec:more-para}
736\begin{figure*}
737 \begin{subfigure}{0.48\textwidth}
738 \begin{lstlisting}[style=pseudocode]
739parfor i in range(n_neurons):
740 {Update neuron state as before}
741 if spikes:
742 write(to_transfer, i)
743write(to_update, DONE)
744read(to_update)\end{lstlisting}
745 \caption{Update kernel}
746 \end{subfigure}
747 \begin{subfigure}{0.48\textwidth}
748 \begin{lstlisting}[style=pseudocode]
749while True:
750 i = read(to_transfer)
751 if i == DONE:
752 break
753 parfor j, d, w in syns_from(i):
754 W[t + d, j] += w
755 write(to_update, True)\end{lstlisting}
756 \caption{Transfer kernel}
757 \end{subfigure}
758 \caption{Basic spike transfer using two kernels}
759 \label{lst:multi}
760\end{figure*}
761
762\noindent Our algorithms source of parallelism is the \verb!parfor!
763keyword. Iterations of such loops are independent and can be executed
764concurrently in duplicated hardware. The replication is realized
765differently on different targets. On CPUs we use SIMD and on GPUs the
766single-instruction multiple threads (SIMT) execution model
767‘‘automatically’’ parallelizes the computation. With Intels OpenCL
768SDK we use the \verb!#pragma unroll N! and \verb!#pragma ivdep!
769compiler directives to instruct the compiler to replicate loop
770hardware. Essentially, this ‘‘widens’’ data paths, allowing more data
771to be processed in parallel. But FPGAs can also run more data paths in
772parallel, akin to how multiple CPU cores can run multiple threads. We
773implement this parallelism by dividing the algorithms over multiple
774concurrent communicating kernels.
775
776Listing \ref{lst:multi} restructures the deferred push-based spike
777transfer algorithm in \ref{lst:push2} as two kernels. It uses two
778blocking FIFOs, \verb!to_transfer! and \verb!to_update!, that the
779kernels can \verb!read! and \verb!write! to. When a neuron spikes, the
780update kernel sends its index to the spiking kernel which activates
781that neurons synapses. Hence, the neuron update and spike transfer
782phases run in parallel. When the update kernel has updated all neurons
783it sends a \verb!DONE! message and waits for a reply from the spike
784transfer kernel. When it receives one, it knows that the spike
785transfer kernel has transferred all spikes and it proceeds to the next
786time step. As inter-kernel communication is not well-supported in
787OpenCL, we implement it using the low-latency Intel-specific channel
788extension.
789
790\begin{figure*}
791 \begin{subfigure}{0.48\textwidth}
792 \begin{lstlisting}[style=pseudocode]
793for i in range(n_neurons):
794 I[i] += read(to_update)
795A = []
796for i in range(n_neurons):
797 {Update neuron state as before}
798 if spiked[i]:
799 A.append(i)
800write(to_transfer, A)\end{lstlisting}
801 \caption{Update kernel}
802 \end{subfigure}
803 \begin{subfigure}{0.48\textwidth}
804 \begin{lstlisting}[style=pseudocode]
805for i in range(n_neurons):
806 write(to_update, W[t % H, i])
807 W[t % H, i] = 0
808for i in range(D_MAX / H):
809 rt = (t - H*i - 1) % D_MAX
810 for n in enqueued_at(Q, rt):
811 syns = syns_from(n, rt + H)
812 parfor j, d, w in syns:
813 W[(d + t) % H, j] += w
814for n in read(to_transfer):
815 Q = enqueue(Q, t % D_MAX, n)\end{lstlisting}
816 \caption{Transfer kernel}
817 \end{subfigure}
818 \caption{Horizon-based spike transfer using two kernels}
819 \label{lst:multi-horiz}
820\end{figure*}
821
822Listing \ref{lst:multi-horiz} similarily restructures the
823horizon-based algorithm as two kernels. Since the \verb!W! array is
824stored in local memory, the multi-kernel approach requires an explicit
825synchronization step (lines 1 and 2, and 1 to 3) for sending
826presynaptic current from the transfer kernel to the update kernel over
827the \verb!to_update! channel. After synchronization, the transfer
828kernel can freely write synaptic current from queued neurons to
829\verb!W!. While updating neurons, the update kernel keeps track of
830which of them spiked and sends them to the transfer kernel. The
831transfer kernel adds them to its queue \verb!Q!.
832
833\subsection{Implementations}
834\label{sec:impl}
835\begin{figure}
836 \center
837 \includegraphics[width=\linewidth]{tree}
838 \caption{Taxonomy of our simulator implementations}
839 \label{fig:taxonomy}
840\end{figure}
841\begin{figure*}
842 \center
843 \includegraphics[width=\linewidth]{impl_gmem}
844 \caption{Schematic of the monolithic gmem implementations}
845 \label{fig:gmem}
846\end{figure*}
847\begin{figure*}
848 \center
849 \includegraphics[width=\linewidth]{impl_horizon}
850 \caption{Schematic of the multi-kernel horiz implementations}
851 \label{fig:gmem}
852\end{figure*}
853
854Given the algorithms and data structures presented, we have created
855twelve simulator families arranged into a taxonomy in
856\cref{fig:taxonomy}. The taxonomys second level groups the simulators
857on their spike transfer algorithm. It can either be the basic
858push-algorithm from listing \ref{lst:push} (gmem), the just-in-time
859algorithm from listing \ref{lst:jit} (jit), or the horizon-based
860algorithm from listing \ref{lst:horizon2} (horiz). These three
861algorithms differ in how eagerly they activate synapses. The gmem
862algorithm is maximally eager and activates all the spiking neurons
863synapses at once. Thus, it has to use global off-chip memory to transfer
864spikes, hence its name. The jit algorithm is maximally lazy and only
865activates synapses whose current should arrive at the next time
866step. The horizon algorithm is a mix of the two and activates as many
867synapses as its horizon allows. The taxonomys third level shows
868whether the simulator uses multiple communicating kernels (multi) or a
869single, monolithic kernel (mono). The last level specifies whether the
870simulator uses double (d) or single precision (s) floating point
871values for neuron state. This applies to the membrane potentials,
872presynaptic currents, and their related coefficients, but not to the
873spikes and spike transfer buffers which always are in
874single-precision. In the following, we use a notation based on their
875taxonomic grou** to refer to the simulators. For example,
876‘‘horiz/multi/s’’ refers to the horizon-based multi-kernel
877single-precision simulators.
878
879All simulators also have parameters for tuning their performance
880characteristics. Some are unique to certain families while others are
881shared. The \textit{update width} is the most
882fundamental one and controls how many neurons the simulator updates at
883once. We implement it by unrolling loops such as the loop on line 1 to
8844 of listing \ref{lst:jit}. More loop-unrolling is not always better
885as it causes the compiler to duplicate the hardware used to synthesize
886those loops. The gmem/mono and gmem/multi simulators have a
887\textit{synapse unroll} (SU) parameter which controls how many times
888their spike transfer loops are unrolled (line 6 and 7 of listing
889\ref{lst:push2} and lines 5 and 6 of listing \ref{lst:multi}). More
890unrolling allows more synapses to be fetched from memory
891simultaneously. The \textit{synapse classes} (SC) parameter determines
892the number of congruence classes (see \cref{sec:disjoint}). More
893classes allows more synapses to be activated in parallel, but also
894increases memory usage. The horiz simulators have a
895\textit{horizon length} (H) parameter which determines how many time
896steps worth of synapses the simulator activates at once. The jit
897simulators have a \textit{lane count} (LC) parameter that controls how
898many lanes they uses to transfer current. For convenience and
899efficiency, all parameters are positive powers of two.
900
901\section{Results}
902\label{sec:results}
903
904We evaluate our simulators on three axes -- correctness, speed, and
905energy consumption -- using the same experimental regimen for each. We
906synthesize every simulator for our FPGA with Intel FPGA SDK for OpenCL
907version 21.2 and use it to simulate ten seconds of biological time
908with a time step of 0.1 ms (i.e. 100 000 ticks) on ten different
909random microcircuit instances. From these 100 runs we compute mean
910values. We run all simulators on the full version (scale 1.0) of the
911microcircuit. Due to time constraints, we only synthesize a handful of
912simulators for every family and parameter combination. Some simulators
913could achieve better performance with a more exhaustive search.
914
915\subsection{Correctness}
916\begin{figure*}
917 \center
918 \includegraphics[width=\linewidth]{scatter}
919 \caption{Spike plot of the first 1000 ms of simulation. The
920 rhythmic nature of the microcircuits spiking pattern is apparent.}
921 \label{fig:scatter}
922\end{figure*}
923\begin{figure*}
924 \begin{subfigure}{.33\textwidth}
925 \centering
926 \includegraphics[width=0.9\linewidth]{counts}
927 \caption{Spiking rate}
928 \end{subfigure}
929 \begin{subfigure}{.33\textwidth}
930 \centering
931 \includegraphics[width=0.9\linewidth]{cvisi}
932 \caption{CV ISI}
933 \end{subfigure}
934 \begin{subfigure}{.33\textwidth}
935 \centering
936 \includegraphics[width=0.9\linewidth]{pearsonccs}
937 \caption{Pearson corr. coeff.}
938 \end{subfigure}
939 \caption{Kernel density estimates of \textbf{a)} spikes per second,
940 \textbf{b)} covariance of interspike intervals, and \textbf{c)}
941 Pearson correlation coefficient between binned spike trains for
942 neuron samples. Blue lines for NEST, gold for gmem/mono/s, and
943 green for jit/mono/s.}
944 \label{fig:stats}
945\end{figure*}
946\begin{figure}
947 \center
948 \includegraphics[width=\linewidth]{kldiv}
949 \caption{Kullback-Leibler (KL) divergence between distributions of
950 single-neuron firing rates, CV ISIs, and Pearson correlation
951 coefficients. Blue bars show KL divergences between NEST
952 simulations run on different random seeds, gold and green bars
953 between NEST and gmem/mono/s and jit/mono/s run on the same random
954 seeds.}
955 \label{fig:kldiv}
956\end{figure}
957
958We verify our simulators correctness using the same methods
959\cite{knight2018}; \cite{vanalbada2018}; \citep{potjans2014}; and
960others used. I.e., we simulate the microcircuit initialized with the
961same random seed with our simulators and with grid-based
962(double-precision) NEST which we treat as our reference. We discard
963spikes during the first second and compute the following statistics
964over the remaining nine seconds over the eight populations; rate of
965spikes per second, covariant of interspike intervals, and Pearson
966correlation coefficient over binned spike trains. We smooth each
967distribution with Gaussian kernel density estimation with bandwidth
968selected using Scotts Rule.
969
970\Cref{fig:stats} shows the distributions plotted for NEST in blue, the
971gmem/mono/s family in gold, and the jit/multi/s family in green. The
972plots indicate that the simulators produce distributions that are very
973similar to NESTs which implies that the accuracy loss caused by the
974reduced numerical precision is negligible. We do not plot the
975distributions for our other families of single-precision simulators as
976they are even closer to NESTs distributions. Neither do we plot
977distributions for our double-precision simulators as they produce
978results that are spike-for-spike identical to NEST. The
979Kullback-Leibler (KL) divergence between the distributions, shown in
980\cref{fig:kldiv}, quantifies the apparent similarities. The figures
981blue, gold, and green bars show the KL divergences between two NEST
982simulations initialized with different random seeds, between NEST and
983gmem/mono/s, and between NEST and jit/mono/s. The latter two
984initialized with identical seeds. The divergence between two random
985seeds are much larger than between NEST and our simulators, indicating
986that they are accurate.
987
988The non-associativity of IEEE 754 floating-point is the main reason
989for the small differences. The order presynaptic current is added
990depends on the spike transfer algorithm, which affects rounding. The
991differences are stochastic and do not bias the result.
992
993\subsection{Simulation speed}
994
995%% \begin{table*}
996%% \centering
997%% \footnotesize
998%% \begin{tabular}{crrrrrrrrrr}
999%% \toprule
1000%% \textbf{Simulator} & \textbf{UW} & \textbf{SU} & \textbf{H} & \textbf{SC} & \textbf{RTF} & \textbf{Fmax} & \textbf{ALUT} & \textbf{FF} & \textbf{RAM} & \textbf{DSP}\\
1001%% \midrule
1002%% gmem/mono/s & 1 & 2 & n/a & n/a & 5.84 & 608 & 5% & 3% & 13% & 0%\\
1003%% ’’ & 2 & 2 & n/a & n/a & 5.24 & 608 & 5% & 3% & 14% & 0%\\
1004%% ’’ & 4 & 2 & n/a & n/a & 4.88 & 608 & 5% & 3% & 14% & 1%\\
1005%% ’’ & 8 & 2 & n/a & n/a & 4.78 & 609 & 6% & 3% & 13% & 1%\\
1006%% ’’ & 8 & 4 & n/a & n/a & 4.82 & 607 & 7% & 4% & 15% & 1%\\
1007%% ’’ & 8 & 8 & n/a & n/a & 5.07 & 572 & 8% & 5% & 17% & 1%\\
1008%% ’’ & 16 & 2 & n/a & n/a & 4.63 & 607 & 7% & 4% & 14% & 1%\\
1009%% ’’ & 16 & 4 & n/a & n/a & 4.72 & 607 & 8% & 4% & 15% & 2%\\
1010%% ’’ & 32 & 2 & n/a & n/a & 4.66 & 600 & 9% & 4% & 14% & 3%\\
1011%% \specialrule{0.25pt}{1pt}{1pt}
1012%% gmem/mono/d & 8 & 2 & n/a & n/a & 4.75 & 612 & 16% & 7% & 19% & 2%\\
1013%% ’’ & 16 & 2 & n/a & n/a & 4.65 & 600 & 27% & 11% & 19% & 4%\\
1014%% ’’ & 32 & 2 & n/a & n/a & 4.64 & 602 & 49% & 20% & 21% & 9%\\
1015%% \specialrule{0.25pt}{1pt}{1pt}
1016%% gmem/multi/s & 8 & 2 & n/a & n/a & 4.77 & 611 & 5% & 3% & 13% & 1%\\
1017%% ’’ & 16 & 2 & n/a & n/a & 4.92 & 608 & 6% & 3% & 12% & 1%\\
1018%% \specialrule{0.25pt}{1pt}{1pt}
1019%% gmem/multi/d & 8 & 2 & n/a & n/a & 4.85 & 609 & 15% & 7% & 18% & 2%\\
1020%% ’’ & 32 & 2 & n/a & n/a & 4.89 & 609 & 47% & 19% & 20% & 9%\\
1021%% \specialrule{0.25pt}{1pt}{1pt}
1022%% jit/mono/s & 2 & n/a & n/a & 2 & 5.07 & 602 & 5% & 3% & 12% & 0%\\
1023%% ’’ & 4 & n/a & n/a & 4 & 4.64 & 611 & 5% & 3% & 12% & 0%\\
1024%% ’’ & 8 & n/a & n/a & 8 & 4.41 & 596 & 6% & 3% & 12% & 1%\\
1025%% ’’ & 16 & n/a & n/a & 16 & 4.58 & 614 & 7% & 3% & 12% & 2%\\
1026%% \specialrule{0.25pt}{1pt}{1pt}
1027%% jit/mono/d & 8 & n/a & n/a & 8 & 4.73 & 612 & 16% & 6% & 19% & 2%\\
1028%% \specialrule{0.25pt}{1pt}{1pt}
1029%% jit/multi/s & 8 & n/a & n/a & 8 & 4.67 & 610 & 7% & 3% & 15% & 1%\\
1030%% \specialrule{0.25pt}{1pt}{1pt}
1031%% jit/multi/d & 4 & n/a & n/a & 4 & 5.27 & 604 & 11% & 5% & 24% & 1%\\
1032%% ’’ & 8 & n/a & n/a & 8 & 5.17 & 604 & 17% & 7% & 24% & 2%\\
1033%% \specialrule{0.25pt}{1pt}{1pt}
1034%% horiz/mono/s & 32 & n/a & 16 & 32 & 0.81 & 605 & 10% & 4% & 70% & 4%\\
1035%% ’’ & 64 & n/a & 16 & 64 & 0.87 & 577 & 14% & 6% & 72% & 7%\\
1036%% \specialrule{0.25pt}{1pt}{1pt}
1037%% horiz/mono/d & 16 & n/a & 16 & 16 & 0.84 & 602 & 27% & 11% & 77% & 5%\\
1038%% ’’ & 32 & n/a & 16 & 32 & 0.81 & 597 & 48% & 19% & 76% & 9%\\
1039%% \specialrule{0.25pt}{1pt}{1pt}
1040%% horiz/multi/s & 32 & n/a & 16 & 32 & 1.30 & 609 & 10% & 5% & 72% & 4%\\
1041%% \specialrule{0.25pt}{1pt}{1pt}
1042%% horiz/multi/d & 16 & n/a & 16 & 32 & 0.81 & 575 & xx% & xx% & xx% & x%\\
1043%% \bottomrule
1044%% \end{tabular}
1045%% \caption{Performance of some configurations of our simulators (Old make new!)}
1046%% \label{tbl:perf2}
1047%% \end{table*}
1048
1049\begin{table*}
1050 \centering
1051 \footnotesize
1052 \begin{tabular}{crrrrrrrrrrrr}
1053 \toprule
1054 \textbf{Simulator} & \textbf{UW} & \textbf{SU} & \textbf{H} & \textbf{SC} & \textbf{LC} & \textbf{RTF} & \textbf{Freq.} & \textbf{ALUT} & \textbf{Reg.} & \textbf{ALM} & \textbf{M20K} & \textbf{DSP}\\
1055 \midrule
1056 gmem/mono/s & 8 & 2 & n/a & n/a & n/a & 4.79 & 601 & 126k & 313k & 21% & 23% & 1%\\
1057 ’’ & 64 & 2 & n/a & n/a & n/a & 4.93 & 585 & 155k & 391k & 26% & 24% & 2%\\
1058 \specialrule{0.25pt}{1pt}{1pt}
1059 gmem/mono/d & 8 & 1 & n/a & n/a & n/a & 5.38 & 608 & 141k & 371k & 24% & 27% & 2%\\
1060 ’’ & 8 & 2 & n/a & n/a & n/a & 4.71 & 608 & 147k & 394k & 25% & 28% & 2%\\
1061 ’’ & 16 & 2 & n/a & n/a & n/a & 4.61 & 605 & 182k & 477k & 30% & 28% & 4%\\
1062 \specialrule{0.25pt}{1pt}{1pt}
1063 gmem/multi/s & 4 & 4 & n/a & n/a & n/a & 4.91 & 600 & 122k & 294k & 20% & 22% & 0%\\
1064 ’’ & 64 & 1 & n/a & n/a & n/a & 5.65 & 585 & 128k & 340k & 22% & 22% & 6%\\
1065 \specialrule{0.25pt}{1pt}{1pt}
1066 gmem/multi/d & 8 & 1 & n/a & n/a & n/a & 5.52 & 609 & 139k & 365k & 23% & 26% & 2%\\
1067 ’’ & 8 & 2 & n/a & n/a & n/a & 4.86 & 605 & 145k & 360k & 24% & 27% & 2%\\
1068 \specialrule{0.25pt}{1pt}{1pt}
1069 horiz/mono/s & 32 & n/a & 16 & 32 & n/a & 0.81 & 608 & 127k & 329k & 22% & 79% & 4%\\
1070 ’’ & 64 & n/a & 16 & 16 & n/a & 0.85 & 604 & 143k & 375k & 25% & 80% & 7%\\
1071 \specialrule{0.25pt}{1pt}{1pt}
1072 horiz/mono/d & 4 & n/a & 8 & 4 & n/a & 1.74 & 609 & 122k & 333k & 21% & 55% & 1%\\
1073 ’’ & 32 & n/a & 16 & 32 & n/a & 0.82 & 601 & 258k & 653k & 42% & 85% & 9%\\
1074 \specialrule{0.25pt}{1pt}{1pt}
1075 horiz/multi/s & 4 & n/a & 8 & 4 & n/a & 1.73 & 610 & 108k & 287k & 19% & 51% & 1%\\
1076 ’’ & 16 & n/a & 16 & 16 & n/a & 0.81 & 605 & 123k & 330k & 22% & 80% & 2%\\
1077 ’’ & 32 & n/a & 16 & 32 & n/a & 0.79 & 601 & 133k & 333k & 23% & 81% & 4%\\
1078 \specialrule{0.25pt}{1pt}{1pt}
1079 horiz/multi/d & 4 & n/a & 8 & 4 & n/a & 1.79 & 583 & 126k & 319k & 22% & 57% & 1%\\
1080 ’’ & 16 & n/a & 16 & 16 & n/a & 0.80 & 607 & 187k & 492k & 32% & 85% & 5%\\
1081 ’’ & 32 & n/a & 16 & 32 & n/a & 0.79 & 600 & 265k & 664k & 43% & 86% & 9%\\
1082 \specialrule{0.25pt}{1pt}{1pt}
1083 jit/mono/s & 16 & n/a & n/a & 16 & 16 & 1.47 & 590 & 116k & 319k & 22% & 79% & 7%\\
1084 \specialrule{0.25pt}{1pt}{1pt}
1085 jit/mono/d & 4 & n/a & n/a & 4 & 8 & 2.23 & 611 & 122k & 317k & 21% & 56% & 2%\\
1086 ’’ & 16 & n/a & n/a & 16 & 16 & 1.44 & 597 & 182k & 478k & 32% & 85% & 10%\\
1087 ’’ & 32 & n/a & n/a & 32 & 16 & 1.50 & 584 & 268k & 704k & 44% & 85% & 20%\\
1088 \specialrule{0.25pt}{1pt}{1pt}
1089 jit/multi/s & 16 & n/a & n/a & 16 & 16 & 1.43 & 593 & 120k & 325k & 22% & 79% & 7%\\
1090 \specialrule{0.25pt}{1pt}{1pt}
1091 jit/multi/d & 4 & n/a & n/a & 4 & 8 & 2.21 & 600 & 125k & 317k & 22% & 55% & 2%\\
1092 ’’ & 8 & n/a & n/a & 8 & 16 & 1.56 & 602 & 147k & 393k & 26% & 84% & 5%\\
1093 ’’ & 16 & n/a & n/a & 16 & 16 & 1.34 & 606 & 186k & 492k & 33% & 85% & 10%\\
1094 \bottomrule
1095 \end{tabular}
1096 \caption{Speed and resource usage for some simulators. The table
1097 includes each familys fastest simulator and some
1098 others for comparison purposes.}
1099 \label{tbl:perf3}
1100\end{table*}
1101
1102\begin{figure}
1103 \center
1104 \includegraphics[width=\linewidth]{update_widths}
1105 \caption{RTF as a function of update width for the fastest gmem/mono/s
1106 (blue), gmem/mono/d (gold), gmem/multi/s (green), and gmem/multi/d
1107 (red) simulators. Update widths larger than four does not decrease
1108 RTF.}
1109 \label{fig:uwidth}
1110\end{figure}
1111
1112\Cref{tbl:perf3} presents the performance and resource usage of our
1113fastest simulator configuraions. The first column shows the
1114simulators family in slash-notation (see \cref{sec:impl}). The next
1115five its parameters; \textit{update width} (UW), \textit{synapse
1116 unroll} (SU), \textit{horizon length} (H), \textit{synapse classes}
1117(SC), and \textit{lane count} (LC). The following column shows its
1118real-time factor (RTF), defined as the time taken to run the
1119simulation -- wall-clock time -- divided by the duration of the
1120simulated biological time (10 seconds). We measure the wall-clock time
1121as the time from the first \verb!clEnqueueNDRangeKernel! call until the
1122final simulation result can be read from the FPGAs memory. As the RTF
1123does not vary beyond two decimal places even on runs on networks
1124initialized with different random seeds, we just report its mean. The
1125next columns shows the simulators operating frequency in MHz and FPGA
1126resource usage. \Cref{fig:rtf} plots the RTF of the fastest
1127simulators from every family as a bar chart.
1128
1129A startling finding is that there is no significant performance
1130difference between double and single precision neuron state and
1131between multi- and single-kernel implementations. The fastest
1132horiz/mono/s and horiz/multi/d simulators RTF is 0.81 and 0.79 which
1133is very close to each other. Presumably, the spike transfer phase is
1134much more expensive than the neuron update phase so making the latter
1135run faster, either by using single-precision or by overlap** the
1136update and transfer phases, does not improve performance. For the same
1137reason, increasing the gmem simulators \textit{update width} is
1138ineffective, as \cref{fig:uwidth} shows. Doubling the \textit{update
1139 width} roughly halves the number of cycles spent updating neuron
1140state, but even that is insignificant. The performance even detoriates
1141for \textit{update width} larger than 16. Likely because the
1142FPGAs DDR interface is 512 bits wide.
1143
1144Neither the multi- nor single-kernel gmem simulators benefit from
1145more \textit{synapse unroll}. The reason could be because
1146the unrolled versions of the loop contains a false memory
1147dependency. There is no way of letting the OpenCL compiler know
1148that two iterations of the loop body -- \verb!W[t + d, j] += w! --
1149writes to distinct memory locations so the compiler refuses to
1150schedule multiple writes per clock cycle.
1151
1152The jit and horiz simulators are markedly faster than the gmem
1153simulators because they transfer spikes in on-chip memory. The best
1154gmem simulator has an RTF of 4.61 while it is 1.50 for the best jit
1155simulator and 0.79 for the best horiz simulator. The latter simulators
1156also uses up to 90% of the on-chip memory for storing the horizon and
1157lane buffers. The results suggest that the larger these buffers are
1158the better the performance. Unfortunately, our FPGA can not fit
1159horizons longer than 16 time steps or more than 16 lanes. The horiz
1160simulators performance edge over the jit simulators is due to them
1161running the spike transfer loop fewer times. The jit simulators run it
1162$d_\mathrm{max}$ times (i.e. 64) for every spiking neuron, while the
1163horizon simulators only runs it $d_\mathrm{max}/h$ times (i.e.,
116464/16=4 times). The horiz simulators also does not have to sum the
1165incoming current from multiple lanes when updating the neurons state.
1166
1167\begin{figure*}
1168 \center
1169 \includegraphics[width=\linewidth]{synapse_classes}
1170 \caption{The horizon and JIT algorithms RTF as a function of the
1171 number of \textit{synapse classes}. The \textit{horizon length}
1172 and \textit{lane count} parameters are both 16. The four colors
1173 represent the four implementation styles.}
1174 \label{fig:synapse_classes}
1175\end{figure*}
1176
1177\Cref{fig:synapse_classes} plots RTF as a function of the number of
1178\textit{synapse classes} for the horiz and jit simulators. For both,
1179performance improves until the parameter reaches 32. The more classes,
1180the more synapses can be activated in parallel which, clearly, is
1181important for performance. The drawback of increasing the number of
1182classes is memory waste. Especially for the jit simulators which
1183iterate many fewer times per spike transfer loop. With 16 classes the
1184average occupancy is only about 74%. It decreases to 47% with 32
1185classes, meaning that the simulator accesses more than twice as much
1186data than it needs. Interestingly, there is no performance
1187penalty. The benefits of handling many synapses in parallel is worth
1188lots of extra off-chip memory reads. Perhaps since we store the synapses
1189in contiguous memory reading them is quite cheap.
1190
1191All simulators run at around 600 MHz, considered very good for HLS. In
1192our experience, operating frequencies much lower than that were
1193caused by inefficiently banked on-chip memory, unaccounted for
1194loop-carried dependencies, or similar issues.
1195
1196%% The fastest horiz and jit simulators have a \textit{horizon length}
1197%% and \textit{lane count} of 16.
1198
1199%% Both the horiz and jit simulators transfer spikes through local memory
1200%% and have a large performance advantage compared to the gmem
1201%% simulators. These simulators use between 80 to 90% of local memory
1202%% and would surely benefit from having even more available. A \textit{horizon length
1203
1204%% Doubling the
1205%% \textit{horizon length} would require twice as much local memory, but
1206%% would cut the number of times the neurons synapses had to be accessed
1207%% in half.
1208
1209%% The performance of the gmem simulators is comparable to that of the
1210%% lmem/jit simulators, even though the former transmit spikes through
1211%% global memory. As they activate all of the spiking neurons synapses
1212%% in one go -- rather than spreading out the activations over multiple
1213%% time steps -- they are simpler and require less bookkee** than their
1214%% local-memory-using counterparts. Likely, this is what makes them
1215%% competitive. Neither the monolithic or multi-kernel versions of the
1216%% simulators benefit from having their spike propagation loops unrolled
1217%% more than four times because the unrolled version of the loop contains
1218%% a false memory dependency. As there is no way of letting the OpenCL
1219%% compiler know that two iterations of the loop body --
1220%% \verb!W[t + d, j] += w! -- always writes to distinct memory addresses
1221%% the compiler refuses to schedule multiple writes per cycle.
1222
1223
1224
1225%% The results also show that the multi-kernel implementations are less
1226%% efficient than the single-kernel ones. While channels are low-overhead
1227%% on Intel FPGAs, the overhead is still present and offsets whatever
1228%% benefit one gets from running the algorithms phases in
1229%% parallel. Presumably, because spike propagation is by far the most
1230%% expensive phase. This is in concordance with another observation; the
1231%% speed difference between single and double precision neuronal state is
1232%% negligible. Performance is not limited by compute resources, but by
1233%% memory bandwidth. However, double precision still carries a penalty in
1234%% that we have to use twice as much block memory to store the neuronal
1235%% state. The number of DSPs required is also larger than for
1236%% single-precision implementations.
1237
1238%% The lmem/horiz simulators are the fastest by a wide margin. As the
1239%% performance difference between the mono and multi variants indicate,
1240%% they benefit greatly from running the update and propagation phases in
1241%% parallel. This is because they use local memory to transmit spikes so
1242%% the propagation phase is, relatively speaking, cheaper and the neuron
1243%% update phase costlier. Thus, the benefit of running the phases in
1244%% parallel is bigger. Moreover, the usage of channels means that kernels
1245%% do not have to produced buffers of spiking neurons that are then
1246%% consumed in a future phase. Instead, the kernels can send spiking
1247%% neurons to the consuming kernel directly. All kernels of this family
1248%% benefit from larger horizons. Larger horizons mean that more synapses
1249%% can be activated simultaneously which is good for parallelization. It
1250%% also meshes well with larger \textit{synapse classes} -- the more
1251%% synapses per class the less memory waste. Unfortunately, the boards
1252%% block ram can not fit horizons larger than 16.
1253
1254\begin{figure}
1255 \center
1256 \includegraphics[width=\linewidth]{rtf_bars}
1257 \caption{RTF of the fastest simulator of each type. The four colors
1258 represent the four implementation styles.}
1259 \label{fig:rtf}
1260\end{figure}
1261
1262
1263\subsection{Energy Usage}
1264
1265We use the Terasic Dashboard GUI to measure the energy usage of our
1266fastest simulator -- horiz/multi/s with the parameters H=16, UW=32,
1267and SC=32. The dashboard connects to the Agilex 7 via a MAX 10 device
1268that continuously monitors the voltage and current rails going into
1269the board and the FPGA itself \citep{dkug}, allowing us to measure
1270both a ‘‘pessimistic’’ power consumption for the whole board (which
1271includes unused peripherals that draw power) and an ‘‘optimistic’’ one
1272for only the FPGA fabric. Due to the devices low sampling rate we
1273compute the total energy usage as the product of the maximum power
1274draw and the simulation time. According to our measurements, the
1275simulator pessimistically requires 44.9 W $\cdot$ 8.13 s = 101 mWh and
1276optimistically 16.3 W $\cdot$ 8.13 s = 37 mWh to simulate 10 seconds
1277of biological time. The pessimistic energy per synaptic event (metric
1278defined in \cite{vanalbada2018}) is 21 nJ and the optimistic one 9
1279nJ. These values are upper bounds and the actual energy usage may be
1280lower.
1281
1282\section{Discussion}
1283\label{sec:disc}
1284
1285Our main contribution is the presentation and analysis of methods for
1286creating FPGA-based simulators competitive in speed and energy usage
1287with the state-of-the-art in SNN simulation. We hope that others will
1288create even faster simulators by adopting and refining our algorithms
1289and implementation techniques for their hardware. Some methods are
1290endogenous to our hardware. For example, by running multiple kernels
1291in parallel that communicate with each other via channels, we overlap
1292different phases of the simulation algorithm. This technique has no
1293direct GPU-equivalent as different kernel types cannot communicate. It
1294also relies on the Intel-specific channel extension and may not work
1295well even on other vendors FPGAs. On the other hand, the technique
1296for interleaving synapses to improve banking and reduce conflicts
1297(\cref{sec:disjoint}) is adaptable to non-FPGA hardware. The same goes
1298for the horizon-based transfer algorithm which should suit any device
1299with enough fast local memory.
1300
1301Push-based SNN simulation mandates irregular memory accesses and has
1302low computational intensity; qualities that it shares with many other
1303graph processing problems. This is evidenced by the fact that most of
1304our simulators use less than 10% of the FPGAs DSP resources, while
1305many consume over 80% of its on-chip memory. More or even faster
1306arithmetic resources would be useless. However, more on-chip
1307memory would be very beneficial because we could use the basic spike
1308transfer algorithm and would not need to bother with the more complex
1309horizon algorithm. And higher bandwidth and lower latency memories
1310would allow the simulators to transmit spikes faster. This finding
1311implies that simulators would probably achieve excellent performance
1312with neuron models more complex than LIF -- as long as the
1313interactions between neurons remain the same and as long as their
1314memory consumption does not grow.
1315
1316\subsection{FPGAs for HPC}
1317
1318We believe that FPGAs have unique advantages for HPC; they can be
1319tailored for the problem at hand thanks to their
1320malleability. However, FPGAs and particularily \textit{tools for
1321 designing for FPGAs} have many glaring weaknesses. Unlike compilers
1322for software, which emit the same machine code for the same source
1323code, synthesizers use optimization algorithms based on randomness so
1324good performance is contingent on choosing lucky random seeds. Even if
1325the differences between lucky and unlucky seeds are small, the
1326randomness makes squeezing out the last few percent of performance
1327frustrating. An algorithmic change or parameter tuning causing a small
1328performance improvement may be due to chance. Moreover, a single
1329synthesis can take several hours even on top-of-the-line hardware,
1330compounding this problem. Writing performance-critical code is an
1331experimental process, wherein one needs to test hundreds or thousands
1332of ideas to see what yields the best performance. Long turn-around
1333times slows the process down. Simulators do not help as their
1334performance does not reflect the performance of real hardware. For
1335these reasons, FPGAs are decidedly more complex to develop for than
1336GPUs.
1337
1338What is a good distribution of FPGA resources for HPC? For us, our
1339boards distribution is far from perfect. Our fastest designs used
1340almost all on-chip memory, with plenty of ALUTs, FFs, and DSPs to
1341spare. If our use-case and designs are close to the norm then it would
1342be wise for FPGA vendors to trade-off logic resources for more on-chip
1343memory. And trends in the HPC field indicate that memory -- not
1344arithmetic -- very much is a limiting factor for many
1345problems. However, inevitably, whatever resource distribution the
1346vendor chooses, it will not be ideal for some designs. Optimizing an
1347FPGA so that as many designs as possible can take advantage of as much
1348of its resources as possible seems exceedingly difficult.
1349
1350\subsection{HLS for HPC}
1351
1352In this work we choose HLS over traditional design methods because of
1353its purported productivity advantages. How much performance did HLS
1354cost us and how much longer would it have taken us to implement the
1355simulators in an HDL? The question is an instance of the classic
1356dichotomy between performance and productivity that appears in many
1357corners of computer science. The literature suggests that in general
1358the productivity gains of HLS are large and the performance losses are
1359either non-existent or low \citep{lahti2018, pelcat2016}. However, one
1360can ask whether this holds true for performance-critical design?
1361
1362HLS definitely allowed us to explore algorithmic ideas at a rapid
1363pace. In particular, scheduling stallable loops and interfacing with
1364DDR would have been at least an order of magnitude more work to
1365implement (and debug!) in an HDL. It also helped that most -- but not
1366all -- of the OpenCL code we wrote were runnable verbatim on non-FPGA
1367targets. However, for performance judicious use of compiler
1368directivies is essential, something we struggled with. For example, adding
1369\verb!#pragma disable_loop_pipelining! on a loop in a gmem simulator
1370increased its operating frequency by almost 200 MHz since the compiler
1371could deduce that a memory port would not be shared across loop
1372iterations. On the other hand, removing the directive on a loop in a
1373horiz simulator more than doubled its performance! On several
1374occassions, misplaced \verb!#pragma ivdep! directives caused bugs that
1375were difficult to troubleshoot. Having to ‘‘nudge’’ the compiler via
1376directives to make the right decisions made us feel like we were not
1377fully in control and was at times frustrating.
1378
1379As we have no baseline to compare with, estimating the performance
1380cost of HLS is difficult. Our designs run at over 600 MHz which --
1381while a fair bit lower than the theoretical limit -- is in the upper
1382range of what typical Agilex 7 HDL designs runs at. If we are correct
1383in that performance is mostly bounded by memory resources, then
1384non-optimal performance is due to algorithmic choices and not due to
1385the choice of implementation language.
1386
1387\subsection{Future Research}
1388
1389\begin{figure}
1390 \center
1391 \includegraphics[width=\linewidth]{delays}
1392 \caption{Synaptic delay distributions for excitatory (blue) and
1393 inhibitory synapses (gold). Almost all probability mass lies to
1394 the left of 32 (dashed line).}
1395 \label{fig:delays}
1396\end{figure}
1397
1398Time constraints forced us to leave many ideas for better performance
1399unexplored. We list some of them here.
1400
1401Our results demonstrate that single-precision floating-point is
1402sufficient. Half-precision or some 16-bit fixed-point representation
1403could also be sufficient. If so, four bytes would be enough to
1404represent synapses which would -- more or less -- cut our simulators
1405memory demand in half.\footnote{Assuming we store some bits of the
1406delay and destination neuron index implicity in the index.} And, as the
1407synapses current is sampled from Gaussians with known means, the
1408representations size could perhaps be reduced further by storing the
1409current as \textit{deviations from the mean}. Presumably, fewer bits are
1410needed to represent the deviations accurately since they are rather
1411small.
1412
1413Along the same lines, one could exploit the fact that, as
1414\cref{fig:delays} shows, synaptic delays are not uniformly
1415distributed. Something our algorithms are oblivious to. It means, for
1416example, that the gmem simulators needto use off-chip memory for the
1417spike transfer array. One could perhaps retain the spike transfer
1418array for ‘‘fast’’ synapses with delays shorter than, say, 32 and
1419another more space-efficient format for ‘‘slow’’ synapses with longer
1420delays.
1421
1422As we have emphasized, we cannot safely transfer spikes from two
1423neurons simultaneoulsy as they may write to the same memory
1424locations. But this is only true if they have synapses with the same
1425destination neuron and delay. We can apply the idea from
1426\cref{sec:disjoint} on the ‘‘neuron level’’ and partition the
1427neurons into disjoint classes so that synapses of neurons of
1428different classes never trigger writes to the same memory locations.
1429
1430Our FPGA has four DDR memories each equipped with a 512 bit read/write
1431port. We are likely not utilizing all bandwidth properly since we use
1432the compilers default off-chip memory organization. Plausibly, we
1433could see a substantial performance improvement by carefully laying
1434out off-chip memory manually. For example, by distributing the
1435synaptic data (by far the largest data structure) over the four
1436memories.
1437
1438Another line of research could be to deploy a cluster of FPGAs for SNN
1439simulation. As the communication between the devices likely becomes
1440the bottleneck, it is a much different problem from using one device.
1441
1442\subsection{State-of-the-art}
1443\label{sec:sota}
1444\begin{table*}
1445 \centering
1446 \footnotesize
1447 \begin{tabular}{lllrrr}
1448 \toprule
1449 \textbf{Work} & \textbf{Simulator} & \textbf{Hardware} & \textbf{Node} & \textbf{RTF} & \textbf{Syn.Ev En.}\\
1450 \midrule
1451 \textbf{This work} & \textbf{horiz/mono/s} & \textbf{1 Agilex 7 FPGA} & \textbf{10} & \textbf{0.81} & \textbf{25}\\
1452 \cite{heittmann2022} & IBM INC-3000 & 432 Xilinx XC Z7045 SoC & 28 & 0.25 & 783\\
1453 \cite{kauth2023b} & neuroAIx & 35 NetFPGA SUME & 28 & 0.05 & 48\\
1454 \cite{golosio2021} & NeuronGPU & 1 GeForce RTX 2080 Ti & 12 & 1.06 & 180\\
1455 \cite{golosio2021} & NeuronGPU & 1 Tesla V100 & 12 & 1.64 & -\\
1456 \cite{knight2018} & GeNN & 1 GeForce RTX 2080 Ti & 12 & 1.40 & -\\
1457 \cite{knight2018} & GeNN & 1 Tesla V100 & 12 & 2.16 & 470\\
1458 \cite{vanalbada2018} & SpiNNaker & 217 ASIC & 130 & 20.00 & 5900\\
1459 \cite{rhodes2020} & SpiNNaker & 318 ASIC & 130 & 1.00 & 600\\
1460 \cite{kurth2022} & NEST & 2 AMD EPYC Rome & 14 & 0.53 & 480\\
1461 \bottomrule
1462 \end{tabular}
1463 \caption{State-of-the-art in hardware-accelerated microcircuit
1464 simulation. Node refers to the technology node in nanometers, RTF
1465 how much slower than realtime the simulator runs (lower is
1466 better), and Syn.Ev En. estimated energy consumption per synaptic
1467 event in nano-Joule.}
1468 \label{tab:power}
1469\end{table*}
1470\begin{figure}
1471 \center
1472 \includegraphics[width=\linewidth]{competition}
1473 \caption{RTFs of some SNN simulators}
1474 \label{fig-rtf}
1475\end{figure}
1476\begin{figure}
1477 \center
1478 \includegraphics[width=\linewidth]{energy}
1479 \caption{Energy per synaptic event of some SNN simulators}
1480 \label{fig-energy}
1481\end{figure}
1482
1483While we are far from the first to have investigated FPGA-based SNN
1484simulation, our investigation is qualitatively different from most
1485others. First, our focus is accurate SNN simulation and not solving a
1486specific
1487task. \cite{gupta2020,han2020,li2021,zheng2021,carpegna2022,liu2023,carpegna2024}
1488all implement layered SNNs that excel at classifying images from the
1489MNIST dataset. For them classification accuracy is much more important
1490than simulation speed or energy consumption. Second, many FPGA
1491simulators simulate SNNs much smaller than the Potjans-Diesmann
1492microcircuit or uses non-LIF neurons, making their results
1493incomparable to ours. For example, \cite{pani2017} simulates up to 1,440
1494Izhikevich neurons in real-time and \cite{shama2020} simulates 150
1495Hodgkin-Huxley neurons on a Virtex-2 FPGA. In contrast, NeuroFlow by
1496\cite{cheung2016} simulates up to 600,000 neurons on a cluster of six
1497FPGAs four times slower than real-time. However, it uses a
14981 ms timestemp (ten times larger than the \textit{de facto} standard
14990.1 ms), organizes neurons into two-dimensional grids, and spatially
1500constrains synapses. That is, most synapses connect to the neurons
1501nearest neighbours. \cite{trensch2022} proposes a simulator for a
1502Zynq-7000 SoC and measure its performance on a network of 800
1503excitatory and 200 inhibitory Izhikevich neurons. Their architecture
1504divides the FPGA into 16 identical processing blocks, each supporting
150564 neurons or 1,024 neurons in total per chip. They argue that if a
1506cluster of their devices were deployed, it could simulate the
1507Potjans-Diesmann microcircuit seven times faster than
1508real-time. Though they did not test their claim.
1509
1510The simulators we think are the most similar to ours are: IBM INC-3000,
1511neruoAIx, NeuronGPU, GeNN, SpiNNaker, and NEST. IBM-INC 3000 and
1512neuroAIx are both FPGA clusters, consisting of 432 Xilinx XC boards
1513and 35 NetFPGA SUME boards respectively. IBM-INC 3000 is four
1514times faster than real-time and neuroAIx twenty time faster than
1515real-time \citep{kauth2023,heittmann2022}. GeNN and NeuronGPU are both
1516GPU-based simulators written in CUDA and runs on NVIDIAs line of GPUs
1517\citep{knight2018,golosio2021}. On a Tesla V100 GPU, GeNN simulate
1518the microcircuit at about half the speed of real-time and NeuronGPU
15195% slower than real-time. SpiNNaker is an older dedicated
1520neuromorphic hardware system built out of hundreds of ASICs
1521\citep{furber2013}. NEST is a CPU-based simulator which runs
1522twice the speed of real-time on two AMD EPYC Rome nodes
1523\citep{kurth2022}.
1524
1525\Cref{tab:power,fig-rtf,fig-energy} show how our best simulator stack
1526up against the competition. The energy usage of our fastest
1527simulator compares favorably to other results. Likely
1528partially due to to the Agilex 7’s smaller technology node and to us
1529confining our implementation to one FPGA. \cite{kauth2023b} report a
1530much lower RTF than us, but use 35 boards, each drawing 26.54 W on
1531average, resulting in a higher total energy usage for the same amount
1532of biological time. They, like us, report an ‘‘all-inclusive’’ value
1533for the energy usage so a dedicated platform -- without any unused
1534peripherals -- could consume much less energy. It goes without
1535saying that fairly comparing performance of systems implemented on
1536different architectures, with different design trade-offs, and
1537accuracy constraints is tricky. One system with excellent performance
1538may be inadequate in other regards. For example, less configurability
1539may improve a systems performance, but make it unusable for certain
1540applications. It should be noted that most simulators, unlike ours,
1541replace thalamic spikes with DC input which limits their
1542applicability.
1543
1544
1545
1546
1547
1548
1549
1550
1551%% Fairly comparing performance of systems implemented on different
1552%% architectures, with different design trade-offs, and accuracy
1553%% constraints is tricky. One system with excellent performance may be
1554%% inadequate in other regards. For example, less configurability may
1555%% improve a systems performance, but make it unusable for certain
1556%% applications. Nevertheless, \cref{tab:power,fig-rtf,fig-energy}
1557%% compare our best simulators RTF and energy per synaptic event
1558%% (defined in \cref{sec:results}) with that of some well-known
1559%% simulators from the literature. GeNN and NeuronGPU are two GPU-based
1560%% simulators written in CUDA and runs on NVIDIAs line of GPUs
1561%% \citep{knight2018, golosio2021}. On a Tesla V100 GPU, GeNN simulate
1562%% the microcircuit at about half the speed of real-time and NeuronGPU
1563%% 5% slower than real-time. IBM INC-3000 and neuroAIx are two FPGA
1564%% clusters that run four times faster than real-time and twenty times
1565%% faster than real-time \citep{kauth2023, heittmann2022}. They consist
1566%% of 432 Xilinx XC SoC boards and 35 NetFPGA SUME boards,
1567%% respectively. SpiNNaker is an older dedicated neuromorphic hardware
1568%% system built out of hundreds of ASICs \citep{furber2013}. It should be
1569%% noted that, unlike ours, most of these simulators replace thalamic
1570%% spikes with DC input.
1571
1572%% As \cref{fig-energy} indicates, the energy usage of our best simulator
1573%% compares favorably to published results. This is likely partially due
1574%% to to the Agilex 7’s smaller technology node and to us confining our
1575%% implementation to one FPGA. \cite{kauth2023b} report a much lower RTF
1576%% than us, but uses 35 boards, each of which on average draws 26.54 W
1577%% which results in more energy use for the same amount of biological
1578%% simulation time. They, like us, report an ‘‘all-inclusive’’ value for
1579%% the energy usage so a dedicated platform -- without any unused
1580%% peripherals -- would conceivably use much less energy.
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590%% In addition to the above presented simulators, a number of FPGA
1591%% implementations simulate relatively small networks, which we contend
1592%% is materially different from simulating large networks. The modular
1593%% and efficient FPGA design \cite{pani2017} created simulates networks
1594%% of up to 1,440 Izhikevich neurons in real-time with a time-step of 0.1
1595%% ms. However, their design assumed a fuly-connected connectome and did
1596%% not model multapses nor spike transmission over synapses. The
1597%% simulator created by \cite{shama2020} simulated a 150-neuron
1598%% Hodgkin-Huxley (HH) SNN on a Virtex-2 FPGA. While the HH neuronal
1599%% model is more complicated than the LIF model, the authors, like
1600%% \cite{pani2017}, did not account for spike transmission delays in
1601%% their simulator.
1602
1603%% Other implementations are not for simulation per se, but for executing
1604%% a specific task. Those include the layered FPGA-based
1605%% SNN-implementations presented in \cite{gupta2020, han2020, li2021,
1606%% zheng2021, carpagena2022, liu2023, carpegna2024} which classify
1607%% images from the MNIST dataset. They all use layered SNNs which are
1608%% materially different from disorganized ones.
1609
1610%% \section{State-of-the-art}
1611
1612%% \begin{table*}[t]
1613%% \centering
1614%% \footnotesize
1615%% \begin{tabular}{llr}
1616%% \toprule
1617%% \textbf{Article} & \textbf{Board} & \textbf{MNIST accuracy}\\
1618%% \midrule
1619%% \cite{zheng2021} & Xilinx ZCU102 & 90.53%\\
1620%% \bottomrule
1621%% \end{tabular}
1622%% \caption{SNNs for classification on one FPGA}
1623%% \end{table*}
1624
1625%% \begin{table*}
1626%% \centering
1627%% \footnotesize
1628%% \begin{tabular}{lllrr}
1629%% \toprule
1630%% \textbf{Article} & \textbf{Board} & \textbf{Neuron model} & \textbf{Neuron cnt.} & \textbf{Time step}\\
1631%% \midrule
1632%% \cite{pani2017} & Xilinx XC 6VLX240T & Izhikevich & 1,440 & 0.1 ms\\
1633%% \bottomrule
1634%% \end{tabular}
1635%% \caption{SNNs for simulation of small networks on one FPGA}
1636%% \end{table*}
1637
1638
1639%% \begin{table*}[t]
1640%% \centering
1641%% \footnotesize
1642%% \begin{tabular}{llllrrr}
1643%% \toprule
1644%% \textbf{Article} & \textbf{Hardware} & \textbf{Neuron} & \textbf{Connectome} & \textbf{Neurons} & \textbf{Synapses} & \textbf{Sim. Speed}\\
1645%% \midrule
1646%% \cite{han2020} & Single FPGA & LIF & ?? & ?? & ?? & ?\\
1647%% \cite{trensch2022} & Single FPGA & IZ & ?? & 1,000 & ?? & ??\\
1648%% \cite{carpegna2024} & Single FPGA & LIF & Layered & & & n/a\\
1649%% \bottomrule
1650%% \end{tabular}
1651%% \caption{Overview of simulators for disorganized SNNs}
1652%% \end{table*}
1653
1654%% In this section we gauge the state-of-the-art for SNN-simulation on
1655%% FPGAs and explain how it relates to the implementation we
1656%% propose. Characterizing the state-of-the-art is challenging due to
1657%% differences in hardware, networks intended purpose, implementation
1658%% techniques, correctness constraints, simulation resolution,
1659%% etc. Performance is unlikely to scale linearly with network size; a
1660%% result from a small network cant easily be compared with a result
1661%% from a larger network. For instance, memory bottlenecks can cause
1662%% catastrophic degradations in performance as memory demand
1663%% increases. And, as discussed in \cref{sec:topo}, topology is extremely
1664%% important so implementations for layered SNNs cannot be fruitfully
1665%% compared with simulators for more brain-like, disorganized SNNs even
1666%% when targeting the same hardware.
1667
1668%% Some simulators are designed for FPGA clusters, or other custom
1669%% large-scale architetures, which for obvious reasons makes it difficult
1670%% to compare their perfromance with simulators utilizing a single board
1671%% (such as ours). An example of an early simulator for FPGA clusters is
1672%% NeuroFlow \citep{cheung2016}. It simulated up to 600,000 neurons on
1673%% six FPGAs only about four times slower than real time. However, the
1674%% time step was set to 1 ms (ten times larger than the \textit{de facto}
1675%% standard 0.1 ms used in many other simulators), neurons were organized
1676%% into two-dimensional grids, and synapses were spatially
1677%% constrained. That is, most synapses connected a neuron with its
1678%% immediate north, south, east, or west neighbour. neuroAIx is a more
1679%% recent FPGA cluster-based simulator \citep{kauth2023}. It uses 35
1680%% NetFPGA SUME (SoC based on Virtex 7) nodes and simulates the
1681%% Potjans-Diesmann microcircuit 20 times faster than biological
1682%% realtime. Like many of its predecessors, it approximates thalamic
1683%% input as direct current. Another large-scale architecture is the
1684%% neural supercomputer IBM INC-3000 \citep{heittmann2022}. It uses a
1685%% whop** 432 Xilinx XC Z7045 SoC and simulates Potjans-Diesmanns
1686%% microcircuit four times faster than realtime.
1687
1688
1689
1690%% For completeness sake, we also survey some GPU-based
1691%% SNN-simulators. GeNN presented in \cite{knight2018} simulates the
1692%% Potjans-Diesmann microcircuit at half the speed of real-time on a
1693%% Tesla V100 GPU. The GPU library NeuronGPU, presented in
1694%% \cite{golosio2021}, is capable of simulating the same network about
1695%% 5% slower than realtime.
1696
1697%% \cite{han2020} implemented an LIF SNN on an Xilinx ZC706 FPGA for
1698%% classifying images from the MNIST dataset. They found that the network
1699%% could classify 161 frames per second with an accuracy of 97%, only
1700%% drawing 0.477W of power. For updating neurons they used a novel hybrid
1701%% algorithm that was neither fully synchronous nor fully
1702%% asynchronous. They benchmarked their FPGA implementation against an
1703%% SNN implemented in PyTorch running on an NVIDIA P100 GPU. Since
1704%% asynchronous simulation is not suitable for GPUs, it used a
1705%% synchronous update algorithm instead. While the GPU implementation was
1706%% faster than the FPGA implementation it used almost ten times as much
1707%% power per image. As the focus of their research was classification
1708%% accuracy rather than \textit{simulation accuracy} it is not directly
1709%% comparable to ours. Furthermore, their network was an order of
1710%% magnitude smaller than ours, all their synapses were non-recurrent,
1711%% and their neurons organized into four fully-connected layers.
1712
1713%% \cite{trensch2022} designed a simulator for a Zynq-7000 SoC
1714%% device. They measured its performance on a two-population
1715%% Izhikevich-type neurons containing 800 excitatory and 200 inhibitory
1716%% neurons. Their architecture divides the available FPGA into 16
1717%% identical processing blocks which each supports 64 neurons. Thus, in
1718%% total it supports up to 1,024 neurons per chip. The authors argued
1719%% that if a cluster of their devices were deployed, it could simulate
1720%% the Potjans-Diesmann microcircuit seven times faster than
1721%% realtime. Though this claim was not tested. Unlike our simulator,
1722%% theirs do not account for thalamic input.
1723
1724
1725\section*{Author Contributions}
1726BAL and AP designed the study. BAL implemented the SNN framework and
1727performed the experiments. BAL and AP analyzed the results and
1728co-wrote the paper.
1729
1730\section*{Funding}
1731This work was supported by the Swedish Research Council through the
1732project ‘‘Building Digital Brains’’ (grant reference 2021-04579).
1733
1734\section*{Acknowledgments}
1735TODO
1736
1737
1738\section*{Supplemental Data}
1739 \href{http://home.frontiersin.org/about/author-guidelines#SupplementaryMaterial}{Supplementary Material} should be uploaded separately on submission, if there are Supplementary Figures, please include the caption in the same file as the figure. LaTeX Supplementary Material templates can be found in the Frontiers LaTeX folder.
1740
1741\section*{Data Availability Statement}
1742
1743All source code will eventually be available under a permissible Open
1744Source license.
1745
1746
1747\bibliographystyle{Frontiers-Harvard}
1748\bibliography{frontiers}
1749
1750\end{document}