Fast Algorithms for Spiking Neural Network Simulation with FPGAs

Björn A. Lindqvist and Artur Podobas
KTH Royal Institute of Technology
Stockholm, Sweden

Abstract

Using OpenCL-based high-level synthesis, we create a number of spiking neural network (SNN) simulators for the Potjans-Diesmann cortical microcircuit for a high-end Field-Programmable Gate Array (FPGA). Our best simulators simulate the circuit 25% faster than real-time, require less than 21 nJ per synaptic event, and are bottle-necked by the device’s on-chip memory. Speed-wise they compare favorably to the state-of-the-art GPU-based simulators and their energy usage is lower than any other published result. This result is the first for simulating the circuit on a single hardware accelerator. We also extensively analyze the techniques and algorithms we implement our simulators with, many of which can be realized on other types of hardware. Thus, this article is of interest to any researcher or practitioner interested in efficient SNN simulation, whether they target FPGAs or not.
Keywords: cortical microcircuit, fpga, hls, hpc, opencl, simulation, spiking neural networks

1 Introduction

natkey1 outline what they see as the four biggest challenges for human brain simulation; scale, complexity, speed, and integration. Scale refers to the enourmous size of the brain – billions of neurons and trillions synapses – which is difficult to simulate at acceptable speeds even for the fastest supercomputers. There are two reasons for this difficulty. First, the conceptual mismatch between how brains and computers operate; the former can be viewed as collections of billions of processing nodes, communicating with sparse events called spikes (ghosh2009third), while the latter is composed of imperative programs, vulnerable to the von Neumann bottleneck (efnusheva2017survey), and executed on general-purpose devices (GPUs or GPUs). Power in-efficiency is the second reason. General-purpose devices are jacks of all trades and can execute a wide variety of workloads, ranging from word processing to scientific simulations. Their generality comes at a high price and executing a single instruction consumes up to three orders of magnitude more energy than the computation itself (jouppi2018motivation). This causes brain simulations to consume far more energy than the roughly 20 Watts a human brain draws (versace2010brain). Finally, the 2004 end of Dennard’s scaling (bohr200730), and the impending termination of Moore’s law (theis2017end), forces researchers to reconsider how to compute efficiently in a post-Moore future. Among the more promising options in a post-Moore world is the use of reconfigurable systems such as Field-Programmable Gate Arrays (FPGAs) (kuon2008fpga).

FPGAs, along with their siblings, Coarse-Grained Reconfigurable Systems, CGRAs (podobas2020survey), belong to the reconfigurable family of computing devices. Unlike traditional general-purpose processors (CPUs and GPUs), their underlying compute fabric is composed of a large number of reconfigurable blocks of different types. The most common blocks are look-up tables (LUTs, often several hundred thousands), digital signal processing (DSPs – capable of tens of TFLOP/s), or on-chip random-access memory (BRAM – tens-to-hundreds of MB) (langhammer2021stratix; murphy2017xilinx). These resources allow designers to create custom hardware that sacrifice generality for better performance and lower energy consumption than general-purpose devices, as well as transcending the latters’ limitations (i.e., the mentioned von Neumann-bottleneck). Importantly, these devices can, and often already are (sano2023essper; meyer2023multi; boku2022cygnus), live side-by-side in HPC nodes and can be reconfigured before runtime (vipin2018fpga): when the user is running a brain simulation, the FPGA will be configured as an efficient brain simulator, while for a different application a different accelerator will be used. In short, FPGA facilitates the use of special hardware accelerators during application runtime but is general enough so that different accelerators can be configured between applications. This makes them an attractive choice for neuroscience simulation since they can be reconfigured for different neuron, synapse, and axon models, which dedicated neuromorphic ASIC hardware, whose silicon is immutable, cannot be.

Historically, FPGAs have been designed using low-level hardware description languages (HDLs) such as VHDL and Verilog (perry2002vhdl). These languages have a steep learning curve and require specialist knowledge to be comfortable with. However, with the increased maturity of High-Level Synthesis (HLS) tools in the last two decades (nane2015survey), there has been a resurgence of interest in using FPGAs for HPC. HLS allow designers to describe hardware in relatively high-level languages such as C and C++ whose learning curves are shallower (podobas2017evaluating; zohouri2016evaluating). For example, FPGAs have been used to accelerate Computational Fluid Dynamics (karp2021high; faj2023scalable), Quantum circuit simulations (podobas2023q2logic; aminian2008fpga), Molecular Dynamics (sanaullah2018unlocking), N-Body systems (del2018scalable; menzel2021strong; huthmann2019scaling), and much more, demonstrating advantages over alternative solutions.

In this work, we use of HLS to design simulators for the Potjans-Diesmann cortical microcircuit (potjans2014). While there is ample prior work on FPGA-based neuromorphic systems (see LABEL:sec:sota for related work), our system is (to the best of our knowledge) the most energy-efficient simulator of the Potjans-Diesmann circuit in existence (25 nJ/event), while reaching a faster-than-realtime ( $\approx$ 1.2x) simulation speed on a single FPGA. We use the Intel’s OpenCL SDK for FPGA HLS toolchain (czajkowski2012opencl) to design our simulators, but our designs are modular enough to easily be ported to other HLS-based systems (e.g., Vivado (o2014xilinx) or OneAPI), and other FPGAs. Our contributions are:

•

The first simulators of the previously mentioned circuit on a single FPGA, running faster than real-time.
•

The most energy-efficient simulators for the circuit when measured by energy per synaptic event.
•

The presentation and analysis of the algorithms, thought processess, trade-offs, and lessons learned, while designing these simulators.
•

An empirically motivated analysis on what hardware features are required to simulate the circuit even faster than what we are capable of.

The rest of this article is structured as follows. In section 2, we discuss SNNs in general and the microcircuit in particular. We explain how FPGAs work and we briefly introduce the HLS design methodology. In LABEL:sec:ssn-sim we discuss SNN simulation and present the algorithms and ideas underlying our simulators. The utility of many of the ideas have already been demonstrated in other parts of Computer Science, but not in connection with SNN simulation. Hence, we believe they deserve a thorough treatment. We evaluate many different variants and parametrizations of our simulators in LABEL:sec:results. Finally, in LABEL:sec:disc we put our results in perspective and compare them with the state-of-the-art.

2 Material and Methods

We begin with an overview of spiking neural networks (SNNs) before discussing the Potjans-Diesmann cortical microcircuit, an SNN for simulating a small part (microcircuit) of the mammalian brain. In section 2.3 we explain how FPGAs works and what makes them different from from conventional hardware. In sections sections 2.4, 2.5 and LABEL:sec:ocl-fpga we introduce HLS and how we use OpenCL for HLS.

2.1 Spiking Neural Networks

An SNN is an artificial neural network (ANN) that transfers signals in time-dependent bursts, i.e. spikes. Unlike other ANNs, SNNs are designed with biological plausibility in mind, making them useful for neuroscience. SNNs are usually modelled as directed (multi-)graphs, where vertices represent neurons and edges synaptic connections between neurons. Neurons have a membrane potential that varies over time. When the potential exceeds a threshold the neuron discharges – spikes – and sends current via its synapses to its neighbours which they receive after a synapse-specific delay. The amount of current as well as the transfer time is synapse-specific (han2020). The neuron’s statefulness and the non-differentiable, discontinuous signal transfer function are two fundamental aspects distinguishing SNNs from other ANNs. While these basic operating principles are enough to describe most SNNs, SNNs vary in neuron model and other parameters. In this work, we use the basic leaky integrate-and-fire (LIF) neuron model, defined by

RC\frac{\mathrm{d}u}{\mathrm{d}t}=u_{\mathrm{rest}}-u(t)+RI(t).

(1)

The equation describes the membrane potential over time. The variables $R$ and $C$ are the resistance and capacitance of the membrane, $u(t)$ its potential at time $t$ , $I(t)$ the amount of current it receives at time $t$ from its neighbours, and $u_{\mathrm{rest}}$ its resting potential. With $\tau_{m}=RC$ , and $u(0)=u_{\mathrm{rest}}=0$ the solution to the equation is

u(t)=RI(t)-RI(t)e^{-\frac{t}{\tau_{m}}}.

(2)

Solving the equation using forward Euler produces a recurrent, discrete representation of the neuron’s potential over time. With $\Delta t\to 0$ ,

	$\displaystyle\tau_{m}\frac{u(t+\Delta t)-u(t)}{\Delta t}=-u(t)+RI(t)$		(3)
	$\displaystyle\implies u(t+\Delta t)=u(t)+\frac{\Delta t}{\tau_{m}}\left(-u(t)+% RI(t)\right)$		(4)
	$\displaystyle\implies u(t+\Delta t)=\left(1-\frac{\Delta t}{\tau_{m}}\right)u(% t)+\frac{\Delta tR}{\tau_{m}}I(t).$		(5)

The solution forms the basis of step-wise simulation of LIF neurons.

After a neuron spikes it enters its refractory state. Its potential becomes fixed at $u_{\mathrm{reset}}$ and it ceases to respond to stimuli for a duration controlled by the $\tau_{\mathrm{ref}}$ parameter. Usually, the refractory period is in the order of milliseconds and for simplicity one sets $u_{\mathrm{reset}}=u_{\mathrm{rest}}=0$ . With $r(t)$ denoting how long the neuron will say refractory at time $t$ , $u_{\mathrm{thr}}$ the neuron’s spiking threshold, and $\Delta t$ arbitrarily set to one, we can incorporate refractoriness into the LIF model:

	$\displaystyle r(t+1)$	$\displaystyle=\begin{cases}\tau_{\mathrm{ref}}&\mathrm{if}\,u(t+1)\geq u_{% \mathrm{thr}}\\ r(t)-1&\mathrm{elseif}\,r(t)>0\\ 0&\mathrm{otherwise}\\ \end{cases}$		(6)
	$\displaystyle u(t+1)$	$\displaystyle=\begin{cases}\left(1-\frac{1}{\tau_{m}}\right)u(t)+\frac{R}{\tau% _{m}}I(t)&\mathrm{if}\quad r(t)=0\\ 0&\mathrm{otherwise}\end{cases}$		(7)

2.1.1 Network topology

Refer to caption — Figure 1: A fully-connected neural network with two hidden layers, two input neurons, and three output neurons

A major difference between conventional ANNs and SNNs is that the former often are layered; all neurons in one layer only receives inputs from neurons in the previous layer and only sends outputs to neurons in the following layer. If all neurons in a layer send output only to all neurons in the following layer the layer is said to be fully-connected and its signal-transfer can be represented as a matrix-vector multiplication. SNNs can similarily be layered and it has been found to be a good approach for classification (zheng2021). But it is not suitable for simulation as the neurons in real brains are not organized into fully-connected layers. Instead, their topology is “chaotic” and full of recurrent connections, self-loops (autapses), and multiple edges (multapses). This has far-reaching consequences for what data structures are appropriate for SNNs. An adjacency matrix, for example, is not enough to represent their topological richness.

2.2 Potjans-Diesmann’s Microcircuit

Name	Value	Description (Unit)
$\Delta t$	0.1	Time step duration (ms)
$C_{m}$	250	Membrane capacity (pF)
$\tau_{m}$	10	Membrane time constant (ms)
$\tau_{\mathrm{ref}}$	2	Refractory period (ms)
$\tau_{\mathrm{syn}}$	0.5	Postsyn. current time constant (ms)
$u_{\mathrm{rest}}$	-65	Resting and reset potential (mV)
$u_{\mathrm{thr}}$	-50	Spiking threshold (mV)
$v_{\mathrm{th}}$	8	Thl. neurons’ mean spiking rate (Hz)
$\omega_{\mathrm{ext}}$	0.15	Thl. spikes amplitude (mV)

Table 1: Microcircuit’s general and simulation parameters

$i$	Pop.	$N_{i}$	$K$	$u_{\mathrm{init}}$	$\omega_{i}$	$\delta_{i}$
1	L23/exh	20 683	1 600	$\mathcal{N}(-68.28,5.36)$	$\mathcal{N}(0.15,0.015)$	$\mathcal{N}(1.5,0.75)$
2	L23/inh	5 834	1 500	$\mathcal{N}(-63.16,4.57)$	$\mathcal{N}(-0.6,0.06)$	$\mathcal{N}(0.75,0.325)$
3	L4/exh	21 915	2 100	$\mathcal{N}(-63.33,4.74)$	$\mathcal{N}(0.15,0.015)$	$\mathcal{N}(1.5,0.75)$
4	L4/inh	5 479	1 900	$\mathcal{N}(-63.45,4.94)$	$\mathcal{N}(-0.6,0.06)$	$\mathcal{N}(0.75,0.325)$
5	L5/exh	4 850	2 000	$\mathcal{N}(-63.11,4.94)$	$\mathcal{N}(0.15,0.015)$	$\mathcal{N}(1.5,0.75)$
6	L5/inh	1 065	1 900	$\mathcal{N}(-61.66,4.55)$	$\mathcal{N}(-0.6,0.06)$	$\mathcal{N}(0.75,0.325)$
7	L6/exh	14 395	2 900	$\mathcal{N}(-66.72,5.46)$	$\mathcal{N}(0.15,0.015)$	$\mathcal{N}(1.5,0.75)$
8	L6/inh	2 948	2 100	$\mathcal{N}(-61.43,4.48)$	$\mathcal{N}(-0.6,0.06)$	$\mathcal{N}(0.75,0.325)$

Table 2: Population-specific parameters. The first four columns denote the index,

i

, the name, the size,

N_{i}

, and the number of thalamic connections,

K_{i}

, of the eight populations. The last three columns denote the Gaussians from which the neurons’ initial potential (mV), the neurons’ synapses amplitudes (mV), and delays (ms) of excitatory postsynaptic potential are sampled from. However, synapse amplitudes from population L23/exh to L3/exh are sampled from

\mathcal{N}(0.3,0.03)

and not

\mathcal{N}(0.15,0.015)

	L23/exc	L23/inh	L4/exc	L4/inh	L5/exc	L5/inh	L6/exc	L6/inh
L23/exc	0.1009	0.1689	0.0437	0.0818	0.0323	0.0000	0.0076	0.0000
L23/inh	0.1346	0.1371	0.0316	0.0515	0.0755	0.0000	0.0042	0.0000
L4/exc	0.0077	0.0059	0.0497	0.1350	0.0067	0.0003	0.0453	0.0000
L4/inh	0.0691	0.0029	0.0794	0.1597	0.0033	0.0000	0.1057	0.0000
L5/exc	0.1004	0.0622	0.0505	0.0057	0.0831	0.3726	0.0204	0.0000
L5/inh	0.0548	0.0269	0.0257	0.0022	0.0600	0.3158	0.0086	0.0000
L6/exc	0.0156	0.0066	0.0211	0.0166	0.0572	0.0197	0.0396	0.2252
L6/inh	0.0364	0.0010	0.0034	0.0005	0.0277	0.0080	0.0658	0.1443

Table 3: Probability that a random neuron in the population specified by the rows is connected to a random neuron in the population specified by the columns.

In 2014 potjans2014 compiled the results of a dozen empirical studies to create a full-scale spiking neural network microcircuit of one cortical column in the mammalian early sensory cortex. The microcircuit covers 1 $\mathrm{mm}^{2}$ of the cerebral cortex and consists of 77,169 LIF neurons grouped into four neocortical layers;¹¹1These layers are not analoguous to layers in conventional ANNs. L23, L4, L5, and L6.²²2Layer L2 and L3 merged and L1 omitted. Each layer is subdivided into one excitatory population that increases neural activity and one inhibitory population that decreases it. Around 300 million synapses connect the neurons.

The microcircuit is a balanced random network so that neural activity is balanced by excitatory neurons that inreases activity and inhibitory ones that dampen it (brunel2000). Connectivity and features are sampled from parametric probability distributions, rather than set explicitly. Table 2 and table 3 specify these distributions’ parameters. For example, the initial potential of every inhibitory neuron in L23 is set by sampling a Gaussian with mean -63.16 mV and standard deviation 4.57 mV and the expected number of synapses from population L23/inh to population L5/exc is $5,834\cdot 4,850\cdot 0.0755\approx 2$ million. Synapses are sampled with replacement – multiple synapses can connect the same neuron pair. While the neurons are arranged in in terms of neocortical layers, the layers’ connection probabilities show that the network’s topology is not layered; neurons in most populations can connect to neurons in any of the other populations.

In addition to the synapses within the cortical column, the circuit receives spikes from external neurons – thalamic input. Column $K$ in table 2 specifies the number of thalamic neurons a given population’s neurons receive spikes from, and parameter $v_{\mathrm{th}}$ in table 1 how frequently thalamic neurons spike.³³3potjans2014 set the parameter to 15 Hz, but we use the NEST model as a baseline, where it is set to 8 Hz. The expected number of thalamic spikes received per second by all neurons in a population is $v_{\mathrm{th}}K$ . For example, neurons in population L23/exc receive about $8\cdot 1,600=12,800$ thalamic spikes per second. The amplitude of all thalamic synapses is fixed at $\omega_{\mathrm{exh}}=0.15$ mV. As thalamic spikes can be computationally expensive to simulate, potjans2014 suggest approximating them with constant direct current injected at a rate of $v_{\mathrm{th}}K\omega_{\mathrm{exh}}\tau_{\mathrm{syn}}$ mV per second. In our simulator we model thalamic spikes, however.

The synaptic parameters are also scaled. The synapse amplitude by $w_{f}$ which is a function of the membrane time constant, $\tau_{m}$ , membrane capacity, $C_{m}$ , and the postsynaptic time constant, $\tau_{\mathrm{syn}}$ , that maps postsynaptic potential to postsynaptic current:⁴⁴4See hanuschkin2010 for the derivation.

$\displaystyle d$	$\displaystyle=\tau_{\mathrm{syn}}-\tau_{m}$	(8)
$\displaystyle p$	$\displaystyle=\tau_{\mathrm{syn}}\tau_{m}$	(9)
$\displaystyle q$	$\displaystyle=\tau_{m}/\tau_{\mathrm{syn}}$	(10)
$\displaystyle w_{f}$	$\displaystyle=\frac{C_{m}d}{p(q^{\tau_{m}/d}-q^{\tau_{\mathrm{syn}}/d})}% \approx 585$	(11)

The constants $p_{22}$ and $p_{11}$ define the membranes’ and presynaptic currents’ decay rate:

	$\displaystyle p_{11}$	$\displaystyle=\exp(-\Delta t/\tau_{\mathrm{syn}})\approx 0.82$		(12)
	$\displaystyle p_{22}$	$\displaystyle=\exp(-\Delta t/\tau_{m})\approx 0.99$		(13)

The injection of the presynaptic current is scaled by $p_{21}$ :

$\displaystyle\beta$	$\displaystyle=\tau_{\mathrm{syn}}\tau_{m}/(\tau_{m}-\tau_{\mathrm{syn}})$	(14)
$\displaystyle\gamma$	$\displaystyle=\beta/C_{m}$	(15)
$\displaystyle p_{21}$	$\displaystyle=p_{11}\gamma(\exp(\Delta t/\beta)-1)\approx 0.00036$	(16)

The subscripts match those found in the source code for the NEST simulator (plesser2015) and have no deeper meaning here.⁵⁵5https://github.com/nest/nest-simulator/blob/aa9907a5d5a7916eeeeca00c2e6584202702eb2a/models/iaf_psc_exp.cpp Taken together, this gives us the following discrete recurrences for the step-wise update of membrane potential, $u_{t}$ :

u_{t+1}=\begin{cases}u_{\mathrm{reset}}&\text{if}\quad r_{t}>0\\ p_{22}u_{t}+I_{t}p_{21}&\text{otherwise}\\ \end{cases},

(17)

presynaptic current, $I_{t}$ :

I_{t+1}=p_{11}I_{t}+T_{t}w_{f}\omega_{\mathrm{exh}},

(18)

and refractoriness, $r_{t}$ :

r_{t+1}=\begin{cases}\tau_{\mathrm{ref}}&\text{if}\quad u_{t+1}\geq u_{\mathrm% {thr}}\\ r_{t}-1&\text{elseif}\quad r_{t}>0\\ 0&\text{otherwise}\end{cases}.

(19)

The variable $T_{t}$ denotes the number of thalamic spikes received by the presynapse at time $t$ and is modelled as a Poisson distributed random variable with mean $v_{\mathrm{th}}K\Delta t$ .

Having presented the theoretical foundations for LIF SNNs and the specifics of the Potjans-Diesmann microcircuit, we now present the technologies implement our simulator with. We return to SNN simulation in LABEL:sec:ssn-sim where we both delve deep into simulation methods and present our simulators.

2.3 Field-Programmable Gate Arrays

An FPGA is a type of reprogrammable integrated circuit. First marketed by Altera and Xilinx in the 1980’s, FPGAs have found uses in many niches of the electronics industry; in avionics, in telecommunications, and in VLSI design because of their unique blend of performance, flexibility, and development costs (trimberger2015). FPGAs are not as performant as Application-Specific Integrated Circuits (ASICs), but ASICs are extremely expensive to develop which makes them cost-prohibitive unless the number of units produced runs in the tens of millions. Furthermore, ASICs are not reprogrammable and thus suitable only for the specific tasks they were developed for. Central Processing Units (CPUs), on the other hand, are flexible and inexpensive, but have low performance. Graphics Processing Units (GPUs) sit between CPUs and ASICs on the flexibility-performance spectrum. While GPUs can run any computation, they generally only excel at highly regular, numerically intensive computations. In particular, they do not handle divergent control flow well.

FPGAs trade performance for flexibility in a different manner than CPUs, GPUs, and ASICs. They consist of configurable blocks organized in a grid-like fabric and each block can be configured to compute a small and specific function. Like the logical and of two one-bit signals or comparing two eight-bit numbers. The number of configurable blocks varies enormously from one FPGA to another; high-end FPGAs can contain hundreds of thousands or even millions of blocks. For performance-sensitive workloads, FPGAs’ advantage is the lack instruction processing overhead. CPUs and GPUs load programs from memory and decode and process their instructions one after the other, while FPGAs merely pass data through a pre-configured, fixed-function circuit, similar to how ASICs operate.⁶⁶6It is of course true that superscalar and pipelined processors can handle many instructions in parallel. But the point stands; the overhead caused by the need to decode and dispatch instructions is large. However, the reprogrammability of FPGAs comes at a great cost and their raw performance cannot rival that of ASICs or GPUs unless algorithms are specifically designed for them. They also run at lower clock frequencies than comparable CPUs from the same generation. For example, our simulators run at around 600 MHz,⁷⁷7However, langhammer2023 ran a softcore microprocessor at over 770 MHz on the same FPGA which is close to the theoretical limit. while Intel Core i9 processors operate at over 3 GHz. To stay competitive, FPGA implementations need to carry out much more work per clock cycle than equivalent CPU implementations. These potential drawbacks aside, researchers have deployed FPGAs for performance sensitive workloads, with promising results (nguyen2022).

Bitstreams are files that configure FPGA blocks and specialized tools write them to the FPGA’s fabric. Bitstreams are to FPGAs what machine code are to microprocessors. Tools that take descriptions of the digital circuit the FPGA should implement and produce bitstreams are called synthesizers. The descriptions are usually written in hardware description languages (HDLs), but many tools support higher-level languages as well.

A configurable block is a logic block that functions as a small combinational circuit that computes the boolean function it was configured for. Depending on configuration, the same logic block can serve as an and-gate, or-gate, xor-gate, etc. Memory embedded adjacent to the block stores its configuration and the block’s output is retrieved from this memory. Because the blocks look up their results from memory they are called look-up-tables (LUTs). Figure 2 shows a LUT for a two-input and-gate built with a multiplexer and four memory bits. Flip-flops (FFs) are the FPGA’s basic memory blocks and work as small and fast RAMs. Each FF stores one bit, but multiple FFs can be combined into registers to store larger datums.

In addition to configuring the FPGA’s logic and memory blocks, bitstreams define how the blocks should be connected through the FPGA’s interconnect network. This is called routing and is one of the synthesizer’s most important tasks. Routing is critical to the design as the interconnect occupies most of the FPGA’s fabric (betz1999). Good routing should minimize the total length of the interconnect, the number of blocks required, and the length of the longest path connecting two blocks as the operating frequency of the design is bounded by this length. Routing is a challenging problem in computer science and finding high-quality routing of large designs is both difficult and time-consuming.

Due to their reconfigurability, digital logic implemented in LUTs is slower and requires more components than equivalent logic implemented in non-reconfigurable ASIC gates. Memories built using FFs have lots of overhead because FFs only store one bit. Therefore, modern FPGAs come with non-reconfigurable blocks for arithmetic and storage. One can view these blocks as small ASICs embedded in the FPGA that the designer connects via the configurable blocks. For example, our Agilex 7 FPGA have five different types of blocks; 487 200 Adaptive Logic Modules (ALMs) which functions as LUTs, 1 948 800 flip-flops, 7 110 M20K RAM blocks, two 18.432 Mb eSRAM blocks, and 4 510 DSPs. We describe these blocks in the following sections.

2.3.1 Adaptive Logic Module

The Adaptive Logic Module (ALM) in figure 3 is the basic building block of Intel’s family of FPGAs. Consisting of one fractured eight-input-LUT, four FFs, and two full-adders, it is a versatile multi-purpose block and can – depending on its configuration – compute two four-input boolean functions, one six-input boolean function, or perform four-bit addition with carry. It can also serve as a four-bit memory.

2.3.2 Digital Signal Processing

The Digital Signal Processing (DSP) block contains functions for multiplication, addition, subtraction, and accumulation. The block functions as the FPGA’s arithmetic logic unit (ALU). Agilex 7 has two types of DPSs; one for integer arithmetic and one for IEEE 754 floating-point arithmetic in single- and half-precision modes. Pipeline registers organized into three stages are contained within the DSP. Data can be routed through one or more of the stages or bypassed completely to achieve a given latency. This can be useful if one operand is available multiple cycles before the other.

2.3.3 Memory Hierarchy

In processor-based hardware, registers, caches, and main memory is organized into a fixed hierarchy. Applications must be structured around the memory hierarchy to optimally use it. Typically, data transfer between the hierarchy’s levels is implicit and not directly under the programmer’s control. On an FPGA the designer constructs the memory hierarchy from the available memory blocks and has fine-grained control over exactly where data is stored. This allows the memory system to be tailored to the application rather than the other way around.

The FPGA’s memory is on-chip if it is embedded in the FPGA fabric itself and off-chip otherwise. On-chip memory is much faster and smaller than off-chip memory. The Agilex 7 has three types of on-chip memory; Memory Logic Array Blocks (MLABs), M20Ks, and embedded SRAM (eSRAM). Scalars and shallow FIFOs are often stored in MLABs, small arrays and caches in M20Ks, and larger buffers in eSRAM. MLABs are not memory blocks per se, but instead a technique for using ALMs as memory. The Agilex 7 can combine ten unused ALMs into one 640-bit register. As many FPGA designs do not use all logic blocks, repurposing them can be very useful. MLABs have very low laencies since they are close to the logic that use them. Xilinx FPGAs implement the same concept under the name distributed RAM or LUTRAM. On-chip M20Ks are also known as block RAM (BRAM) and are much larger than MLABs. The Agilex 7 contains around seven thousand M20K each of which can hold 20 kbit of data, as their name suggets. In total, about 139 Mbit. The board also has two 36 Mbit embedded SRAM blocks. These have high bandwidth and high random transaction rates and compliment the other on-chip memory blocks. Unfortunately, the Intel OpenCL FPGA compiler cannot infer eSRAM so they are unused in this work. On the Agilex 7 four 8 GB DRAM sticks constitute the board’s off-chip memory. Accessing off-chip memory takes much longer than accessing on-chip memory, with access times measured in the hundreds of cycles.

In addition to controlling the types of memory used, the designer can also configure individual memory blocks. Multipum**, for example, can sometimes double memory throughput at the expense of significantly lowering the design’s operating frequency. Banking can increase expected memory access concurrency, but also increase stalls due to conflicts so the technique is best reserved for evenly distributed data.

2.4 High-Level Synthesis

Designers traditionally design circuits in Hardware Description Languages (HDLs) such as VHDL and (System) Verilog. With these the designer specifies the behaviour of all the circuit’s logic gates and flip-flops. The result is a Register-Transfer Level (RTL) design, so called because it models the register-to-register transfer the circuit’s signals. A synthesizer takes the RTL design and produces a low-level representation (netlist) of it which can be lowered further to create an ASIC or transformed into an FPGA bitstream. The synthesizer’s job is far from straight-forward and it must – among other things – place every component of the circuit on a two-dimensional grid and ensure that it can operate with the desired clock frequency.

While HDLs offer a great deal of control over the resulting circuit, they are low-level and lack support for many high-level programming constructs so using them can be tedious and error-prone. It often makes sense to work in a higher-level language instead. That workflow is called High-Level Synthesis (HLS) and is supported by many tools. For example, Intel’s FPGA software generates synthesizable Verilog code from designs coded in the imperative C-like language OpenCL.

The main worry of using HLS is that the hardware will not be as efficient as if an HDL had been used. As the designer works in an imperative, high-level language their view of the hardware may be obscured and not easy to visualize. Furthermore, the machine-generated HDL generated by HLS tools is often opaque and near impossible to understand. These fears may be unfounded, however, as many studies have found that HLS does not degrade performance and sometimes even improve it (lahti2018).

2.5 OpenCL

⬇

1__kernel void mul_sd(

2 __global float *A,

3 __global float *B,

4 __global float *C,

5 int N) {

6 for (uint i = 0; i < N; i++)

7 C[i] = A[i] * B[i];

9__kernel void mul_nd(

10 __global float *A,

11 __global float *B,

12 __global float *C,

13 int N) {

14 uint i = get_global_id(0);

15 if (i < N)

16 C[i] = A[i] * B[i];

17}\end{lstlisting}

18\caption{Single and multiple work item OpenCL kernels for element-wise vector product}

19\label{lst:kernels}

20\end{figure}

22\noindent In 2009 the Kronos Group published the first version of the

23OpenCL (Open Computing Language) standard \citep{munshi2009}. The goal

24was to replace all vendor-specific languages and application

25programming interfaces (APIs) with a common and portable standard for

26writing high-performance code across all kinds of

27devices. An OpenCL application (unlike one written in a competing

28technology such as CUDA) can run \textit{unmodified} on any device

29with a compliant OpenCL implementation. Today OpenCL now runs on

30millions of CPUs, GPUs, DSPs, and other computing devices.

32The OpenCL standard consists of three parts; a specification for a

33programming language, a host API, and a device API. The host API runs

34on the user’s main computing device (e.g., PC) and controls their

35accelerator (e.g., GPU), whose functionality is accesible through the

36device API. Thus, OpenCL prescribes both how the programmer

37should program the accelerator and how to manage it from an external

38device.

40The OpenCL language closely resembles C. It supports functions, loops,

41multiple variable scopes, aggregate data types, and many other

42programming constructs familiar to C programmers. A big difference

43from C is that OpenCL comes with three explicit pointer spaces;

44global, local, and private. These grant the programmer fine-grained

45control over data storage. What constitutes global, local, and

46private memory is device-specific. Generally the largest and

47slowest memory space is global, while the smallest and fastest is

48private. Global memory may be sized in gigabytes, while private memory

49may only be a few kilobytes. OpenCL has native support for SIMD

50types to make it easier to write algorithms for parallel

51hardware. Unlike C, OpenCL specifies the bit width of most builtin

52types -- an \verb!int! is always 32 bits and a \verb!long! 64 bits.

54OpenCL was designed with massively parallel architectures in mind

55(e.g., GPUs) and has builtin support for concurrency in the form of

56\textit{work items}. A work item is a small indivisible unit of work

57with, ideally, no dependencies to other work items. The OpenCL runtime

58can schedule independent work items to maximally exploit the

59hardware’s potential. Consider the kernels in listing

60\ref{lst:kernels} for computing the element-wise vector

61product. Launching the first kernel -- \verb!mul_sd! -- causes OpenCL

62to instantiate one kernel on one core which runs $N$ iterations of the

63for loop. Launching the second kernel, however, causes OpenCL to

64instantiate up to $N$ kernels, each of which runs on an available core

65and computes one iteration of the for loop. Since the second kernel

66runs more computations in parallel, it likely runs much

67faster. Sometimes the algorithm’s work cannot feasibly be separated

68into completely independent work items. For such situations, work

69items that must be synchronized can be organized into \textit{work

70 groups} which share local memory. However, synchronization is

71impossible between work groups.

73\subsection{OpenCL on Intel FPGAs}

74\label{sec:ocl-fpga}

75\begin{figure}

76 \begin{lstlisting}[style=opencl]

77for (uint i = 0; i < N; i++) {

78 B[i] = h(g(f(A[i])));

79}\end{lstlisting}

80 \caption{A loop-nest that could benefit from pipeline

81 parallelism. The functions \texttt{f}, \texttt{g}, and \texttt{h} are

82 assumed to be short and inlineable.}

83 \label{lst:para}

84\end{figure}

85\begin{figure}

86 \begin{lstlisting}[style=opencl]

87channel uint jobs

88 __attribute__ ((depth(512)));

89__kernel void consumer() {

90 while (true) {

91 uint job =

92 read_channel_intel(jobs);

93 if (!msg)

94 break;

95 process_job(job);

96 }

97}

98__kernel void producer(uint N)

99 for (uint i = 0; i < N; i++) {

100 uint job = create_job(i);

101 send_channel_intel(job);

102 }

103 send_channel_intel(0);

104}\end{lstlisting}

105 \caption{Producer-consumer kernels communicating via a channel}

106 \label{lst:chans}

107\end{figure}

108

109\noindent As FPGAs work very differently from processor-based

110hardware, Intel’s OpenCL implementation for FPGAs differs in important

111ways from other OpenCL implementations. One major difference concerns

112parallelism. CPUs and GPUs have multiple cores to run multiple

113computations in parallel. Thus, the workload of an OpenCL program

114organized into multiple work items can be mapped onto multiple

115cores. FPGAs do not have any cores in the usual sense and it is

116difficult for Intel’s OpenCL FPGA compiler to synthesize code

117structured around work items into efficient FPGA bitstreams. Instead,

118Intel recommends designers to structure OpenCL code as single-work

119item kernels, and to derive most of the parallelism from executing

120multiple loop iterations concurrently; a type of concurrency known as

121‘‘pipeline parallelism’’. For example, assume the three functions

122\verb!f!, \verb!g!, and \verb!h! in listing \ref{lst:para} are short,

123side-effect free, and represent computations that can be performed in

124less than one clock cycle. The FPGA compiler can create a three-stage

125pipeline for the loop, where combinational circuits for \verb!f!,

126\verb!g!, and \verb!h! constitute the pipelines three stages. This

127pipeline can process three iterations of the loop in parallel; while

128the \verb!h!-circuit processes data for the first, the

129\verb!g!-circuit processes data for the second, and the

130\verb!f!-circuit for the third iteration, and so on. While the latency

131of every iteration is three cycles, the throughput (initiation

132interval) is only one cycle and the latency for the loop as a whole is

133$N + 2$ since it takes 2 cycles to fill the pipeline. Though, this

134assumes that the latency of every function is fixed and

135predictable. If one function executes operations with variable

136latencies, such as memory accesses, the pipeline may need to

137stall. Moreover, pipeline parallelism cannot increase the throughput

138of the loop beyond one cycle -- for that one has to use other

139techniques.

140

141The FPGA compiler supports many compiler directives

142(‘‘pragmas’’) that can help it better optimize difficult

143loops or loops containing invariants it can not

144prove on its own. Two important directives are \verb!#pragma ivdep!

145and \verb!#pragma unroll N!. The first tells the compiler that the

146loop contains no loop-carried dependencies and therefore it

147can reorder the loop iterations as it pleases. The second tells it to

148duplicate the loop body $N$ times. This means that the resulting

149circuit will use $N$ times as many gates, but, potentially, run $N$

150times faster. Other compilers ignore these directives.

151

152The FPGA compiler extends OpenCL with syntax for declaring channels

153for inter-kernel communication. Channels resemble the pipes feature in

154OpenCL 2.0 which the FPGA compiler does not support. Channels are

155implemented in on-chip memory as first in first out (FIFO) buffers of

156the desired depths and are very fast. \Cref{lst:chans} shows a simple

157producer-consumer example, where one kernel calls \verb!send_channel_intel!

158to put jobs on the FIFO and the other calls \verb!read_channel_intel!

159to remove them. Both functions block if the FIFO is full or empty.

160

161%% \Cref{fig:opencl} shows an entry point (externally callable function)

162%% to an OpenCL program -- a \textit{kernel} -- that computes the

163%% element-wise product of two vectors. The keyword \verb!__kernel!

164%% denotes that the function \verb!mul_ab! is an entry point and

165%% \verb!__global! that the associated pointers reside in the global

166%% memory space. What constitutes global memory is device dependent, but

167%% is generally the device’s largest and slowest memory. It can be

168%% contrasted with the local memory (denoted by the keyword

169%% \verb!__local!) which is both considerably smaller and faster to

170%% access than the global memory.

171

172

173%% The kernel’s \verb!#pragma! lines are compiler-specific directives

174%% that hints to the compiler how to synthesize the code. This is useful

175%% if the code contains invariants the compiler cannot prove or if the

176%% code is difficult for the compiler to optimize. The first directive

177%% \verb!#pragma ivdep! tells the compiler that there is no loop-carried

178%% dependencies in the loop nest, allowing the compiler to reorder the

179%% loop iterations or execute them in parallel. The second directive

180%% \verb!#pragma unroll 4! instructs the compiler to unroll the loop four

181%% times. This means that the resulting circuit will use four times as

182%% much hardware, but will potentially run four times as fast. Note that

183%% these directives have no effect when the code is compiled with a

184%% compiler other than with Intel’s FPGA SDK for OpenCL compiler.

185

186%% As OpenCL is tailored to take advantage of massively parallel

187%% architectures it supports dividing work loads into work groups and

188%% work items.

189

190%% \subsection{Spiking Neural Network}

191

192%% The spiking neural network (SNN) we simulate is the 1 $\mathrm{mm}^3$

193%% model of the sensory cortex developed by \cite{potjans2014cell}. It

194%% consists of 77,169 neurons grouped into four layers, each of which is

195%% subgrouped into one excitatory and one inhibitory population. About

196%% 300 million synapses connect the neurons randomly with

197%% population-specific probability densities. The neurons’ membrane

198%% voltage, $V$, is updated according to the leaky integrate-and-fire

199%% (LIF) differential equation;

200%% \begin{equation}

201%% \tau_mU’(t) = -U(t) + R_mI_{\in}(t).

202%% \end{equation}

203%% where $\tau_m$ and $R_m$ represents the time and resistance of the

204%% neuron’s membrane, and $V_{rest}$ the neuron’s resting voltage. These

205%% are all the same for all neurons in the model.

206

207%% When the membrane’s voltage crosses the threshold $V_{thresh}$ the

208%% neuron emits a spike

209

210

211\section{SNN Simulation}

212\label{sec:ssn-sim}

213

214\begin{figure}

215 \begin{lstlisting}[style=pseudocode]

216for t in range(n_tics):

217 for i in range(n_neurons):

218 N[i] = update_neuron(N[i])

219 if spikes(N[i]):

220 Q = enqueue(Q, i, t)

221 for i in range(n_neurons):

222 c = collect(Q, i, t)

223 N[i] = update_psc(Q, c)

224 Q = dequeue(Q, i, t)\end{lstlisting}

225 \caption{Synchronous SNN simulation}

226 \label{lst:basic}

227\end{figure}

228

229\noindent Having reviewed SNNs, the Potjans-Diesmann microcircuit and

230our implementation tools, we now discuss methods for simulating SNNs

231efficiently. We explore SNN simulation in general before introducing

232the ideas and algorithms we use to optimize our simulators. The end

233result is a taxonomy of twelve simulator families, grouped by their

234algorithms and implementation styles. We illustrate most of our points

235with Pythonesque pseudo-code, extended with two keywords;

236\verb!parfor! and \verb!atomic!. The former for loops whose iterations

237are independent and therefore can be executed in parallel. The latter

238for operations that must execute as one indivisible unit. Text in

239\verb!{brackets}! explains what the pseudo-code should do at that

240point.

241

242Broadly speaking, SNN simulation can be categorized on whether it is

243synchronous (time-step**) or asynchronous

244(event-driven). Synchronous simulation updates the state of every

245simulated object at every tick of a clock, regardless of whether it is

246necessary or not \citep{brette2007}. Asynchronous simulation only

247updates simulated objects whey they receive external stimuli, i.e.,

248events. In an asynchronous SNN simulation, spike emission and

249reception constitute the events, since the state of a neuron at a time

250between two events can be calculated easily (and generally is

251unimportant). Hybrid strategies, with asynchronous updates for some

252parts of the SNN and synchronous updates for other parts, are

253possible.

254

255For SNN simulation, asynchronicity offers precision advantages. The

256neuron’s membrane potential only has to be recomputed when it receives

257spikes. If this happens only rarely the simulator can afford to use

258more sophisticated methods than (repeated) forward Euler to solve the

259LIF equation (\cref{eqn:lif}). Also, spikes can be sent and received

260at any time and do not have to be confined to a discrete grid. For

261example, a synchronous simulator with a 0.1 ms time steps may not be

262able to represent spikes sent at times that are not multiples of 0.1

263ms. Asynchronicity may also have scalability advantages as the

264neurons’ states do not have to be synchronized to a global

265clock. However, asynchronicity entails irregular computation and

266irregular memory accesses -- traits that are extremely undesirable on

267modern hardware. Furthermore, dense SNNs have many more synapses than

268neurons which results in cascading effects. One spiking neuron ‘‘wakes

269up’’ thousands of its neighbours, causing them to spike and in turn

270wake up thousands of their neighbours. On the whole, the event

271processing overhead may dominate over whatever computational savings

272not being bound by a global clock brings. For example,

273\cite{pimpini2022} presents a sophisticated asynchronous CPU-based SNN

274simulator that supports speculative execution so that future neuron

275states can be computed in advance and then rolled back if received

276spikes (‘‘stragglers’’) invalidates their predicted state. While the

277authors measured accuracy improvements, the performance was

278lackluster. For these reasons, most high-performance SNN simulators,

279including ours, are synchronous.

280

281Synchronous simulation splits the simulation task into two phases; one

282for updating neurons and one for transferring spikes. Listing

283\ref{lst:basic} shows the basic algorithm. It uses two data

284structures; an indexable data structure, \verb!N!, to stores the state

285of all neurons and a queue, \verb!Q!, to keeps track of spikes in

286flight. For every neuron the algorithm calls \verb!update_neuron! to

287update its membrane potential, presynaptic current, and refractoriness

288in accordance with \cref{eqn:mem,eqn:psc,eqn:ref}. If \verb!spikes!

289indicates that the neuron spikes, it enqueues the neuron’s index $i$

290and the current time step $t$ in \verb!Q!. In the next phase (line 6

291to 9) the algorithm calls \verb!collect! on the queue to aggregate

292current destined to the $i$:th neuron at time step $t$. The call to

293\verb!update_psc! adds the aggregated current to the neuron’s

294presynaptic current. Finally, the algorithm removes the current it

295just handled from the queue. In this listing we include the

296\verb!for t in range(n_tics)! loop which shows that the algorithm

297repeats the two phases \verb!n_tics! times. However, for brevity’s

298sake, we omit this outer loop in the following listings.

299

300The work required for updating the neurons’ state for one tick is in

301the order of $O(N)$, where $N$ is the number of neurons --

302i.e. linear. However, for transferring spikes it is $O(f\Delta

303tpN^2)$, where $p$ is the probability that two randomly chosen neurons

304are connected by one or more synapses, $f$ the average spiking rate,

305and $\Delta t$ the tick duration. So the transfer time is proportional

306to the network’s density and quadratic in $N$ -- i.e. for large $N$ it

307dominates. Furthermore, updating the membranes and presynapses

308require a handful of multiplications per neuron -- cheap on modern

309hardware -- whereas spike transfer requires expensive reads and writes

310to and from non-contiguous memory. Hence, we will focus on the

311transfer phase which is where synchronous simulators spend most of

312their time in the remainder of this section.

313

314\subsection{Pushing and Pulling}

315\label{sec:push-pull}

316\begin{figure}

317 \begin{lstlisting}[style=pseudocode]

318parfor i in range(n_neurons):

319 spikes = False

320 if R[i] == 0:

321 x = p22*U[i] + p21*I[i]

322 spikes = x >= U_thresh

323 if spikes:

324 x = 0

325 R[i] = t_ref_tics

326 U[i] = x

327 else:

328 U[i] = 0

329 R[i] -= 1

330 if spikes:

331 parfor j, d, w in syns_from(i):

332 atomic W[t + d, j] += w

333 I[i] = p11*I[i] + T[t, i]*wpsn

334 I[i] += W[t, i]\end{lstlisting}

335 \caption{One time step of push-based spike transfer}

336 \label{lst:push}

337\end{figure}

338

339\begin{figure}

340 \begin{lstlisting}[style=pseudocode]

341parfor i in range(n_neurons):

342 A[t, i] = False

343 if R[i] == 0:

344 x = p22*U[i] + p21*I[i]

345 if x >= U_thresh:

346 A[t, i] = True

347 x = 0

348 R[i] = t_ref_tics

349 else:

350 x = 0

351 R[i] -= 1

352 U[i] = x

353 s = 0

354 for j, d, w in syns_to(i):

355 if A[t - d, j]:

356 s += w

357 I[i] = p11*I[i] + T[t, i]*wpsn + s\end{lstlisting}

358 \caption{One time step of pull-based spike transmission}

359 \label{lst:pull}

360\end{figure}

361

362\noindent Researchers have identified ‘‘pushing’’ and ‘‘pulling’’ as

363two general strategies for designing algorithms for graph problems

364(\cite{besta2017}; \cite{grossman2018}; \cite{ahangari2023}). A

365push strategy transfers signals \textit{from a node to its neighbours}

366by writing to the neighbours incoming signal buffers. In contrast, a

367pull strategy transfers signals \textit{to a node from its neighbours}

368by swee** through the node’s neighbours and checking whether they

369have any signals to be delivered to the node. I.e., the receiver node

370has to ‘‘go and ask’’ its neighbours whether they have signals for

371them. SNN simulation is a graph problem involving node-to-node

372transfer of signals and it too can be characterized in terms of

373pushing and pulling. Listing \ref{lst:push} and \ref{lst:pull}

374illustrate the two strategies. In both listings, the values of the

375scalars \verb!p11!, \verb!p21!, and \verb!p22! come from

376\cref{eqn:p11p22,eqn:p21}. The arrays \verb!U!, \verb!I!, and \verb!R!

377contain the neurons’ membrane potentials, presynaptic potentials, and

378refractory counters. When the neuron’s membrane potential exceeds

379\verb!U_tresh! the neuron spikes and becomes refractory for

380\verb!t_ref_tics! tics. The expression \verb!T[t, i]! denote the

381number of thalamic spikes received by the $i$:th neuron at the $t$:th

382time step.

383

384The strategies differ in how they transfer spikes. The push strategy

385transfers them when \verb!spikes! indicates that a neuron spikes. It

386calls \verb!syns_from! to fetch an iterator of all synapses

387\textit{originating} from the $i$:th neuron. The three-tuples $(j, d,

388w)$ it retrieves represent the synapses; $j$ the index of the

389destination neuron, $d$ the delay in time steps, and $w$ the

390current. For every three-tuple it writes to an element of \verb!W!, a

391two-dimensional array buffering current to be delivered. The element

392\verb!W[t, i]! is the amount of presynaptic current the $i$:th neuron

393receives at the $t$:th time step. The \verb!syns_to! call in listing

394\ref{lst:pull} works exactly like the \verb!syns_from! call, except it

395returns all synapses \textit{terminating} at the $i$:th neuron and $j$

396in the three-tuples $(j, d, w)$ denotes the \textit{originating}

397neuron. Array elements \verb!A[t, i]! indicate whether the $i$:th

398neuron spikes at time $t$ or not. The size of the longest synapse

399delay is in practice bounded and small so both \verb!A! and \verb!W!

400can be implemented as statically sized wrap-around arrays -- a

401technique we cover in \cref{sec:sizing}. Note that every neuron can be

402handled independently so we use the \verb!parfor! construct for both

403algorithms’ outer loops.

404

405Pushing of synaptic current to their neighbours happens on line 16 and

40617 of the push strategy’s listing. Neurons spike infreqently, but when

407they do, the algorithm must retrieve all their synapses and write

408their current to the array \verb!W!. This part of the algorithm is

409performance-critical. First, the algorithm accesses all the neuron’s

410synapses which can be expensive, even if they are stored in contiguous

411memory. Second, the algorithm gathers and scatters data from and to

412uncorrelated indices of \verb!W!. These are costly operations since

413the accesses to \verb!W! cannot be coalesced or cached.\footnote{The

414memory addresses are too far apart for any coalescing or caching

415scheme to be efficient.} To make matters worse, at one time step

416multiple neurons can write to the same indices of \verb!W!. I.e.,

417\verb!W[t + d, j] += w! has to be executed atomically to avoid data

418races. As the model is densely connected, races are not uncommon and

419should be accounted for. Partition-awareness, as suggested by

420\cite{besta2017}, is not an option because of the density and

421randomness of the connections, making most of them remote and not

422local.

423

424\begin{figure}

425 \begin{lstlisting}[style=pseudocode]

426for i in range(n_neurons):

427 {Update neuron state as before}

428 if spiked:

429 Q = enqueue(Q, i)

430for i in contents(Q):

431 parfor j, d, w in syns_from(i):

432 W[t + d, j] += w

433Q = clear(Q)\end{lstlisting}

434\caption{Deferred push-based spike transfer}

435\label{lst:push2}

436\end{figure}

437

438The pull-based algorithm instead has the receiver neurons responsible

439or ‘‘pulling in’’ current. Spiking merely sets an element in \verb!A!

440to true and does not trigger current transfer. On subsequent time

441steps, neurons connected to that neuron checks if it spiked at time

442$t-d$ and, if so, adds its synaptic current. The upside of this

443algorithm is that it is synchronization-free; neurons do not write to

444shared memory. The major drawback is that every neuron at every time

445step must check all its incoming synapses to see from which of them it

446receives current. Additionally, the algorithm reads from scattered

447memory on line 18. The large amount of data it reads probably makes

448it inefficient, unless the number of computational units is

449large and memory reads are substantially cheaper than writes. Neither

450of which is true for our FPGA. Consequently, we focus on push-based

451spike transfer in this work.

452

453Listing \ref{lst:push2} shows a variant of push-based spike transfer

454that first collects spiking neurons and then transfers their synapses

455current in a dedicated phase. Collection can either be done with a

456marking array, at the expensive of wasting memory, or with a queue (as

457in the listing), at the expense of making loop iterations

458dependent. The choice depends on the target platform. Though,

459splitting the update and spike transfer into two phases reduces the

460number of cache conflicts which is advantagenous. Lines

46113 to 15 of the basic push algorithm can flush prefetched and cached

462parts of the \verb!U!, \verb!I!, and \verb!R! arrays.

463

464\subsection{Buffer Sizing and Wrap**}

465\label{sec:sizing}

466

467To reduce the size of the \verb!W! array we use ‘‘wrap-around

468indexing’’. The indices of the rows in \verb!W! written to in one time

469step lies within the interval $t + 1$ to $t + d_\mathrm{max} - 1$,

470where $d_\mathrm{max} - 1$ is the network’s largest synaptic delay and

471is small. Moreover, at time step $t$ rows $0$ to $t - 1$ will not be

472read again so that space can be reused. We do that by setting the

473number of rows in \verb!W! to $d_\mathrm{max}$ and use modular

474arithmetic to index rows. The expressions for accessing \verb!W! on

475line 15 and 17 in listing \ref{lst:push} become

476\verb!W[(t + d) % D_MAX, j]! and \verb!W[t % D_MAX, i]!,

477respectively. We choose a large enough value for $d_\mathrm{max}$ by

478evaluating the cumulative distribution function of the Gaussians we

479sample the slower excitatory synapses delays from, $\mathcal{N}(1.5,

4800.75)$:

481\begin{equation}

482 P(\mathcal{N}(1.5, 0.75) \leq 6.4) \approx 0.999999999968:

483\end{equation}

484This shows that with $\Delta t = 0.1$, 64 rows is more than enough

485since the probability of sampling a synaptic delay longer than 6.4 ms

486is astronomically low. It is also a power of two so we use masking to

487realize the modular arithmetic. With this scheme we also have to clear

488\verb!W[t, i]! after reading it to avoid double reads.

489

490Even with only 64 rows and assuming four-byte floats, the \verb!W!

491array still consumes $64\cdot4\cdot 77169 \approx 20$ megabytes which

492is more than we can fit in on-chip memory. We could store the array in

493off-chip memory -- which is plentiful -- or use half-precision two-byte

494floats instead. Neither solution is satisfactory. As we argued in

495\cref{sec:push-pull}, we need fast reads and writes to uncorrelated

496addresses of \verb!W! which off-chip memory doesn’t give us. We also

497prefer not to lower the numeric the precision as that makes the

498simulation less accurate. A third option is to store all spiking

499neurons in a queue and only activate a subset of their synapses at a

500given time step.

501

502\subsection{Just-In-Time Spike Transfer}

503\begin{figure}

504 \begin{lstlisting}[style=pseudocode]

505parfor i in range(n_neurons):

506 {Update U[i], R[i] as before}

507 spiked[i] = {True if neuron spiked}

508 I[i] = p11*I[i] + T[t, i]*wpsn

509for rt in range(D_MAX):

510 delay = (t - rt) % D_MAX

511 if delay < D_MAX - 1:

512 for n in enqueued_at(Q, rt):

513 syns = syns_from(n, delay)

514 parfor j, d, w in syns:

515 I[j] += w

516 else:

517 Q = dequeue(Q, rt)

518rt = t % D_MAX

519for i in range(n_neurons):

520 if spiked[i]:

521 Q = enqueue(Q, rt, i)\end{lstlisting}

522 \caption{Three-phase just-in-time spike transfer}

523\label{lst:jit}

524\end{figure}

525

526\noindent By activating all the neuron’s synapses at once, the push

527algorithm works harder than necessary. Obviously, all synapses must

528\textit{eventually} be activated, but \textit{right now} only synapses

529with a delay of one time step must be activated since the current they

530transfer will be read at the next time step. This observation leads us

531to a ‘‘lazy’’ just-in-time algorithm which keeps all spiking neurons

532in a queue for a fixed number of time steps. At every time step it

533activates those synapses whose current is read at the next time

534step. Thus, a neuron that spikes at time $t$ will have its

535one-tick-delay synapses activated at time $t + 1$, its two-tick-delay

536synapses at time $t + 2$, and so on. Listing \ref{lst:jit} sketches a

537three-phase algorithm built on this idea. The first phase (line 1 to

5384) updates \verb!U! and \verb!R! as before and marks spiking neurons

539in \verb!spiked!. The third phase (lines 14 to 17) enqueues the marked

540neurons with a ‘‘relative timestamp’’, \verb!rt!. The second phase

541(lines 5 to 13) scans the queue and calls \verb!enqueued_at! to fetch

542previously enqueued neurons. The \verb!syns_from! function works as

543before, but now has a second parameter to select synapses with the

544given delay. Suppose that the simulator simulates time step 100 and

545that \verb!rt! is 10 in one iteration of the loop. Since $26 \equiv

546(100 - 10) \mod d_{\mathrm{max}}$ ($d_{\mathrm{max}}=64$) all synapses

547with 26 time steps of delay of queued neurons with a relative

548timestamp of 10 will be activated. These neurons ought to have been

549stored at time step $t = 100 - 26 = 76$ which is indeed the case since

550$10 \equiv 74 \mod d_\mathrm{max}$. After the neurons have stayed in

551the queue for \verb!D_MAX - 1! time steps, the \verb!dequeue! call

552evicts them. Though with a sensible FIFO implementation eviction is a

553no-op so we omit it in future pseudo-code.

554

555How much memory do the just-in-time algorithm save? Through

556experimentation we found that the average number of spiking neurons

557per time step is 23, meaning that the expected number of

558elements in \verb!Q! is $23\cdot d_{\mathrm{max}}=1472$. We generously

559round it up to 4096 and, as neuron indices take four-bytes to store,

560the total size of the queue is 16 kilobytes, which comfortably fits in

561on-chip memory.\footnote{As in \cref{sec:sizing}, we could use the

562cumulative distribution function to show that 4096 elements is enough

563to make the risk of overflow virtually zero.}

564

565To improve the just-in-time algorithm we use ‘‘lanes’’. Essentially,

566we duplicate the \verb!W! array so that synapses of multiple spiking

567neurons can be activated simultaneously. The algorithm writes the

568synaptic current of the first neuron it handles in the first lane, of

569the second neuron in the second lane, and so on. Synaptic current of

570neurons in different lanes does not interfere. The update phase has to

571be adjusted accordingly and must sum the incoming current from all

572lanes. As each lane consumes about 320 kb of memory we can fit at most

57316 lanes.

574

575\subsection{Horizon-Based Spike Transfer}

576\begin{figure}

577 \begin{lstlisting}[style=pseudocode]

578parfor i in range(n_neurons):

579 I[i] += W[t % H, i]

580 W[t % H, i] = 0

581 {update U[i] and R[i] as before}

582 I[i] = p11*I[i] + wspn*T[t, i]

583for i in range(D_MAX / H):

584 rt = (t - H*i - 1) % D_MAX

585 d_from = H*i + 1

586 d_to = d_from + H

587 for n in enqueued_at(Q, rt):

588 syns = syns_from(

589 n, d_from, d_to)

590 parfor j, d, w in syns:

591 W[(d + t) % H, j] += w

592rt = t % D_MAX

593for i in range(n_neurons):

594 if spiked[i]:

595 Q = enqueue(Q, rt, i)\end{lstlisting}

596 \caption{Three-phase spike transfer with configurable horizon}

597 \label{lst:horizon2}

598\end{figure}

599

600\noindent The just-in-time algorithm requires a fair amount of

601bookkee**. It activates the spiking neuron’s synapses

602$d_{\mathrm{max}}$ times instead of just once, and the loop on line 10

603and 11 iterates many fewer times than the corresponding loop on line

60414 and 15 of the basic push algorithm. This loop is an important

605source of parallelism and running it many times with fewer iterations

606is much worse for performance than running it fewer times with many

607iterations. Our solution is to reintroduce a smaller version of the

608spike buffer whose number of rows, $h$ is a factor of $d_\mathrm{max}$

609so that when a neuron spikes we write to the buffer $d_\mathrm{max} /

610h$ times. Listing \ref{lst:horizon2} shows the concept. The

611\verb!syns_from! function now retrieves synapses of the neuron whose

612delay is within the range $d_\mathrm{from}$ to $d_\mathrm{to} -

6131$. Suppose $d_\mathrm{max}=64$, $t=100$ and $h=16$. The loop on lines

6145 to 13 iterates four times, the relative timestamps assumes the

615values 35, 19, 3, and 51, and the half-open intervals the values [1,

616 17), [17, 33), [33, 49), and [49, 65). Thus, the inner loop on line

617 10 to 14 activates all synapses with the given delays of the

618 neurons stored at the given relative timestamps. With this

619 scheme we trade-off on-chip memory for better concurrency.

620

621Note that we add the current from the spike buffer to the presynaptic

622potential on line 2, before we update the membrane’s potential and

623we add one to \verb!d_from! on line 7. This is necessary since the

624algorithm transfers spikes one time step later than the basic

625push algorithm.

626

627\subsection{Storing Synapses}

628\label{sec:syn-store}

629\begin{figure}

630 \begin{lstlisting}[style=pseudocode]

631def syns_from(n, d_from, d_to):

632 start = X[n, d_from]

633 end = X[n, d_to]

634 for i in range(start, end):

635 yield S[i]\end{lstlisting}

636\caption{Indexed access to synapse data}

637\label{lst:index}

638\end{figure}

639

640\noindent Previous sections’ pseudo-codes imply that it is very

641important that synapses can be queried by their sender neuron

642quickly. In particular, synapses should be stored so that one

643\verb!syns_from! call only accesses memory in one contiguous chunk. We

644fullfill these goals by storing the synapses as an array sorted on

645sender neuron, receiver neuron, and delay that we query with a

646prebuilt index keyed on sender neuron and delay. This means that

647finding all synapses for a particular neuron or neuron-delay

648combination requires only two index lookups; one for the first synapse

649and one for the last synapse. Moreover, as the index implicitly stores

650the sender neuron, we only need eight bytes to represent a synapse;

651four for the single-precision weight (32 bits), three for the

652destination neuron’s id (17 bits), and one for the synaptic delay (6

653bits).\footnote{With just-in-time transfer the delay is also stored

654implicitly.} Listing \ref{lst:index} shows how \verb!syns_from!

655performs index look-ups. The two-dimensional array \verb!X!

656represents the index and \verb!S! the synapse array so that

657\verb!X[n, d_from]! contains the index in \verb!S! where the first

658synapse of neuron \verb!n! with delay \verb!d_from! is

659stored. Ideally, the synapses should also be prefetched.

660

661\begin{figure}

662 \begin{lstlisting}[style=pseudocode]

663parfor c in range(N_CLS):

664 for i in contents(Q):

665 syns = syns_from(i, c)

666 for j, d, w in syns:

667 W[t + d, j] += w

668Q = clear(Q)\end{lstlisting}

669\caption{Spike transfer with partitioned synapses}

670\label{lst:parts}

671\end{figure}

672

673\subsection{Disjoint Synapses}

674\label{sec:disjoint}

675

676A headache for push-based spike transfer is the data race caused by

677multiple synapses delivering current received by the same neuron at

678the same time step. This is why we can’t use \verb!parfor! on line 5

679of listing \ref{lst:push2}, for example. We cannot completely solve

680this problem, but we can alleviate it by partitioning the synapses

681into disjoint \textit{classes}, so that synapses of different classes

682never trigger writes to the same memory addresses at the same

683time. Then every class can be handled by a separate thread. The

684pseudocode in listing \ref{lst:parts} modifies the transfer phase of

685the deferred push algorithm from \ref{lst:push2} to exploit of this

686idea.\footnote{We can of course us the same technique to improve all

687other algorithms we have discussed.} The constant \verb!N_CLS!

688denotes the number of synapse classes and the extra argument to

689\verb!syns_from! which synapse class to query. Many ways of

690partitioning the synapses into disjoint classes are possible. For

691example, by delay so that one thread writes one-time step synapses,

692the next thread two-time steps synapses, and so on. Another by

693destination neuron so that one thread handles synapses going to

694neurons whose index is between 0 and 199, another those

695between 200 and 399, and so on.

696

697\begin{figure}

698 \begin{lstlisting}[style=pseudocode]

699o0 = X[i][0]

700o1 = X[i][D_MAX]

701parfor c in range(N_CLS):

702 for o in range(o0 + c, o1, N_CLS):

703 j, d, w = S[o]

704 W[(t + d) % D_MAX, j] += w\end{lstlisting}

705 \caption{Spike transfer over interleaved synaptic storage.}

706 \label{lst:parts2}

707\end{figure}

708

709We choose to partition by the destination neuron’s congruence

710class. The method has low overhead and evenly distributes the synapses

711over the classes since the least significant bits of the neuron index

712is almost random. To retain the contiguous storage we interleave the

713synapses. That is, if a neuron’s synapses are found at indexes $o_0$

714to $o_1 - 1$, then all synapses of class $c$ are stored at indices

715$o_0 + n_ci + c$, where $n_c$ is the number of classes and $i$ is a

716non-negative integer. Interleaving synergizes with banked memory

717common on many GPUs. On our FPGA, it means that we can have

718conflict-free dedicated memory ports for every synapse class. The

719method causes some memory waste however. For example, if there are

720four classes and all destination neuron indices of all synapses of

721some neuron happen to be congruent with $2 \mod 4$, then all indices

722other than $o_0 + 4i + 2$ will be vacant. I.e., 75% of the space will

723go to waste. In general, the memory consumption for storing a neuron’s

724synapses grows from $n_ss$, where $n_s$ is its number of synapses and

725$s$ the synapse size to $n_cls$, where $l$ is the number of synapses

726in the neuron’s largest class. In practice, the memory waste

727is manageable; in the order of 5-30% depending on the number of

728classes and horizon. The more the classes and the shorter the horizon

729the more uneven the classes become and the more waste. We do not mark

730vacancies and instead fill them with synapses carrying no current and

731terminating at idempotent neurons. This way, the code in listing

732\ref{lst:parts2} does not need to check whether the index is vacant.

733

734\subsection{More on Parallelism}

735\label{sec:more-para}

736\begin{figure*}

737 \begin{subfigure}{0.48\textwidth}

738 \begin{lstlisting}[style=pseudocode]

739parfor i in range(n_neurons):

740 {Update neuron state as before}

741 if spikes:

742 write(to_transfer, i)

743write(to_update, DONE)

744read(to_update)\end{lstlisting}

745 \caption{Update kernel}

746 \end{subfigure}

747 \begin{subfigure}{0.48\textwidth}

748 \begin{lstlisting}[style=pseudocode]

749while True:

750 i = read(to_transfer)

751 if i == DONE:

752 break

753 parfor j, d, w in syns_from(i):

754 W[t + d, j] += w

755 write(to_update, True)\end{lstlisting}

756 \caption{Transfer kernel}

757 \end{subfigure}

758 \caption{Basic spike transfer using two kernels}

759 \label{lst:multi}

760\end{figure*}

761

762\noindent Our algorithms’ source of parallelism is the \verb!parfor!

763keyword. Iterations of such loops are independent and can be executed

764concurrently in duplicated hardware. The replication is realized

765differently on different targets. On CPUs we use SIMD and on GPUs the

766single-instruction multiple threads (SIMT) execution model

767‘‘automatically’’ parallelizes the computation. With Intel’s OpenCL

768SDK we use the \verb!#pragma unroll N! and \verb!#pragma ivdep!

769compiler directives to instruct the compiler to replicate loop

770hardware. Essentially, this ‘‘widens’’ data paths, allowing more data

771to be processed in parallel. But FPGAs can also run more data paths in

772parallel, akin to how multiple CPU cores can run multiple threads. We

773implement this parallelism by dividing the algorithms over multiple

774concurrent communicating kernels.

775

776Listing \ref{lst:multi} restructures the deferred push-based spike

777transfer algorithm in \ref{lst:push2} as two kernels. It uses two

778blocking FIFOs, \verb!to_transfer! and \verb!to_update!, that the

779kernels can \verb!read! and \verb!write! to. When a neuron spikes, the

780update kernel sends its index to the spiking kernel which activates

781that neuron’s synapses. Hence, the neuron update and spike transfer

782phases run in parallel. When the update kernel has updated all neurons

783it sends a \verb!DONE! message and waits for a reply from the spike

784transfer kernel. When it receives one, it knows that the spike

785transfer kernel has transferred all spikes and it proceeds to the next

786time step. As inter-kernel communication is not well-supported in

787OpenCL, we implement it using the low-latency Intel-specific channel

788extension.

789

790\begin{figure*}

791 \begin{subfigure}{0.48\textwidth}

792 \begin{lstlisting}[style=pseudocode]

793for i in range(n_neurons):

794 I[i] += read(to_update)

795A = []

796for i in range(n_neurons):

797 {Update neuron state as before}

798 if spiked[i]:

799 A.append(i)

800write(to_transfer, A)\end{lstlisting}

801 \caption{Update kernel}

802 \end{subfigure}

803 \begin{subfigure}{0.48\textwidth}

804 \begin{lstlisting}[style=pseudocode]

805for i in range(n_neurons):

806 write(to_update, W[t % H, i])

807 W[t % H, i] = 0

808for i in range(D_MAX / H):

809 rt = (t - H*i - 1) % D_MAX

810 for n in enqueued_at(Q, rt):

811 syns = syns_from(n, rt + H)

812 parfor j, d, w in syns:

813 W[(d + t) % H, j] += w

814for n in read(to_transfer):

815 Q = enqueue(Q, t % D_MAX, n)\end{lstlisting}

816 \caption{Transfer kernel}

817 \end{subfigure}

818 \caption{Horizon-based spike transfer using two kernels}

819 \label{lst:multi-horiz}

820\end{figure*}

821

822Listing \ref{lst:multi-horiz} similarily restructures the

823horizon-based algorithm as two kernels. Since the \verb!W! array is

824stored in local memory, the multi-kernel approach requires an explicit

825synchronization step (lines 1 and 2, and 1 to 3) for sending

826presynaptic current from the transfer kernel to the update kernel over

827the \verb!to_update! channel. After synchronization, the transfer

828kernel can freely write synaptic current from queued neurons to

829\verb!W!. While updating neurons, the update kernel keeps track of

830which of them spiked and sends them to the transfer kernel. The

831transfer kernel adds them to its queue \verb!Q!.

832

833\subsection{Implementations}

834\label{sec:impl}

835\begin{figure}

836 \center

837 \includegraphics[width=\linewidth]{tree}

838 \caption{Taxonomy of our simulator implementations}

839 \label{fig:taxonomy}

840\end{figure}

841\begin{figure*}

842 \center

843 \includegraphics[width=\linewidth]{impl_gmem}

844 \caption{Schematic of the monolithic gmem implementations}

845 \label{fig:gmem}

846\end{figure*}

847\begin{figure*}

848 \center

849 \includegraphics[width=\linewidth]{impl_horizon}

850 \caption{Schematic of the multi-kernel horiz implementations}

851 \label{fig:gmem}

852\end{figure*}

853

854Given the algorithms and data structures presented, we have created

855twelve simulator families arranged into a taxonomy in

856\cref{fig:taxonomy}. The taxonomy’s second level groups the simulators

857on their spike transfer algorithm. It can either be the basic

858push-algorithm from listing \ref{lst:push} (gmem), the just-in-time

859algorithm from listing \ref{lst:jit} (jit), or the horizon-based

860algorithm from listing \ref{lst:horizon2} (horiz). These three

861algorithms differ in how eagerly they activate synapses. The gmem

862algorithm is maximally eager and activates all the spiking neuron’s

863synapses at once. Thus, it has to use global off-chip memory to transfer

864spikes, hence its name. The jit algorithm is maximally lazy and only

865activates synapses whose current should arrive at the next time

866step. The horizon algorithm is a mix of the two and activates as many

867synapses as its horizon allows. The taxonomy’s third level shows

868whether the simulator uses multiple communicating kernels (multi) or a

869single, monolithic kernel (mono). The last level specifies whether the

870simulator uses double (d) or single precision (s) floating point

871values for neuron state. This applies to the membrane potentials,

872presynaptic currents, and their related coefficients, but not to the

873spikes and spike transfer buffers which always are in

874single-precision. In the following, we use a notation based on their

875taxonomic grou** to refer to the simulators. For example,

876‘‘horiz/multi/s’’ refers to the horizon-based multi-kernel

877single-precision simulators.

878

879All simulators also have parameters for tuning their performance

880characteristics. Some are unique to certain families while others are

881shared. The \textit{update width} is the most

882fundamental one and controls how many neurons the simulator updates at

883once. We implement it by unrolling loops such as the loop on line 1 to

8844 of listing \ref{lst:jit}. More loop-unrolling is not always better

885as it causes the compiler to duplicate the hardware used to synthesize

886those loops. The gmem/mono and gmem/multi simulators have a

887\textit{synapse unroll} (SU) parameter which controls how many times

888their spike transfer loops are unrolled (line 6 and 7 of listing

889\ref{lst:push2} and lines 5 and 6 of listing \ref{lst:multi}). More

890unrolling allows more synapses to be fetched from memory

891simultaneously. The \textit{synapse classes} (SC) parameter determines

892the number of congruence classes (see \cref{sec:disjoint}). More

893classes allows more synapses to be activated in parallel, but also

894increases memory usage. The horiz simulators have a

895\textit{horizon length} (H) parameter which determines how many time

896steps worth of synapses the simulator activates at once. The jit

897simulators have a \textit{lane count} (LC) parameter that controls how

898many lanes they uses to transfer current. For convenience and

899efficiency, all parameters are positive powers of two.

900

901\section{Results}

902\label{sec:results}

903

904We evaluate our simulators on three axes -- correctness, speed, and

905energy consumption -- using the same experimental regimen for each. We

906synthesize every simulator for our FPGA with Intel FPGA SDK for OpenCL

907version 21.2 and use it to simulate ten seconds of biological time

908with a time step of 0.1 ms (i.e. 100 000 ticks) on ten different

909random microcircuit instances. From these 100 runs we compute mean

910values. We run all simulators on the full version (scale 1.0) of the

911microcircuit. Due to time constraints, we only synthesize a handful of

912simulators for every family and parameter combination. Some simulators

913could achieve better performance with a more exhaustive search.

914

915\subsection{Correctness}

916\begin{figure*}

917 \center

918 \includegraphics[width=\linewidth]{scatter}

919 \caption{Spike plot of the first 1000 ms of simulation. The

920 rhythmic nature of the microcircuit’s spiking pattern is apparent.}

921 \label{fig:scatter}

922\end{figure*}

923\begin{figure*}

924 \begin{subfigure}{.33\textwidth}

925 \centering

926 \includegraphics[width=0.9\linewidth]{counts}

927 \caption{Spiking rate}

928 \end{subfigure}

929 \begin{subfigure}{.33\textwidth}

930 \centering

931 \includegraphics[width=0.9\linewidth]{cvisi}

932 \caption{CV ISI}

933 \end{subfigure}

934 \begin{subfigure}{.33\textwidth}

935 \centering

936 \includegraphics[width=0.9\linewidth]{pearsonccs}

937 \caption{Pearson corr. coeff.}

938 \end{subfigure}

939 \caption{Kernel density estimates of \textbf{a)} spikes per second,

940 \textbf{b)} covariance of interspike intervals, and \textbf{c)}

941 Pearson correlation coefficient between binned spike trains for

942 neuron samples. Blue lines for NEST, gold for gmem/mono/s, and

943 green for jit/mono/s.}

944 \label{fig:stats}

945\end{figure*}

946\begin{figure}

947 \center

948 \includegraphics[width=\linewidth]{kldiv}

949 \caption{Kullback-Leibler (KL) divergence between distributions of

950 single-neuron firing rates, CV ISIs, and Pearson correlation

951 coefficients. Blue bars show KL divergences between NEST

952 simulations run on different random seeds, gold and green bars

953 between NEST and gmem/mono/s and jit/mono/s run on the same random

954 seeds.}

955 \label{fig:kldiv}

956\end{figure}

957

958We verify our simulators’ correctness using the same methods

959\cite{knight2018}; \cite{vanalbada2018}; \citep{potjans2014}; and

960others used. I.e., we simulate the microcircuit initialized with the

961same random seed with our simulators and with grid-based

962(double-precision) NEST which we treat as our reference. We discard

963spikes during the first second and compute the following statistics

964over the remaining nine seconds over the eight populations; rate of

965spikes per second, covariant of interspike intervals, and Pearson

966correlation coefficient over binned spike trains. We smooth each

967distribution with Gaussian kernel density estimation with bandwidth

968selected using Scott’s Rule.

969

970\Cref{fig:stats} shows the distributions plotted for NEST in blue, the

971gmem/mono/s family in gold, and the jit/multi/s family in green. The

972plots indicate that the simulators produce distributions that are very

973similar to NEST’s which implies that the accuracy loss caused by the

974reduced numerical precision is negligible. We do not plot the

975distributions for our other families of single-precision simulators as

976they are even closer to NEST’s distributions. Neither do we plot

977distributions for our double-precision simulators as they produce

978results that are spike-for-spike identical to NEST. The

979Kullback-Leibler (KL) divergence between the distributions, shown in

980\cref{fig:kldiv}, quantifies the apparent similarities. The figure’s

981blue, gold, and green bars show the KL divergences between two NEST

982simulations initialized with different random seeds, between NEST and

983gmem/mono/s, and between NEST and jit/mono/s. The latter two

984initialized with identical seeds. The divergence between two random

985seeds are much larger than between NEST and our simulators, indicating

986that they are accurate.

987

988The non-associativity of IEEE 754 floating-point is the main reason

989for the small differences. The order presynaptic current is added

990depends on the spike transfer algorithm, which affects rounding. The

991differences are stochastic and do not bias the result.

992

993\subsection{Simulation speed}

994

995%% \begin{table*}

996%% \centering

997%% \footnotesize

998%% \begin{tabular}{crrrrrrrrrr}

999%% \toprule

1000%% \textbf{Simulator} & \textbf{UW} & \textbf{SU} & \textbf{H} & \textbf{SC} & \textbf{RTF} & \textbf{Fmax} & \textbf{ALUT} & \textbf{FF} & \textbf{RAM} & \textbf{DSP}\\

1001%% \midrule

1002%% gmem/mono/s & 1 & 2 & n/a & n/a & 5.84 & 608 & 5% & 3% & 13% & 0%\\

1003%% ’’ & 2 & 2 & n/a & n/a & 5.24 & 608 & 5% & 3% & 14% & 0%\\

1004%% ’’ & 4 & 2 & n/a & n/a & 4.88 & 608 & 5% & 3% & 14% & 1%\\

1005%% ’’ & 8 & 2 & n/a & n/a & 4.78 & 609 & 6% & 3% & 13% & 1%\\

1006%% ’’ & 8 & 4 & n/a & n/a & 4.82 & 607 & 7% & 4% & 15% & 1%\\

1007%% ’’ & 8 & 8 & n/a & n/a & 5.07 & 572 & 8% & 5% & 17% & 1%\\

1008%% ’’ & 16 & 2 & n/a & n/a & 4.63 & 607 & 7% & 4% & 14% & 1%\\

1009%% ’’ & 16 & 4 & n/a & n/a & 4.72 & 607 & 8% & 4% & 15% & 2%\\

1010%% ’’ & 32 & 2 & n/a & n/a & 4.66 & 600 & 9% & 4% & 14% & 3%\\

1011%% \specialrule{0.25pt}{1pt}{1pt}

1012%% gmem/mono/d & 8 & 2 & n/a & n/a & 4.75 & 612 & 16% & 7% & 19% & 2%\\

1013%% ’’ & 16 & 2 & n/a & n/a & 4.65 & 600 & 27% & 11% & 19% & 4%\\

1014%% ’’ & 32 & 2 & n/a & n/a & 4.64 & 602 & 49% & 20% & 21% & 9%\\

1015%% \specialrule{0.25pt}{1pt}{1pt}

1016%% gmem/multi/s & 8 & 2 & n/a & n/a & 4.77 & 611 & 5% & 3% & 13% & 1%\\

1017%% ’’ & 16 & 2 & n/a & n/a & 4.92 & 608 & 6% & 3% & 12% & 1%\\

1018%% \specialrule{0.25pt}{1pt}{1pt}

1019%% gmem/multi/d & 8 & 2 & n/a & n/a & 4.85 & 609 & 15% & 7% & 18% & 2%\\

1020%% ’’ & 32 & 2 & n/a & n/a & 4.89 & 609 & 47% & 19% & 20% & 9%\\

1021%% \specialrule{0.25pt}{1pt}{1pt}

1022%% jit/mono/s & 2 & n/a & n/a & 2 & 5.07 & 602 & 5% & 3% & 12% & 0%\\

1023%% ’’ & 4 & n/a & n/a & 4 & 4.64 & 611 & 5% & 3% & 12% & 0%\\

1024%% ’’ & 8 & n/a & n/a & 8 & 4.41 & 596 & 6% & 3% & 12% & 1%\\

1025%% ’’ & 16 & n/a & n/a & 16 & 4.58 & 614 & 7% & 3% & 12% & 2%\\

1026%% \specialrule{0.25pt}{1pt}{1pt}

1027%% jit/mono/d & 8 & n/a & n/a & 8 & 4.73 & 612 & 16% & 6% & 19% & 2%\\

1028%% \specialrule{0.25pt}{1pt}{1pt}

1029%% jit/multi/s & 8 & n/a & n/a & 8 & 4.67 & 610 & 7% & 3% & 15% & 1%\\

1030%% \specialrule{0.25pt}{1pt}{1pt}

1031%% jit/multi/d & 4 & n/a & n/a & 4 & 5.27 & 604 & 11% & 5% & 24% & 1%\\

1032%% ’’ & 8 & n/a & n/a & 8 & 5.17 & 604 & 17% & 7% & 24% & 2%\\

1033%% \specialrule{0.25pt}{1pt}{1pt}

1034%% horiz/mono/s & 32 & n/a & 16 & 32 & 0.81 & 605 & 10% & 4% & 70% & 4%\\

1035%% ’’ & 64 & n/a & 16 & 64 & 0.87 & 577 & 14% & 6% & 72% & 7%\\

1036%% \specialrule{0.25pt}{1pt}{1pt}

1037%% horiz/mono/d & 16 & n/a & 16 & 16 & 0.84 & 602 & 27% & 11% & 77% & 5%\\

1038%% ’’ & 32 & n/a & 16 & 32 & 0.81 & 597 & 48% & 19% & 76% & 9%\\

1039%% \specialrule{0.25pt}{1pt}{1pt}

1040%% horiz/multi/s & 32 & n/a & 16 & 32 & 1.30 & 609 & 10% & 5% & 72% & 4%\\

1041%% \specialrule{0.25pt}{1pt}{1pt}

1042%% horiz/multi/d & 16 & n/a & 16 & 32 & 0.81 & 575 & xx% & xx% & xx% & x%\\

1043%% \bottomrule

1044%% \end{tabular}

1045%% \caption{Performance of some configurations of our simulators (Old make new!)}

1046%% \label{tbl:perf2}

1047%% \end{table*}

1048

1049\begin{table*}

1050 \centering

1051 \footnotesize

1052 \begin{tabular}{crrrrrrrrrrrr}

1053 \toprule

1054 \textbf{Simulator} & \textbf{UW} & \textbf{SU} & \textbf{H} & \textbf{SC} & \textbf{LC} & \textbf{RTF} & \textbf{Freq.} & \textbf{ALUT} & \textbf{Reg.} & \textbf{ALM} & \textbf{M20K} & \textbf{DSP}\\

1055 \midrule

1056 gmem/mono/s & 8 & 2 & n/a & n/a & n/a & 4.79 & 601 & 126k & 313k & 21% & 23% & 1%\\

1057 ’’ & 64 & 2 & n/a & n/a & n/a & 4.93 & 585 & 155k & 391k & 26% & 24% & 2%\\

1058 \specialrule{0.25pt}{1pt}{1pt}

1059 gmem/mono/d & 8 & 1 & n/a & n/a & n/a & 5.38 & 608 & 141k & 371k & 24% & 27% & 2%\\

1060 ’’ & 8 & 2 & n/a & n/a & n/a & 4.71 & 608 & 147k & 394k & 25% & 28% & 2%\\

1061 ’’ & 16 & 2 & n/a & n/a & n/a & 4.61 & 605 & 182k & 477k & 30% & 28% & 4%\\

1062 \specialrule{0.25pt}{1pt}{1pt}

1063 gmem/multi/s & 4 & 4 & n/a & n/a & n/a & 4.91 & 600 & 122k & 294k & 20% & 22% & 0%\\

1064 ’’ & 64 & 1 & n/a & n/a & n/a & 5.65 & 585 & 128k & 340k & 22% & 22% & 6%\\

1065 \specialrule{0.25pt}{1pt}{1pt}

1066 gmem/multi/d & 8 & 1 & n/a & n/a & n/a & 5.52 & 609 & 139k & 365k & 23% & 26% & 2%\\

1067 ’’ & 8 & 2 & n/a & n/a & n/a & 4.86 & 605 & 145k & 360k & 24% & 27% & 2%\\

1068 \specialrule{0.25pt}{1pt}{1pt}

1069 horiz/mono/s & 32 & n/a & 16 & 32 & n/a & 0.81 & 608 & 127k & 329k & 22% & 79% & 4%\\

1070 ’’ & 64 & n/a & 16 & 16 & n/a & 0.85 & 604 & 143k & 375k & 25% & 80% & 7%\\

1071 \specialrule{0.25pt}{1pt}{1pt}

1072 horiz/mono/d & 4 & n/a & 8 & 4 & n/a & 1.74 & 609 & 122k & 333k & 21% & 55% & 1%\\

1073 ’’ & 32 & n/a & 16 & 32 & n/a & 0.82 & 601 & 258k & 653k & 42% & 85% & 9%\\

1074 \specialrule{0.25pt}{1pt}{1pt}

1075 horiz/multi/s & 4 & n/a & 8 & 4 & n/a & 1.73 & 610 & 108k & 287k & 19% & 51% & 1%\\

1076 ’’ & 16 & n/a & 16 & 16 & n/a & 0.81 & 605 & 123k & 330k & 22% & 80% & 2%\\

1077 ’’ & 32 & n/a & 16 & 32 & n/a & 0.79 & 601 & 133k & 333k & 23% & 81% & 4%\\

1078 \specialrule{0.25pt}{1pt}{1pt}

1079 horiz/multi/d & 4 & n/a & 8 & 4 & n/a & 1.79 & 583 & 126k & 319k & 22% & 57% & 1%\\

1080 ’’ & 16 & n/a & 16 & 16 & n/a & 0.80 & 607 & 187k & 492k & 32% & 85% & 5%\\

1081 ’’ & 32 & n/a & 16 & 32 & n/a & 0.79 & 600 & 265k & 664k & 43% & 86% & 9%\\

1082 \specialrule{0.25pt}{1pt}{1pt}

1083 jit/mono/s & 16 & n/a & n/a & 16 & 16 & 1.47 & 590 & 116k & 319k & 22% & 79% & 7%\\

1084 \specialrule{0.25pt}{1pt}{1pt}

1085 jit/mono/d & 4 & n/a & n/a & 4 & 8 & 2.23 & 611 & 122k & 317k & 21% & 56% & 2%\\

1086 ’’ & 16 & n/a & n/a & 16 & 16 & 1.44 & 597 & 182k & 478k & 32% & 85% & 10%\\

1087 ’’ & 32 & n/a & n/a & 32 & 16 & 1.50 & 584 & 268k & 704k & 44% & 85% & 20%\\

1088 \specialrule{0.25pt}{1pt}{1pt}

1089 jit/multi/s & 16 & n/a & n/a & 16 & 16 & 1.43 & 593 & 120k & 325k & 22% & 79% & 7%\\

1090 \specialrule{0.25pt}{1pt}{1pt}

1091 jit/multi/d & 4 & n/a & n/a & 4 & 8 & 2.21 & 600 & 125k & 317k & 22% & 55% & 2%\\

1092 ’’ & 8 & n/a & n/a & 8 & 16 & 1.56 & 602 & 147k & 393k & 26% & 84% & 5%\\

1093 ’’ & 16 & n/a & n/a & 16 & 16 & 1.34 & 606 & 186k & 492k & 33% & 85% & 10%\\

1094 \bottomrule

1095 \end{tabular}

1096 \caption{Speed and resource usage for some simulators. The table

1097 includes each family’s fastest simulator and some

1098 others for comparison purposes.}

1099 \label{tbl:perf3}

1100\end{table*}

1101

1102\begin{figure}

1103 \center

1104 \includegraphics[width=\linewidth]{update_widths}

1105 \caption{RTF as a function of update width for the fastest gmem/mono/s

1106 (blue), gmem/mono/d (gold), gmem/multi/s (green), and gmem/multi/d

1107 (red) simulators. Update widths larger than four does not decrease

1108 RTF.}

1109 \label{fig:uwidth}

1110\end{figure}

1111

1112\Cref{tbl:perf3} presents the performance and resource usage of our

1113fastest simulator configuraions. The first column shows the

1114simulator’s family in slash-notation (see \cref{sec:impl}). The next

1115five its parameters; \textit{update width} (UW), \textit{synapse

1116 unroll} (SU), \textit{horizon length} (H), \textit{synapse classes}

1117(SC), and \textit{lane count} (LC). The following column shows its

1118real-time factor (RTF), defined as the time taken to run the

1119simulation -- wall-clock time -- divided by the duration of the

1120simulated biological time (10 seconds). We measure the wall-clock time

1121as the time from the first \verb!clEnqueueNDRangeKernel! call until the

1122final simulation result can be read from the FPGA’s memory. As the RTF

1123does not vary beyond two decimal places even on runs on networks

1124initialized with different random seeds, we just report its mean. The

1125next columns shows the simulator’s operating frequency in MHz and FPGA

1126resource usage. \Cref{fig:rtf} plots the RTF of the fastest

1127simulators from every family as a bar chart.

1128

1129A startling finding is that there is no significant performance

1130difference between double and single precision neuron state and

1131between multi- and single-kernel implementations. The fastest

1132horiz/mono/s and horiz/multi/d simulators RTF is 0.81 and 0.79 which

1133is very close to each other. Presumably, the spike transfer phase is

1134much more expensive than the neuron update phase so making the latter

1135run faster, either by using single-precision or by overlap** the

1136update and transfer phases, does not improve performance. For the same

1137reason, increasing the gmem simulator’s \textit{update width} is

1138ineffective, as \cref{fig:uwidth} shows. Doubling the \textit{update

1139 width} roughly halves the number of cycles spent updating neuron

1140state, but even that is insignificant. The performance even detoriates

1141for \textit{update width} larger than 16. Likely because the

1142FPGA’s DDR interface is 512 bits wide.

1143

1144Neither the multi- nor single-kernel gmem simulators benefit from

1145more \textit{synapse unroll}. The reason could be because

1146the unrolled versions of the loop contains a false memory

1147dependency. There is no way of letting the OpenCL compiler know

1148that two iterations of the loop body -- \verb!W[t + d, j] += w! --

1149writes to distinct memory locations so the compiler refuses to

1150schedule multiple writes per clock cycle.

1151

1152The jit and horiz simulators are markedly faster than the gmem

1153simulators because they transfer spikes in on-chip memory. The best

1154gmem simulator has an RTF of 4.61 while it is 1.50 for the best jit

1155simulator and 0.79 for the best horiz simulator. The latter simulators

1156also uses up to 90% of the on-chip memory for storing the horizon and

1157lane buffers. The results suggest that the larger these buffers are

1158the better the performance. Unfortunately, our FPGA can not fit

1159horizons longer than 16 time steps or more than 16 lanes. The horiz

1160simulators’ performance edge over the jit simulators is due to them

1161running the spike transfer loop fewer times. The jit simulators run it

1162$d_\mathrm{max}$ times (i.e. 64) for every spiking neuron, while the

1163horizon simulators only runs it $d_\mathrm{max}/h$ times (i.e.,

116464/16=4 times). The horiz simulators also does not have to sum the

1165incoming current from multiple lanes when updating the neurons state.

1166

1167\begin{figure*}

1168 \center

1169 \includegraphics[width=\linewidth]{synapse_classes}

1170 \caption{The horizon and JIT algorithms’ RTF as a function of the

1171 number of \textit{synapse classes}. The \textit{horizon length}

1172 and \textit{lane count} parameters are both 16. The four colors

1173 represent the four implementation styles.}

1174 \label{fig:synapse_classes}

1175\end{figure*}

1176

1177\Cref{fig:synapse_classes} plots RTF as a function of the number of

1178\textit{synapse classes} for the horiz and jit simulators. For both,

1179performance improves until the parameter reaches 32. The more classes,

1180the more synapses can be activated in parallel which, clearly, is

1181important for performance. The drawback of increasing the number of

1182classes is memory waste. Especially for the jit simulators which

1183iterate many fewer times per spike transfer loop. With 16 classes the

1184average occupancy is only about 74%. It decreases to 47% with 32

1185classes, meaning that the simulator accesses more than twice as much

1186data than it needs. Interestingly, there is no performance

1187penalty. The benefits of handling many synapses in parallel is worth

1188lots of extra off-chip memory reads. Perhaps since we store the synapses

1189in contiguous memory reading them is quite cheap.

1190

1191All simulators run at around 600 MHz, considered very good for HLS. In

1192our experience, operating frequencies much lower than that were

1193caused by inefficiently banked on-chip memory, unaccounted for

1194loop-carried dependencies, or similar issues.

1195

1196%% The fastest horiz and jit simulators have a \textit{horizon length}

1197%% and \textit{lane count} of 16.

1198

1199%% Both the horiz and jit simulators transfer spikes through local memory

1200%% and have a large performance advantage compared to the gmem

1201%% simulators. These simulators use between 80 to 90% of local memory

1202%% and would surely benefit from having even more available. A \textit{horizon length

1203

1204%% Doubling the

1205%% \textit{horizon length} would require twice as much local memory, but

1206%% would cut the number of times the neuron’s synapses had to be accessed

1207%% in half.

1208

1209%% The performance of the gmem simulators is comparable to that of the

1210%% lmem/jit simulators, even though the former transmit spikes through

1211%% global memory. As they activate all of the spiking neuron’s synapses

1212%% in one go -- rather than spreading out the activations over multiple

1213%% time steps -- they are simpler and require less bookkee** than their

1214%% local-memory-using counterparts. Likely, this is what makes them

1215%% competitive. Neither the monolithic or multi-kernel versions of the

1216%% simulators benefit from having their spike propagation loops unrolled

1217%% more than four times because the unrolled version of the loop contains

1218%% a false memory dependency. As there is no way of letting the OpenCL

1219%% compiler know that two iterations of the loop body --

1220%% \verb!W[t + d, j] += w! -- always writes to distinct memory addresses

1221%% the compiler refuses to schedule multiple writes per cycle.

1222

1223

1224

1225%% The results also show that the multi-kernel implementations are less

1226%% efficient than the single-kernel ones. While channels are low-overhead

1227%% on Intel FPGAs, the overhead is still present and offsets whatever

1228%% benefit one gets from running the algorithm’s phases in

1229%% parallel. Presumably, because spike propagation is by far the most

1230%% expensive phase. This is in concordance with another observation; the

1231%% speed difference between single and double precision neuronal state is

1232%% negligible. Performance is not limited by compute resources, but by

1233%% memory bandwidth. However, double precision still carries a penalty in

1234%% that we have to use twice as much block memory to store the neuronal

1235%% state. The number of DSPs required is also larger than for

1236%% single-precision implementations.

1237

1238%% The lmem/horiz simulators are the fastest by a wide margin. As the

1239%% performance difference between the mono and multi variants indicate,

1240%% they benefit greatly from running the update and propagation phases in

1241%% parallel. This is because they use local memory to transmit spikes so

1242%% the propagation phase is, relatively speaking, cheaper and the neuron

1243%% update phase costlier. Thus, the benefit of running the phases in

1244%% parallel is bigger. Moreover, the usage of channels means that kernels

1245%% do not have to produced buffers of spiking neurons that are then

1246%% consumed in a future phase. Instead, the kernels can send spiking

1247%% neurons to the consuming kernel directly. All kernels of this family

1248%% benefit from larger horizons. Larger horizons mean that more synapses

1249%% can be activated simultaneously which is good for parallelization. It

1250%% also meshes well with larger \textit{synapse classes} -- the more

1251%% synapses per class the less memory waste. Unfortunately, the board’s

1252%% block ram can not fit horizons larger than 16.

1253

1254\begin{figure}

1255 \center

1256 \includegraphics[width=\linewidth]{rtf_bars}

1257 \caption{RTF of the fastest simulator of each type. The four colors

1258 represent the four implementation styles.}

1259 \label{fig:rtf}

1260\end{figure}

1261

1262

1263\subsection{Energy Usage}

1264

1265We use the Terasic Dashboard GUI to measure the energy usage of our

1266fastest simulator -- horiz/multi/s with the parameters H=16, UW=32,

1267and SC=32. The dashboard connects to the Agilex 7 via a MAX 10 device

1268that continuously monitors the voltage and current rails going into

1269the board and the FPGA itself \citep{dkug}, allowing us to measure

1270both a ‘‘pessimistic’’ power consumption for the whole board (which

1271includes unused peripherals that draw power) and an ‘‘optimistic’’ one

1272for only the FPGA fabric. Due to the device’s low sampling rate we

1273compute the total energy usage as the product of the maximum power

1274draw and the simulation time. According to our measurements, the

1275simulator pessimistically requires 44.9 W $\cdot$ 8.13 s = 101 mWh and

1276optimistically 16.3 W $\cdot$ 8.13 s = 37 mWh to simulate 10 seconds

1277of biological time. The pessimistic energy per synaptic event (metric

1278defined in \cite{vanalbada2018}) is 21 nJ and the optimistic one 9

1279nJ. These values are upper bounds and the actual energy usage may be

1280lower.

1281

1282\section{Discussion}

1283\label{sec:disc}

1284

1285Our main contribution is the presentation and analysis of methods for

1286creating FPGA-based simulators competitive in speed and energy usage

1287with the state-of-the-art in SNN simulation. We hope that others will

1288create even faster simulators by adopting and refining our algorithms

1289and implementation techniques for their hardware. Some methods are

1290endogenous to our hardware. For example, by running multiple kernels

1291in parallel that communicate with each other via channels, we overlap

1292different phases of the simulation algorithm. This technique has no

1293direct GPU-equivalent as different kernel types cannot communicate. It

1294also relies on the Intel-specific channel extension and may not work

1295well even on other vendors’ FPGAs. On the other hand, the technique

1296for interleaving synapses to improve banking and reduce conflicts

1297(\cref{sec:disjoint}) is adaptable to non-FPGA hardware. The same goes

1298for the horizon-based transfer algorithm which should suit any device

1299with enough fast local memory.

1300

1301Push-based SNN simulation mandates irregular memory accesses and has

1302low computational intensity; qualities that it shares with many other

1303graph processing problems. This is evidenced by the fact that most of

1304our simulators use less than 10% of the FPGA’s DSP resources, while

1305many consume over 80% of its on-chip memory. More or even faster

1306arithmetic resources would be useless. However, more on-chip

1307memory would be very beneficial because we could use the basic spike

1308transfer algorithm and would not need to bother with the more complex

1309horizon algorithm. And higher bandwidth and lower latency memories

1310would allow the simulators to transmit spikes faster. This finding

1311implies that simulators would probably achieve excellent performance

1312with neuron models more complex than LIF -- as long as the

1313interactions between neurons remain the same and as long as their

1314memory consumption does not grow.

1315

1316\subsection{FPGAs for HPC}

1317

1318We believe that FPGAs have unique advantages for HPC; they can be

1319tailored for the problem at hand thanks to their

1320malleability. However, FPGAs and particularily \textit{tools for

1321 designing for FPGAs} have many glaring weaknesses. Unlike compilers

1322for software, which emit the same machine code for the same source

1323code, synthesizers use optimization algorithms based on randomness so

1324good performance is contingent on choosing lucky random seeds. Even if

1325the differences between lucky and unlucky seeds are small, the

1326randomness makes squeezing out the last few percent of performance

1327frustrating. An algorithmic change or parameter tuning causing a small

1328performance improvement may be due to chance. Moreover, a single

1329synthesis can take several hours even on top-of-the-line hardware,

1330compounding this problem. Writing performance-critical code is an

1331experimental process, wherein one needs to test hundreds or thousands

1332of ideas to see what yields the best performance. Long turn-around

1333times slows the process down. Simulators do not help as their

1334performance does not reflect the performance of real hardware. For

1335these reasons, FPGAs are decidedly more complex to develop for than

1336GPUs.

1337

1338What is a good distribution of FPGA resources for HPC? For us, our

1339board’s distribution is far from perfect. Our fastest designs used

1340almost all on-chip memory, with plenty of ALUTs, FFs, and DSPs to

1341spare. If our use-case and designs are close to the norm then it would

1342be wise for FPGA vendors to trade-off logic resources for more on-chip

1343memory. And trends in the HPC field indicate that memory -- not

1344arithmetic -- very much is a limiting factor for many

1345problems. However, inevitably, whatever resource distribution the

1346vendor chooses, it will not be ideal for some designs. Optimizing an

1347FPGA so that as many designs as possible can take advantage of as much

1348of its resources as possible seems exceedingly difficult.

1349

1350\subsection{HLS for HPC}

1351

1352In this work we choose HLS over traditional design methods because of

1353its purported productivity advantages. How much performance did HLS

1354cost us and how much longer would it have taken us to implement the

1355simulators in an HDL? The question is an instance of the classic

1356dichotomy between performance and productivity that appears in many

1357corners of computer science. The literature suggests that in general

1358the productivity gains of HLS are large and the performance losses are

1359either non-existent or low \citep{lahti2018, pelcat2016}. However, one

1360can ask whether this holds true for performance-critical design?

1361

1362HLS definitely allowed us to explore algorithmic ideas at a rapid

1363pace. In particular, scheduling stallable loops and interfacing with

1364DDR would have been at least an order of magnitude more work to

1365implement (and debug!) in an HDL. It also helped that most -- but not

1366all -- of the OpenCL code we wrote were runnable verbatim on non-FPGA

1367targets. However, for performance judicious use of compiler

1368directivies is essential, something we struggled with. For example, adding

1369\verb!#pragma disable_loop_pipelining! on a loop in a gmem simulator

1370increased its operating frequency by almost 200 MHz since the compiler

1371could deduce that a memory port would not be shared across loop

1372iterations. On the other hand, removing the directive on a loop in a

1373horiz simulator more than doubled its performance! On several

1374occassions, misplaced \verb!#pragma ivdep! directives caused bugs that

1375were difficult to troubleshoot. Having to ‘‘nudge’’ the compiler via

1376directives to make the right decisions made us feel like we were not

1377fully in control and was at times frustrating.

1378

1379As we have no baseline to compare with, estimating the performance

1380cost of HLS is difficult. Our designs run at over 600 MHz which --

1381while a fair bit lower than the theoretical limit -- is in the upper

1382range of what typical Agilex 7 HDL designs runs at. If we are correct

1383in that performance is mostly bounded by memory resources, then

1384non-optimal performance is due to algorithmic choices and not due to

1385the choice of implementation language.

1386

1387\subsection{Future Research}

1388

1389\begin{figure}

1390 \center

1391 \includegraphics[width=\linewidth]{delays}

1392 \caption{Synaptic delay distributions for excitatory (blue) and

1393 inhibitory synapses (gold). Almost all probability mass lies to

1394 the left of 32 (dashed line).}

1395 \label{fig:delays}

1396\end{figure}

1397

1398Time constraints forced us to leave many ideas for better performance

1399unexplored. We list some of them here.

1400

1401Our results demonstrate that single-precision floating-point is

1402sufficient. Half-precision or some 16-bit fixed-point representation

1403could also be sufficient. If so, four bytes would be enough to

1404represent synapses which would -- more or less -- cut our simulator’s

1405memory demand in half.\footnote{Assuming we store some bits of the

1406delay and destination neuron index implicity in the index.} And, as the

1407synapses current is sampled from Gaussians with known means, the

1408representation’s size could perhaps be reduced further by storing the

1409current as \textit{deviations from the mean}. Presumably, fewer bits are

1410needed to represent the deviations accurately since they are rather

1411small.

1412

1413Along the same lines, one could exploit the fact that, as

1414\cref{fig:delays} shows, synaptic delays are not uniformly

1415distributed. Something our algorithms are oblivious to. It means, for

1416example, that the gmem simulators needto use off-chip memory for the

1417spike transfer array. One could perhaps retain the spike transfer

1418array for ‘‘fast’’ synapses with delays shorter than, say, 32 and

1419another more space-efficient format for ‘‘slow’’ synapses with longer

1420delays.

1421

1422As we have emphasized, we cannot safely transfer spikes from two

1423neurons simultaneoulsy as they may write to the same memory

1424locations. But this is only true if they have synapses with the same

1425destination neuron and delay. We can apply the idea from

1426\cref{sec:disjoint} on the ‘‘neuron level’’ and partition the

1427neurons into disjoint classes so that synapses of neurons of

1428different classes never trigger writes to the same memory locations.

1429

1430Our FPGA has four DDR memories each equipped with a 512 bit read/write

1431port. We are likely not utilizing all bandwidth properly since we use

1432the compiler’s default off-chip memory organization. Plausibly, we

1433could see a substantial performance improvement by carefully laying

1434out off-chip memory manually. For example, by distributing the

1435synaptic data (by far the largest data structure) over the four

1436memories.

1437

1438Another line of research could be to deploy a cluster of FPGAs for SNN

1439simulation. As the communication between the devices likely becomes

1440the bottleneck, it is a much different problem from using one device.

1441

1442\subsection{State-of-the-art}

1443\label{sec:sota}

1444\begin{table*}

1445 \centering

1446 \footnotesize

1447 \begin{tabular}{lllrrr}

1448 \toprule

1449 \textbf{Work} & \textbf{Simulator} & \textbf{Hardware} & \textbf{Node} & \textbf{RTF} & \textbf{Syn.Ev En.}\\

1450 \midrule

1451 \textbf{This work} & \textbf{horiz/mono/s} & \textbf{1 Agilex 7 FPGA} & \textbf{10} & \textbf{0.81} & \textbf{25}\\

1452 \cite{heittmann2022} & IBM INC-3000 & 432 Xilinx XC Z7045 SoC & 28 & 0.25 & 783\\

1453 \cite{kauth2023b} & neuroAIx & 35 NetFPGA SUME & 28 & 0.05 & 48\\

1454 \cite{golosio2021} & NeuronGPU & 1 GeForce RTX 2080 Ti & 12 & 1.06 & 180\\

1455 \cite{golosio2021} & NeuronGPU & 1 Tesla V100 & 12 & 1.64 & -\\

1456 \cite{knight2018} & GeNN & 1 GeForce RTX 2080 Ti & 12 & 1.40 & -\\

1457 \cite{knight2018} & GeNN & 1 Tesla V100 & 12 & 2.16 & 470\\

1458 \cite{vanalbada2018} & SpiNNaker & 217 ASIC & 130 & 20.00 & 5900\\

1459 \cite{rhodes2020} & SpiNNaker & 318 ASIC & 130 & 1.00 & 600\\

1460 \cite{kurth2022} & NEST & 2 AMD EPYC Rome & 14 & 0.53 & 480\\

1461 \bottomrule

1462 \end{tabular}

1463 \caption{State-of-the-art in hardware-accelerated microcircuit

1464 simulation. Node refers to the technology node in nanometers, RTF

1465 how much slower than realtime the simulator runs (lower is

1466 better), and Syn.Ev En. estimated energy consumption per synaptic

1467 event in nano-Joule.}

1468 \label{tab:power}

1469\end{table*}

1470\begin{figure}

1471 \center

1472 \includegraphics[width=\linewidth]{competition}

1473 \caption{RTFs of some SNN simulators}

1474 \label{fig-rtf}

1475\end{figure}

1476\begin{figure}

1477 \center

1478 \includegraphics[width=\linewidth]{energy}

1479 \caption{Energy per synaptic event of some SNN simulators}

1480 \label{fig-energy}

1481\end{figure}

1482

1483While we are far from the first to have investigated FPGA-based SNN

1484simulation, our investigation is qualitatively different from most

1485others. First, our focus is accurate SNN simulation and not solving a

1486specific

1487task. \cite{gupta2020,han2020,li2021,zheng2021,carpegna2022,liu2023,carpegna2024}

1488all implement layered SNNs that excel at classifying images from the

1489MNIST dataset. For them classification accuracy is much more important

1490than simulation speed or energy consumption. Second, many FPGA

1491simulators simulate SNNs much smaller than the Potjans-Diesmann

1492microcircuit or uses non-LIF neurons, making their results

1493incomparable to ours. For example, \cite{pani2017} simulates up to 1,440

1494Izhikevich neurons in real-time and \cite{shama2020} simulates 150

1495Hodgkin-Huxley neurons on a Virtex-2 FPGA. In contrast, NeuroFlow by

1496\cite{cheung2016} simulates up to 600,000 neurons on a cluster of six

1497FPGAs four times slower than real-time. However, it uses a

14981 ms timestemp (ten times larger than the \textit{de facto} standard

14990.1 ms), organizes neurons into two-dimensional grids, and spatially

1500constrains synapses. That is, most synapses connect to the neuron’s

1501nearest neighbours. \cite{trensch2022} proposes a simulator for a

1502Zynq-7000 SoC and measure its performance on a network of 800

1503excitatory and 200 inhibitory Izhikevich neurons. Their architecture

1504divides the FPGA into 16 identical processing blocks, each supporting

150564 neurons or 1,024 neurons in total per chip. They argue that if a

1506cluster of their devices were deployed, it could simulate the

1507Potjans-Diesmann microcircuit seven times faster than

1508real-time. Though they did not test their claim.

1509

1510The simulators we think are the most similar to ours are: IBM INC-3000,

1511neruoAIx, NeuronGPU, GeNN, SpiNNaker, and NEST. IBM-INC 3000 and

1512neuroAIx are both FPGA clusters, consisting of 432 Xilinx XC boards

1513and 35 NetFPGA SUME boards respectively. IBM-INC 3000 is four

1514times faster than real-time and neuroAIx twenty time faster than

1515real-time \citep{kauth2023,heittmann2022}. GeNN and NeuronGPU are both

1516GPU-based simulators written in CUDA and runs on NVIDIA’s line of GPUs

1517\citep{knight2018,golosio2021}. On a Tesla V100 GPU, GeNN simulate

1518the microcircuit at about half the speed of real-time and NeuronGPU

15195% slower than real-time. SpiNNaker is an older dedicated

1520neuromorphic hardware system built out of hundreds of ASICs

1521\citep{furber2013}. NEST is a CPU-based simulator which runs

1522twice the speed of real-time on two AMD EPYC Rome nodes

1523\citep{kurth2022}.

1524

1525\Cref{tab:power,fig-rtf,fig-energy} show how our best simulator stack

1526up against the competition. The energy usage of our fastest

1527simulator compares favorably to other results. Likely

1528partially due to to the Agilex 7’s smaller technology node and to us

1529confining our implementation to one FPGA. \cite{kauth2023b} report a

1530much lower RTF than us, but use 35 boards, each drawing 26.54 W on

1531average, resulting in a higher total energy usage for the same amount

1532of biological time. They, like us, report an ‘‘all-inclusive’’ value

1533for the energy usage so a dedicated platform -- without any unused

1534peripherals -- could consume much less energy. It goes without

1535saying that fairly comparing performance of systems implemented on

1536different architectures, with different design trade-offs, and

1537accuracy constraints is tricky. One system with excellent performance

1538may be inadequate in other regards. For example, less configurability

1539may improve a system’s performance, but make it unusable for certain

1540applications. It should be noted that most simulators, unlike ours,

1541replace thalamic spikes with DC input which limits their

1542applicability.

1543

1544

1545

1546

1547

1548

1549

1550

1551%% Fairly comparing performance of systems implemented on different

1552%% architectures, with different design trade-offs, and accuracy

1553%% constraints is tricky. One system with excellent performance may be

1554%% inadequate in other regards. For example, less configurability may

1555%% improve a system’s performance, but make it unusable for certain

1556%% applications. Nevertheless, \cref{tab:power,fig-rtf,fig-energy}

1557%% compare our best simulator’s RTF and energy per synaptic event

1558%% (defined in \cref{sec:results}) with that of some well-known

1559%% simulators from the literature. GeNN and NeuronGPU are two GPU-based

1560%% simulators written in CUDA and runs on NVIDIA’s line of GPUs

1561%% \citep{knight2018, golosio2021}. On a Tesla V100 GPU, GeNN simulate

1562%% the microcircuit at about half the speed of real-time and NeuronGPU

1563%% 5% slower than real-time. IBM INC-3000 and neuroAIx are two FPGA

1564%% clusters that run four times faster than real-time and twenty times

1565%% faster than real-time \citep{kauth2023, heittmann2022}. They consist

1566%% of 432 Xilinx XC SoC boards and 35 NetFPGA SUME boards,

1567%% respectively. SpiNNaker is an older dedicated neuromorphic hardware

1568%% system built out of hundreds of ASICs \citep{furber2013}. It should be

1569%% noted that, unlike ours, most of these simulators replace thalamic

1570%% spikes with DC input.

1571

1572%% As \cref{fig-energy} indicates, the energy usage of our best simulator

1573%% compares favorably to published results. This is likely partially due

1574%% to to the Agilex 7’s smaller technology node and to us confining our

1575%% implementation to one FPGA. \cite{kauth2023b} report a much lower RTF

1576%% than us, but uses 35 boards, each of which on average draws 26.54 W

1577%% which results in more energy use for the same amount of biological

1578%% simulation time. They, like us, report an ‘‘all-inclusive’’ value for

1579%% the energy usage so a dedicated platform -- without any unused

1580%% peripherals -- would conceivably use much less energy.

1581

1582

1583

1584

1585

1586

1587

1588

1589

1590%% In addition to the above presented simulators, a number of FPGA

1591%% implementations simulate relatively small networks, which we contend

1592%% is materially different from simulating large networks. The modular

1593%% and efficient FPGA design \cite{pani2017} created simulates networks

1594%% of up to 1,440 Izhikevich neurons in real-time with a time-step of 0.1

1595%% ms. However, their design assumed a fuly-connected connectome and did

1596%% not model multapses nor spike transmission over synapses. The

1597%% simulator created by \cite{shama2020} simulated a 150-neuron

1598%% Hodgkin-Huxley (HH) SNN on a Virtex-2 FPGA. While the HH neuronal

1599%% model is more complicated than the LIF model, the authors, like

1600%% \cite{pani2017}, did not account for spike transmission delays in

1601%% their simulator.

1602

1603%% Other implementations are not for simulation per se, but for executing

1604%% a specific task. Those include the layered FPGA-based

1605%% SNN-implementations presented in \cite{gupta2020, han2020, li2021,

1606%% zheng2021, carpagena2022, liu2023, carpegna2024} which classify

1607%% images from the MNIST dataset. They all use layered SNNs which are

1608%% materially different from disorganized ones.

1609

1610%% \section{State-of-the-art}

1611

1612%% \begin{table*}[t]

1613%% \centering

1614%% \footnotesize

1615%% \begin{tabular}{llr}

1616%% \toprule

1617%% \textbf{Article} & \textbf{Board} & \textbf{MNIST accuracy}\\

1618%% \midrule

1619%% \cite{zheng2021} & Xilinx ZCU102 & 90.53%\\

1620%% \bottomrule

1621%% \end{tabular}

1622%% \caption{SNNs for classification on one FPGA}

1623%% \end{table*}

1624

1625%% \begin{table*}

1626%% \centering

1627%% \footnotesize

1628%% \begin{tabular}{lllrr}

1629%% \toprule

1630%% \textbf{Article} & \textbf{Board} & \textbf{Neuron model} & \textbf{Neuron cnt.} & \textbf{Time step}\\

1631%% \midrule

1632%% \cite{pani2017} & Xilinx XC 6VLX240T & Izhikevich & 1,440 & 0.1 ms\\

1633%% \bottomrule

1634%% \end{tabular}

1635%% \caption{SNNs for simulation of small networks on one FPGA}

1636%% \end{table*}

1637

1638

1639%% \begin{table*}[t]

1640%% \centering

1641%% \footnotesize

1642%% \begin{tabular}{llllrrr}

1643%% \toprule

1644%% \textbf{Article} & \textbf{Hardware} & \textbf{Neuron} & \textbf{Connectome} & \textbf{Neurons} & \textbf{Synapses} & \textbf{Sim. Speed}\\

1645%% \midrule

1646%% \cite{han2020} & Single FPGA & LIF & ?? & ?? & ?? & ?\\

1647%% \cite{trensch2022} & Single FPGA & IZ & ?? & 1,000 & ?? & ??\\

1648%% \cite{carpegna2024} & Single FPGA & LIF & Layered & & & n/a\\

1649%% \bottomrule

1650%% \end{tabular}

1651%% \caption{Overview of simulators for disorganized SNNs}

1652%% \end{table*}

1653

1654%% In this section we gauge the state-of-the-art for SNN-simulation on

1655%% FPGAs and explain how it relates to the implementation we

1656%% propose. Characterizing the state-of-the-art is challenging due to

1657%% differences in hardware, network’s intended purpose, implementation

1658%% techniques, correctness constraints, simulation resolution,

1659%% etc. Performance is unlikely to scale linearly with network size; a

1660%% result from a small network can’t easily be compared with a result

1661%% from a larger network. For instance, memory bottlenecks can cause

1662%% catastrophic degradations in performance as memory demand

1663%% increases. And, as discussed in \cref{sec:topo}, topology is extremely

1664%% important so implementations for layered SNNs cannot be fruitfully

1665%% compared with simulators for more brain-like, disorganized SNNs even

1666%% when targeting the same hardware.

1667

1668%% Some simulators are designed for FPGA clusters, or other custom

1669%% large-scale architetures, which for obvious reasons makes it difficult

1670%% to compare their perfromance with simulators utilizing a single board

1671%% (such as ours). An example of an early simulator for FPGA clusters is

1672%% NeuroFlow \citep{cheung2016}. It simulated up to 600,000 neurons on

1673%% six FPGAs only about four times slower than real time. However, the

1674%% time step was set to 1 ms (ten times larger than the \textit{de facto}

1675%% standard 0.1 ms used in many other simulators), neurons were organized

1676%% into two-dimensional grids, and synapses were spatially

1677%% constrained. That is, most synapses connected a neuron with its

1678%% immediate north, south, east, or west neighbour. neuroAIx is a more

1679%% recent FPGA cluster-based simulator \citep{kauth2023}. It uses 35

1680%% NetFPGA SUME (SoC based on Virtex 7) nodes and simulates the

1681%% Potjans-Diesmann microcircuit 20 times faster than biological

1682%% realtime. Like many of its predecessors, it approximates thalamic

1683%% input as direct current. Another large-scale architecture is the

1684%% neural supercomputer IBM INC-3000 \citep{heittmann2022}. It uses a

1685%% whop** 432 Xilinx XC Z7045 SoC and simulates Potjans-Diesmann’s

1686%% microcircuit four times faster than realtime.

1687

1688

1689

1690%% For completeness sake, we also survey some GPU-based

1691%% SNN-simulators. GeNN presented in \cite{knight2018} simulates the

1692%% Potjans-Diesmann microcircuit at half the speed of real-time on a

1693%% Tesla V100 GPU. The GPU library NeuronGPU, presented in

1694%% \cite{golosio2021}, is capable of simulating the same network about

1695%% 5% slower than realtime.

1696

1697%% \cite{han2020} implemented an LIF SNN on an Xilinx ZC706 FPGA for

1698%% classifying images from the MNIST dataset. They found that the network

1699%% could classify 161 frames per second with an accuracy of 97%, only

1700%% drawing 0.477W of power. For updating neurons they used a novel hybrid

1701%% algorithm that was neither fully synchronous nor fully

1702%% asynchronous. They benchmarked their FPGA implementation against an

1703%% SNN implemented in PyTorch running on an NVIDIA P100 GPU. Since

1704%% asynchronous simulation is not suitable for GPUs, it used a

1705%% synchronous update algorithm instead. While the GPU implementation was

1706%% faster than the FPGA implementation it used almost ten times as much

1707%% power per image. As the focus of their research was classification

1708%% accuracy rather than \textit{simulation accuracy} it is not directly

1709%% comparable to ours. Furthermore, their network was an order of

1710%% magnitude smaller than ours, all their synapses were non-recurrent,

1711%% and their neurons organized into four fully-connected layers.

1712

1713%% \cite{trensch2022} designed a simulator for a Zynq-7000 SoC

1714%% device. They measured its performance on a two-population

1715%% Izhikevich-type neurons containing 800 excitatory and 200 inhibitory

1716%% neurons. Their architecture divides the available FPGA into 16

1717%% identical processing blocks which each supports 64 neurons. Thus, in

1718%% total it supports up to 1,024 neurons per chip. The authors argued

1719%% that if a cluster of their devices were deployed, it could simulate

1720%% the Potjans-Diesmann microcircuit seven times faster than

1721%% realtime. Though this claim was not tested. Unlike our simulator,

1722%% theirs do not account for thalamic input.

1723

1724

1725\section*{Author Contributions}

1726BAL and AP designed the study. BAL implemented the SNN framework and

1727performed the experiments. BAL and AP analyzed the results and

1728co-wrote the paper.

1729

1730\section*{Funding}

1731This work was supported by the Swedish Research Council through the

1732project ‘‘Building Digital Brains’’ (grant reference 2021-04579).

1733

1734\section*{Acknowledgments}

1735TODO

1736

1737

1738\section*{Supplemental Data}

1739 \href{http://home.frontiersin.org/about/author-guidelines#SupplementaryMaterial}{Supplementary Material} should be uploaded separately on submission, if there are Supplementary Figures, please include the caption in the same file as the figure. LaTeX Supplementary Material templates can be found in the Frontiers LaTeX folder.

1740

1741\section*{Data Availability Statement}

1742

1743All source code will eventually be available under a permissible Open

1744Source license.

1745

1746

1747\bibliographystyle{Frontiers-Harvard}

1748\bibliography{frontiers}

1749

1750\end{document}