1__kernel void mul_sd(
2 __global float *A,
3 __global float *B,
4 __global float *C,
5 int N) {
6 for (uint i = 0; i < N; i++)
7 C[i] = A[i] * B[i];
8}
9__kernel void mul_nd(
10 __global float *A,
11 __global float *B,
12 __global float *C,
13 int N) {
14 uint i = get_global_id(0);
15 if (i < N)
16 C[i] = A[i] * B[i];
17}\end{lstlisting}
18\caption{Single and multiple work item OpenCL kernels for element-wise vector product}
19\label{lst:kernels}
20\end{figure}
21
22\noindent In 2009 the Kronos Group published the first version of the
23OpenCL (Open Computing Language) standard \citep{munshi2009}. The goal
24was to replace all vendor-specific languages and application
25programming interfaces (APIs) with a common and portable standard for
26writing high-performance code across all kinds of
27devices. An OpenCL application (unlike one written in a competing
28technology such as CUDA) can run \textit{unmodified} on any device
29with a compliant OpenCL implementation. Today OpenCL now runs on
30millions of CPUs, GPUs, DSPs, and other computing devices.
31
32The OpenCL standard consists of three parts; a specification for a
33programming language, a host API, and a device API. The host API runs
34on the user’s main computing device (e.g., PC) and controls their
35accelerator (e.g., GPU), whose functionality is accesible through the
36device API. Thus, OpenCL prescribes both how the programmer
37should program the accelerator and how to manage it from an external
38device.
39
40The OpenCL language closely resembles C. It supports functions, loops,
41multiple variable scopes, aggregate data types, and many other
42programming constructs familiar to C programmers. A big difference
43from C is that OpenCL comes with three explicit pointer spaces;
44global, local, and private. These grant the programmer fine-grained
45control over data storage. What constitutes global, local, and
46private memory is device-specific. Generally the largest and
47slowest memory space is global, while the smallest and fastest is
48private. Global memory may be sized in gigabytes, while private memory
49may only be a few kilobytes. OpenCL has native support for SIMD
50types to make it easier to write algorithms for parallel
51hardware. Unlike C, OpenCL specifies the bit width of most builtin
52types -- an \verb!int! is always 32 bits and a \verb!long! 64 bits.
53
54OpenCL was designed with massively parallel architectures in mind
55(e.g., GPUs) and has builtin support for concurrency in the form of
56\textit{work items}. A work item is a small indivisible unit of work
57with, ideally, no dependencies to other work items. The OpenCL runtime
58can schedule independent work items to maximally exploit the
59hardware’s potential. Consider the kernels in listing
60\ref{lst:kernels} for computing the element-wise vector
61product. Launching the first kernel -- \verb!mul_sd! -- causes OpenCL
62to instantiate one kernel on one core which runs $N$ iterations of the
63for loop. Launching the second kernel, however, causes OpenCL to
64instantiate up to $N$ kernels, each of which runs on an available core
65and computes one iteration of the for loop. Since the second kernel
66runs more computations in parallel, it likely runs much
67faster. Sometimes the algorithm’s work cannot feasibly be separated
68into completely independent work items. For such situations, work
69items that must be synchronized can be organized into \textit{work
70 groups} which share local memory. However, synchronization is
71impossible between work groups.
72
73\subsection{OpenCL on Intel FPGAs}
74\label{sec:ocl-fpga}
75\begin{figure}
76 \begin{lstlisting}[style=opencl]
77for (uint i = 0; i < N; i++) {
78 B[i] = h(g(f(A[i])));
79}\end{lstlisting}
80 \caption{A loop-nest that could benefit from pipeline
81 parallelism. The functions \texttt{f}, \texttt{g}, and \texttt{h} are
82 assumed to be short and inlineable.}
83 \label{lst:para}
84\end{figure}
85\begin{figure}
86 \begin{lstlisting}[style=opencl]
87channel uint jobs
88 __attribute__ ((depth(512)));
89__kernel void consumer() {
90 while (true) {
91 uint job =
92 read_channel_intel(jobs);
93 if (!msg)
94 break;
95 process_job(job);
96 }
97}
98__kernel void producer(uint N)
99 for (uint i = 0; i < N; i++) {
100 uint job = create_job(i);
101 send_channel_intel(job);
102 }
103 send_channel_intel(0);
104}\end{lstlisting}
105 \caption{Producer-consumer kernels communicating via a channel}
106 \label{lst:chans}
107\end{figure}
108
109\noindent As FPGAs work very differently from processor-based
110hardware, Intel’s OpenCL implementation for FPGAs differs in important
111ways from other OpenCL implementations. One major difference concerns
112parallelism. CPUs and GPUs have multiple cores to run multiple
113computations in parallel. Thus, the workload of an OpenCL program
114organized into multiple work items can be mapped onto multiple
115cores. FPGAs do not have any cores in the usual sense and it is
116difficult for Intel’s OpenCL FPGA compiler to synthesize code
117structured around work items into efficient FPGA bitstreams. Instead,
118Intel recommends designers to structure OpenCL code as single-work
119item kernels, and to derive most of the parallelism from executing
120multiple loop iterations concurrently; a type of concurrency known as
121‘‘pipeline parallelism’’. For example, assume the three functions
122\verb!f!, \verb!g!, and \verb!h! in listing \ref{lst:para} are short,
123side-effect free, and represent computations that can be performed in
124less than one clock cycle. The FPGA compiler can create a three-stage
125pipeline for the loop, where combinational circuits for \verb!f!,
126\verb!g!, and \verb!h! constitute the pipelines three stages. This
127pipeline can process three iterations of the loop in parallel; while
128the \verb!h!-circuit processes data for the first, the
129\verb!g!-circuit processes data for the second, and the
130\verb!f!-circuit for the third iteration, and so on. While the latency
131of every iteration is three cycles, the throughput (initiation
132interval) is only one cycle and the latency for the loop as a whole is
133$N + 2$ since it takes 2 cycles to fill the pipeline. Though, this
134assumes that the latency of every function is fixed and
135predictable. If one function executes operations with variable
136latencies, such as memory accesses, the pipeline may need to
137stall. Moreover, pipeline parallelism cannot increase the throughput
138of the loop beyond one cycle -- for that one has to use other
139techniques.
140
141The FPGA compiler supports many compiler directives
142(‘‘pragmas’’) that can help it better optimize difficult
143loops or loops containing invariants it can not
144prove on its own. Two important directives are \verb!#pragma ivdep!
145and \verb!#pragma unroll N!. The first tells the compiler that the
146loop contains no loop-carried dependencies and therefore it
147can reorder the loop iterations as it pleases. The second tells it to
148duplicate the loop body $N$ times. This means that the resulting
149circuit will use $N$ times as many gates, but, potentially, run $N$
150times faster. Other compilers ignore these directives.
151
152The FPGA compiler extends OpenCL with syntax for declaring channels
153for inter-kernel communication. Channels resemble the pipes feature in
154OpenCL 2.0 which the FPGA compiler does not support. Channels are
155implemented in on-chip memory as first in first out (FIFO) buffers of
156the desired depths and are very fast. \Cref{lst:chans} shows a simple
157producer-consumer example, where one kernel calls \verb!send_channel_intel!
158to put jobs on the FIFO and the other calls \verb!read_channel_intel!
159to remove them. Both functions block if the FIFO is full or empty.
160
161%% \Cref{fig:opencl} shows an entry point (externally callable function)
162%% to an OpenCL program -- a \textit{kernel} -- that computes the
163%% element-wise product of two vectors. The keyword \verb!__kernel!
164%% denotes that the function \verb!mul_ab! is an entry point and
165%% \verb!__global! that the associated pointers reside in the global
166%% memory space. What constitutes global memory is device dependent, but
167%% is generally the device’s largest and slowest memory. It can be
168%% contrasted with the local memory (denoted by the keyword
169%% \verb!__local!) which is both considerably smaller and faster to
170%% access than the global memory.
171
172
173%% The kernel’s \verb!#pragma! lines are compiler-specific directives
174%% that hints to the compiler how to synthesize the code. This is useful
175%% if the code contains invariants the compiler cannot prove or if the
176%% code is difficult for the compiler to optimize. The first directive
177%% \verb!#pragma ivdep! tells the compiler that there is no loop-carried
178%% dependencies in the loop nest, allowing the compiler to reorder the
179%% loop iterations or execute them in parallel. The second directive
180%% \verb!#pragma unroll 4! instructs the compiler to unroll the loop four
181%% times. This means that the resulting circuit will use four times as
182%% much hardware, but will potentially run four times as fast. Note that
183%% these directives have no effect when the code is compiled with a
184%% compiler other than with Intel’s FPGA SDK for OpenCL compiler.
185
186%% As OpenCL is tailored to take advantage of massively parallel
187%% architectures it supports dividing work loads into work groups and
188%% work items.
189
190%% \subsection{Spiking Neural Network}
191
192%% The spiking neural network (SNN) we simulate is the 1 $\mathrm{mm}^3$
193%% model of the sensory cortex developed by \cite{potjans2014cell}. It
194%% consists of 77,169 neurons grouped into four layers, each of which is
195%% subgrouped into one excitatory and one inhibitory population. About
196%% 300 million synapses connect the neurons randomly with
197%% population-specific probability densities. The neurons’ membrane
198%% voltage, $V$, is updated according to the leaky integrate-and-fire
199%% (LIF) differential equation;
200%% \begin{equation}
201%% \tau_mU’(t) = -U(t) + R_mI_{\in}(t).
202%% \end{equation}
203%% where $\tau_m$ and $R_m$ represents the time and resistance of the
204%% neuron’s membrane, and $V_{rest}$ the neuron’s resting voltage. These
205%% are all the same for all neurons in the model.
206
207%% When the membrane’s voltage crosses the threshold $V_{thresh}$ the
208%% neuron emits a spike
209
210
211\section{SNN Simulation}
212\label{sec:ssn-sim}
213
214\begin{figure}
215 \begin{lstlisting}[style=pseudocode]
216for t in range(n_tics):
217 for i in range(n_neurons):
218 N[i] = update_neuron(N[i])
219 if spikes(N[i]):
220 Q = enqueue(Q, i, t)
221 for i in range(n_neurons):
222 c = collect(Q, i, t)
223 N[i] = update_psc(Q, c)
224 Q = dequeue(Q, i, t)\end{lstlisting}
225 \caption{Synchronous SNN simulation}
226 \label{lst:basic}
227\end{figure}
228
229\noindent Having reviewed SNNs, the Potjans-Diesmann microcircuit and
230our implementation tools, we now discuss methods for simulating SNNs
231efficiently. We explore SNN simulation in general before introducing
232the ideas and algorithms we use to optimize our simulators. The end
233result is a taxonomy of twelve simulator families, grouped by their
234algorithms and implementation styles. We illustrate most of our points
235with Pythonesque pseudo-code, extended with two keywords;
236\verb!parfor! and \verb!atomic!. The former for loops whose iterations
237are independent and therefore can be executed in parallel. The latter
238for operations that must execute as one indivisible unit. Text in
239\verb!{brackets}! explains what the pseudo-code should do at that
240point.
241
242Broadly speaking, SNN simulation can be categorized on whether it is
243synchronous (time-step**) or asynchronous
244(event-driven). Synchronous simulation updates the state of every
245simulated object at every tick of a clock, regardless of whether it is
246necessary or not \citep{brette2007}. Asynchronous simulation only
247updates simulated objects whey they receive external stimuli, i.e.,
248events. In an asynchronous SNN simulation, spike emission and
249reception constitute the events, since the state of a neuron at a time
250between two events can be calculated easily (and generally is
251unimportant). Hybrid strategies, with asynchronous updates for some
252parts of the SNN and synchronous updates for other parts, are
253possible.
254
255For SNN simulation, asynchronicity offers precision advantages. The
256neuron’s membrane potential only has to be recomputed when it receives
257spikes. If this happens only rarely the simulator can afford to use
258more sophisticated methods than (repeated) forward Euler to solve the
259LIF equation (\cref{eqn:lif}). Also, spikes can be sent and received
260at any time and do not have to be confined to a discrete grid. For
261example, a synchronous simulator with a 0.1 ms time steps may not be
262able to represent spikes sent at times that are not multiples of 0.1
263ms. Asynchronicity may also have scalability advantages as the
264neurons’ states do not have to be synchronized to a global
265clock. However, asynchronicity entails irregular computation and
266irregular memory accesses -- traits that are extremely undesirable on
267modern hardware. Furthermore, dense SNNs have many more synapses than
268neurons which results in cascading effects. One spiking neuron ‘‘wakes
269up’’ thousands of its neighbours, causing them to spike and in turn
270wake up thousands of their neighbours. On the whole, the event
271processing overhead may dominate over whatever computational savings
272not being bound by a global clock brings. For example,
273\cite{pimpini2022} presents a sophisticated asynchronous CPU-based SNN
274simulator that supports speculative execution so that future neuron
275states can be computed in advance and then rolled back if received
276spikes (‘‘stragglers’’) invalidates their predicted state. While the
277authors measured accuracy improvements, the performance was
278lackluster. For these reasons, most high-performance SNN simulators,
279including ours, are synchronous.
280
281Synchronous simulation splits the simulation task into two phases; one
282for updating neurons and one for transferring spikes. Listing
283\ref{lst:basic} shows the basic algorithm. It uses two data
284structures; an indexable data structure, \verb!N!, to stores the state
285of all neurons and a queue, \verb!Q!, to keeps track of spikes in
286flight. For every neuron the algorithm calls \verb!update_neuron! to
287update its membrane potential, presynaptic current, and refractoriness
288in accordance with \cref{eqn:mem,eqn:psc,eqn:ref}. If \verb!spikes!
289indicates that the neuron spikes, it enqueues the neuron’s index $i$
290and the current time step $t$ in \verb!Q!. In the next phase (line 6
291to 9) the algorithm calls \verb!collect! on the queue to aggregate
292current destined to the $i$:th neuron at time step $t$. The call to
293\verb!update_psc! adds the aggregated current to the neuron’s
294presynaptic current. Finally, the algorithm removes the current it
295just handled from the queue. In this listing we include the
296\verb!for t in range(n_tics)! loop which shows that the algorithm
297repeats the two phases \verb!n_tics! times. However, for brevity’s
298sake, we omit this outer loop in the following listings.
299
300The work required for updating the neurons’ state for one tick is in
301the order of $O(N)$, where $N$ is the number of neurons --
302i.e. linear. However, for transferring spikes it is $O(f\Delta
303tpN^2)$, where $p$ is the probability that two randomly chosen neurons
304are connected by one or more synapses, $f$ the average spiking rate,
305and $\Delta t$ the tick duration. So the transfer time is proportional
306to the network’s density and quadratic in $N$ -- i.e. for large $N$ it
307dominates. Furthermore, updating the membranes and presynapses
308require a handful of multiplications per neuron -- cheap on modern
309hardware -- whereas spike transfer requires expensive reads and writes
310to and from non-contiguous memory. Hence, we will focus on the
311transfer phase which is where synchronous simulators spend most of
312their time in the remainder of this section.
313
314\subsection{Pushing and Pulling}
315\label{sec:push-pull}
316\begin{figure}
317 \begin{lstlisting}[style=pseudocode]
318parfor i in range(n_neurons):
319 spikes = False
320 if R[i] == 0:
321 x = p22*U[i] + p21*I[i]
322 spikes = x >= U_thresh
323 if spikes:
324 x = 0
325 R[i] = t_ref_tics
326 U[i] = x
327 else:
328 U[i] = 0
329 R[i] -= 1
330 if spikes:
331 parfor j, d, w in syns_from(i):
332 atomic W[t + d, j] += w
333 I[i] = p11*I[i] + T[t, i]*wpsn
334 I[i] += W[t, i]\end{lstlisting}
335 \caption{One time step of push-based spike transfer}
336 \label{lst:push}
337\end{figure}
338
339\begin{figure}
340 \begin{lstlisting}[style=pseudocode]
341parfor i in range(n_neurons):
342 A[t, i] = False
343 if R[i] == 0:
344 x = p22*U[i] + p21*I[i]
345 if x >= U_thresh:
346 A[t, i] = True
347 x = 0
348 R[i] = t_ref_tics
349 else:
350 x = 0
351 R[i] -= 1
352 U[i] = x
353 s = 0
354 for j, d, w in syns_to(i):
355 if A[t - d, j]:
356 s += w
357 I[i] = p11*I[i] + T[t, i]*wpsn + s\end{lstlisting}
358 \caption{One time step of pull-based spike transmission}
359 \label{lst:pull}
360\end{figure}
361
362\noindent Researchers have identified ‘‘pushing’’ and ‘‘pulling’’ as
363two general strategies for designing algorithms for graph problems
364(\cite{besta2017}; \cite{grossman2018}; \cite{ahangari2023}). A
365push strategy transfers signals \textit{from a node to its neighbours}
366by writing to the neighbours incoming signal buffers. In contrast, a
367pull strategy transfers signals \textit{to a node from its neighbours}
368by swee** through the node’s neighbours and checking whether they
369have any signals to be delivered to the node. I.e., the receiver node
370has to ‘‘go and ask’’ its neighbours whether they have signals for
371them. SNN simulation is a graph problem involving node-to-node
372transfer of signals and it too can be characterized in terms of
373pushing and pulling. Listing \ref{lst:push} and \ref{lst:pull}
374illustrate the two strategies. In both listings, the values of the
375scalars \verb!p11!, \verb!p21!, and \verb!p22! come from
376\cref{eqn:p11p22,eqn:p21}. The arrays \verb!U!, \verb!I!, and \verb!R!
377contain the neurons’ membrane potentials, presynaptic potentials, and
378refractory counters. When the neuron’s membrane potential exceeds
379\verb!U_tresh! the neuron spikes and becomes refractory for
380\verb!t_ref_tics! tics. The expression \verb!T[t, i]! denote the
381number of thalamic spikes received by the $i$:th neuron at the $t$:th
382time step.
383
384The strategies differ in how they transfer spikes. The push strategy
385transfers them when \verb!spikes! indicates that a neuron spikes. It
386calls \verb!syns_from! to fetch an iterator of all synapses
387\textit{originating} from the $i$:th neuron. The three-tuples $(j, d,
388w)$ it retrieves represent the synapses; $j$ the index of the
389destination neuron, $d$ the delay in time steps, and $w$ the
390current. For every three-tuple it writes to an element of \verb!W!, a
391two-dimensional array buffering current to be delivered. The element
392\verb!W[t, i]! is the amount of presynaptic current the $i$:th neuron
393receives at the $t$:th time step. The \verb!syns_to! call in listing
394\ref{lst:pull} works exactly like the \verb!syns_from! call, except it
395returns all synapses \textit{terminating} at the $i$:th neuron and $j$
396in the three-tuples $(j, d, w)$ denotes the \textit{originating}
397neuron. Array elements \verb!A[t, i]! indicate whether the $i$:th
398neuron spikes at time $t$ or not. The size of the longest synapse
399delay is in practice bounded and small so both \verb!A! and \verb!W!
400can be implemented as statically sized wrap-around arrays -- a
401technique we cover in \cref{sec:sizing}. Note that every neuron can be
402handled independently so we use the \verb!parfor! construct for both
403algorithms’ outer loops.
404
405Pushing of synaptic current to their neighbours happens on line 16 and
40617 of the push strategy’s listing. Neurons spike infreqently, but when
407they do, the algorithm must retrieve all their synapses and write
408their current to the array \verb!W!. This part of the algorithm is
409performance-critical. First, the algorithm accesses all the neuron’s
410synapses which can be expensive, even if they are stored in contiguous
411memory. Second, the algorithm gathers and scatters data from and to
412uncorrelated indices of \verb!W!. These are costly operations since
413the accesses to \verb!W! cannot be coalesced or cached.\footnote{The
414memory addresses are too far apart for any coalescing or caching
415scheme to be efficient.} To make matters worse, at one time step
416multiple neurons can write to the same indices of \verb!W!. I.e.,
417\verb!W[t + d, j] += w! has to be executed atomically to avoid data
418races. As the model is densely connected, races are not uncommon and
419should be accounted for. Partition-awareness, as suggested by
420\cite{besta2017}, is not an option because of the density and
421randomness of the connections, making most of them remote and not
422local.
423
424\begin{figure}
425 \begin{lstlisting}[style=pseudocode]
426for i in range(n_neurons):
427 {Update neuron state as before}
428 if spiked:
429 Q = enqueue(Q, i)
430for i in contents(Q):
431 parfor j, d, w in syns_from(i):
432 W[t + d, j] += w
433Q = clear(Q)\end{lstlisting}
434\caption{Deferred push-based spike transfer}
435\label{lst:push2}
436\end{figure}
437
438The pull-based algorithm instead has the receiver neurons responsible
439or ‘‘pulling in’’ current. Spiking merely sets an element in \verb!A!
440to true and does not trigger current transfer. On subsequent time
441steps, neurons connected to that neuron checks if it spiked at time
442$t-d$ and, if so, adds its synaptic current. The upside of this
443algorithm is that it is synchronization-free; neurons do not write to
444shared memory. The major drawback is that every neuron at every time
445step must check all its incoming synapses to see from which of them it
446receives current. Additionally, the algorithm reads from scattered
447memory on line 18. The large amount of data it reads probably makes
448it inefficient, unless the number of computational units is
449large and memory reads are substantially cheaper than writes. Neither
450of which is true for our FPGA. Consequently, we focus on push-based
451spike transfer in this work.
452
453Listing \ref{lst:push2} shows a variant of push-based spike transfer
454that first collects spiking neurons and then transfers their synapses
455current in a dedicated phase. Collection can either be done with a
456marking array, at the expensive of wasting memory, or with a queue (as
457in the listing), at the expense of making loop iterations
458dependent. The choice depends on the target platform. Though,
459splitting the update and spike transfer into two phases reduces the
460number of cache conflicts which is advantagenous. Lines
46113 to 15 of the basic push algorithm can flush prefetched and cached
462parts of the \verb!U!, \verb!I!, and \verb!R! arrays.
463
464\subsection{Buffer Sizing and Wrap**}
465\label{sec:sizing}
466
467To reduce the size of the \verb!W! array we use ‘‘wrap-around
468indexing’’. The indices of the rows in \verb!W! written to in one time
469step lies within the interval $t + 1$ to $t + d_\mathrm{max} - 1$,
470where $d_\mathrm{max} - 1$ is the network’s largest synaptic delay and
471is small. Moreover, at time step $t$ rows $0$ to $t - 1$ will not be
472read again so that space can be reused. We do that by setting the
473number of rows in \verb!W! to $d_\mathrm{max}$ and use modular
474arithmetic to index rows. The expressions for accessing \verb!W! on
475line 15 and 17 in listing \ref{lst:push} become
476\verb!W[(t + d) % D_MAX, j]! and \verb!W[t % D_MAX, i]!,
477respectively. We choose a large enough value for $d_\mathrm{max}$ by
478evaluating the cumulative distribution function of the Gaussians we
479sample the slower excitatory synapses delays from, $\mathcal{N}(1.5,
4800.75)$:
481\begin{equation}
482 P(\mathcal{N}(1.5, 0.75) \leq 6.4) \approx 0.999999999968:
483\end{equation}
484This shows that with $\Delta t = 0.1$, 64 rows is more than enough
485since the probability of sampling a synaptic delay longer than 6.4 ms
486is astronomically low. It is also a power of two so we use masking to
487realize the modular arithmetic. With this scheme we also have to clear
488\verb!W[t, i]! after reading it to avoid double reads.
489
490Even with only 64 rows and assuming four-byte floats, the \verb!W!
491array still consumes $64\cdot4\cdot 77169 \approx 20$ megabytes which
492is more than we can fit in on-chip memory. We could store the array in
493off-chip memory -- which is plentiful -- or use half-precision two-byte
494floats instead. Neither solution is satisfactory. As we argued in
495\cref{sec:push-pull}, we need fast reads and writes to uncorrelated
496addresses of \verb!W! which off-chip memory doesn’t give us. We also
497prefer not to lower the numeric the precision as that makes the
498simulation less accurate. A third option is to store all spiking
499neurons in a queue and only activate a subset of their synapses at a
500given time step.
501
502\subsection{Just-In-Time Spike Transfer}
503\begin{figure}
504 \begin{lstlisting}[style=pseudocode]
505parfor i in range(n_neurons):
506 {Update U[i], R[i] as before}
507 spiked[i] = {True if neuron spiked}
508 I[i] = p11*I[i] + T[t, i]*wpsn
509for rt in range(D_MAX):
510 delay = (t - rt) % D_MAX
511 if delay < D_MAX - 1:
512 for n in enqueued_at(Q, rt):
513 syns = syns_from(n, delay)
514 parfor j, d, w in syns:
515 I[j] += w
516 else:
517 Q = dequeue(Q, rt)
518rt = t % D_MAX
519for i in range(n_neurons):
520 if spiked[i]:
521 Q = enqueue(Q, rt, i)\end{lstlisting}
522 \caption{Three-phase just-in-time spike transfer}
523\label{lst:jit}
524\end{figure}
525
526\noindent By activating all the neuron’s synapses at once, the push
527algorithm works harder than necessary. Obviously, all synapses must
528\textit{eventually} be activated, but \textit{right now} only synapses
529with a delay of one time step must be activated since the current they
530transfer will be read at the next time step. This observation leads us
531to a ‘‘lazy’’ just-in-time algorithm which keeps all spiking neurons
532in a queue for a fixed number of time steps. At every time step it
533activates those synapses whose current is read at the next time
534step. Thus, a neuron that spikes at time $t$ will have its
535one-tick-delay synapses activated at time $t + 1$, its two-tick-delay
536synapses at time $t + 2$, and so on. Listing \ref{lst:jit} sketches a
537three-phase algorithm built on this idea. The first phase (line 1 to
5384) updates \verb!U! and \verb!R! as before and marks spiking neurons
539in \verb!spiked!. The third phase (lines 14 to 17) enqueues the marked
540neurons with a ‘‘relative timestamp’’, \verb!rt!. The second phase
541(lines 5 to 13) scans the queue and calls \verb!enqueued_at! to fetch
542previously enqueued neurons. The \verb!syns_from! function works as
543before, but now has a second parameter to select synapses with the
544given delay. Suppose that the simulator simulates time step 100 and
545that \verb!rt! is 10 in one iteration of the loop. Since $26 \equiv
546(100 - 10) \mod d_{\mathrm{max}}$ ($d_{\mathrm{max}}=64$) all synapses
547with 26 time steps of delay of queued neurons with a relative
548timestamp of 10 will be activated. These neurons ought to have been
549stored at time step $t = 100 - 26 = 76$ which is indeed the case since
550$10 \equiv 74 \mod d_\mathrm{max}$. After the neurons have stayed in
551the queue for \verb!D_MAX - 1! time steps, the \verb!dequeue! call
552evicts them. Though with a sensible FIFO implementation eviction is a
553no-op so we omit it in future pseudo-code.
554
555How much memory do the just-in-time algorithm save? Through
556experimentation we found that the average number of spiking neurons
557per time step is 23, meaning that the expected number of
558elements in \verb!Q! is $23\cdot d_{\mathrm{max}}=1472$. We generously
559round it up to 4096 and, as neuron indices take four-bytes to store,
560the total size of the queue is 16 kilobytes, which comfortably fits in
561on-chip memory.\footnote{As in \cref{sec:sizing}, we could use the
562cumulative distribution function to show that 4096 elements is enough
563to make the risk of overflow virtually zero.}
564
565To improve the just-in-time algorithm we use ‘‘lanes’’. Essentially,
566we duplicate the \verb!W! array so that synapses of multiple spiking
567neurons can be activated simultaneously. The algorithm writes the
568synaptic current of the first neuron it handles in the first lane, of
569the second neuron in the second lane, and so on. Synaptic current of
570neurons in different lanes does not interfere. The update phase has to
571be adjusted accordingly and must sum the incoming current from all
572lanes. As each lane consumes about 320 kb of memory we can fit at most
57316 lanes.
574
575\subsection{Horizon-Based Spike Transfer}
576\begin{figure}
577 \begin{lstlisting}[style=pseudocode]
578parfor i in range(n_neurons):
579 I[i] += W[t % H, i]
580 W[t % H, i] = 0
581 {update U[i] and R[i] as before}
582 I[i] = p11*I[i] + wspn*T[t, i]
583for i in range(D_MAX / H):
584 rt = (t - H*i - 1) % D_MAX
585 d_from = H*i + 1
586 d_to = d_from + H
587 for n in enqueued_at(Q, rt):
588 syns = syns_from(
589 n, d_from, d_to)
590 parfor j, d, w in syns:
591 W[(d + t) % H, j] += w
592rt = t % D_MAX
593for i in range(n_neurons):
594 if spiked[i]:
595 Q = enqueue(Q, rt, i)\end{lstlisting}
596 \caption{Three-phase spike transfer with configurable horizon}
597 \label{lst:horizon2}
598\end{figure}
599
600\noindent The just-in-time algorithm requires a fair amount of
601bookkee**. It activates the spiking neuron’s synapses
602$d_{\mathrm{max}}$ times instead of just once, and the loop on line 10
603and 11 iterates many fewer times than the corresponding loop on line
60414 and 15 of the basic push algorithm. This loop is an important
605source of parallelism and running it many times with fewer iterations
606is much worse for performance than running it fewer times with many
607iterations. Our solution is to reintroduce a smaller version of the
608spike buffer whose number of rows, $h$ is a factor of $d_\mathrm{max}$
609so that when a neuron spikes we write to the buffer $d_\mathrm{max} /
610h$ times. Listing \ref{lst:horizon2} shows the concept. The
611\verb!syns_from! function now retrieves synapses of the neuron whose
612delay is within the range $d_\mathrm{from}$ to $d_\mathrm{to} -
6131$. Suppose $d_\mathrm{max}=64$, $t=100$ and $h=16$. The loop on lines
6145 to 13 iterates four times, the relative timestamps assumes the
615values 35, 19, 3, and 51, and the half-open intervals the values [1,
616 17), [17, 33), [33, 49), and [49, 65). Thus, the inner loop on line
617 10 to 14 activates all synapses with the given delays of the
618 neurons stored at the given relative timestamps. With this
619 scheme we trade-off on-chip memory for better concurrency.
620
621Note that we add the current from the spike buffer to the presynaptic
622potential on line 2, before we update the membrane’s potential and
623we add one to \verb!d_from! on line 7. This is necessary since the
624algorithm transfers spikes one time step later than the basic
625push algorithm.
626
627\subsection{Storing Synapses}
628\label{sec:syn-store}
629\begin{figure}
630 \begin{lstlisting}[style=pseudocode]
631def syns_from(n, d_from, d_to):
632 start = X[n, d_from]
633 end = X[n, d_to]
634 for i in range(start, end):
635 yield S[i]\end{lstlisting}
636\caption{Indexed access to synapse data}
637\label{lst:index}
638\end{figure}
639
640\noindent Previous sections’ pseudo-codes imply that it is very
641important that synapses can be queried by their sender neuron
642quickly. In particular, synapses should be stored so that one
643\verb!syns_from! call only accesses memory in one contiguous chunk. We
644fullfill these goals by storing the synapses as an array sorted on
645sender neuron, receiver neuron, and delay that we query with a
646prebuilt index keyed on sender neuron and delay. This means that
647finding all synapses for a particular neuron or neuron-delay
648combination requires only two index lookups; one for the first synapse
649and one for the last synapse. Moreover, as the index implicitly stores
650the sender neuron, we only need eight bytes to represent a synapse;
651four for the single-precision weight (32 bits), three for the
652destination neuron’s id (17 bits), and one for the synaptic delay (6
653bits).\footnote{With just-in-time transfer the delay is also stored
654implicitly.} Listing \ref{lst:index} shows how \verb!syns_from!
655performs index look-ups. The two-dimensional array \verb!X!
656represents the index and \verb!S! the synapse array so that
657\verb!X[n, d_from]! contains the index in \verb!S! where the first
658synapse of neuron \verb!n! with delay \verb!d_from! is
659stored. Ideally, the synapses should also be prefetched.
660
661\begin{figure}
662 \begin{lstlisting}[style=pseudocode]
663parfor c in range(N_CLS):
664 for i in contents(Q):
665 syns = syns_from(i, c)
666 for j, d, w in syns:
667 W[t + d, j] += w
668Q = clear(Q)\end{lstlisting}
669\caption{Spike transfer with partitioned synapses}
670\label{lst:parts}
671\end{figure}
672
673\subsection{Disjoint Synapses}
674\label{sec:disjoint}
675
676A headache for push-based spike transfer is the data race caused by
677multiple synapses delivering current received by the same neuron at
678the same time step. This is why we can’t use \verb!parfor! on line 5
679of listing \ref{lst:push2}, for example. We cannot completely solve
680this problem, but we can alleviate it by partitioning the synapses
681into disjoint \textit{classes}, so that synapses of different classes
682never trigger writes to the same memory addresses at the same
683time. Then every class can be handled by a separate thread. The
684pseudocode in listing \ref{lst:parts} modifies the transfer phase of
685the deferred push algorithm from \ref{lst:push2} to exploit of this
686idea.\footnote{We can of course us the same technique to improve all
687other algorithms we have discussed.} The constant \verb!N_CLS!
688denotes the number of synapse classes and the extra argument to
689\verb!syns_from! which synapse class to query. Many ways of
690partitioning the synapses into disjoint classes are possible. For
691example, by delay so that one thread writes one-time step synapses,
692the next thread two-time steps synapses, and so on. Another by
693destination neuron so that one thread handles synapses going to
694neurons whose index is between 0 and 199, another those
695between 200 and 399, and so on.
696
697\begin{figure}
698 \begin{lstlisting}[style=pseudocode]
699o0 = X[i][0]
700o1 = X[i][D_MAX]
701parfor c in range(N_CLS):
702 for o in range(o0 + c, o1, N_CLS):
703 j, d, w = S[o]
704 W[(t + d) % D_MAX, j] += w\end{lstlisting}
705 \caption{Spike transfer over interleaved synaptic storage.}
706 \label{lst:parts2}
707\end{figure}
708
709We choose to partition by the destination neuron’s congruence
710class. The method has low overhead and evenly distributes the synapses
711over the classes since the least significant bits of the neuron index
712is almost random. To retain the contiguous storage we interleave the
713synapses. That is, if a neuron’s synapses are found at indexes $o_0$
714to $o_1 - 1$, then all synapses of class $c$ are stored at indices
715$o_0 + n_ci + c$, where $n_c$ is the number of classes and $i$ is a
716non-negative integer. Interleaving synergizes with banked memory
717common on many GPUs. On our FPGA, it means that we can have
718conflict-free dedicated memory ports for every synapse class. The
719method causes some memory waste however. For example, if there are
720four classes and all destination neuron indices of all synapses of
721some neuron happen to be congruent with $2 \mod 4$, then all indices
722other than $o_0 + 4i + 2$ will be vacant. I.e., 75% of the space will
723go to waste. In general, the memory consumption for storing a neuron’s
724synapses grows from $n_ss$, where $n_s$ is its number of synapses and
725$s$ the synapse size to $n_cls$, where $l$ is the number of synapses
726in the neuron’s largest class. In practice, the memory waste
727is manageable; in the order of 5-30% depending on the number of
728classes and horizon. The more the classes and the shorter the horizon
729the more uneven the classes become and the more waste. We do not mark
730vacancies and instead fill them with synapses carrying no current and
731terminating at idempotent neurons. This way, the code in listing
732\ref{lst:parts2} does not need to check whether the index is vacant.
733
734\subsection{More on Parallelism}
735\label{sec:more-para}
736\begin{figure*}
737 \begin{subfigure}{0.48\textwidth}
738 \begin{lstlisting}[style=pseudocode]
739parfor i in range(n_neurons):
740 {Update neuron state as before}
741 if spikes:
742 write(to_transfer, i)
743write(to_update, DONE)
744read(to_update)\end{lstlisting}
745 \caption{Update kernel}
746 \end{subfigure}
747 \begin{subfigure}{0.48\textwidth}
748 \begin{lstlisting}[style=pseudocode]
749while True:
750 i = read(to_transfer)
751 if i == DONE:
752 break
753 parfor j, d, w in syns_from(i):
754 W[t + d, j] += w
755 write(to_update, True)\end{lstlisting}
756 \caption{Transfer kernel}
757 \end{subfigure}
758 \caption{Basic spike transfer using two kernels}
759 \label{lst:multi}
760\end{figure*}
761
762\noindent Our algorithms’ source of parallelism is the \verb!parfor!
763keyword. Iterations of such loops are independent and can be executed
764concurrently in duplicated hardware. The replication is realized
765differently on different targets. On CPUs we use SIMD and on GPUs the
766single-instruction multiple threads (SIMT) execution model
767‘‘automatically’’ parallelizes the computation. With Intel’s OpenCL
768SDK we use the \verb!#pragma unroll N! and \verb!#pragma ivdep!
769compiler directives to instruct the compiler to replicate loop
770hardware. Essentially, this ‘‘widens’’ data paths, allowing more data
771to be processed in parallel. But FPGAs can also run more data paths in
772parallel, akin to how multiple CPU cores can run multiple threads. We
773implement this parallelism by dividing the algorithms over multiple
774concurrent communicating kernels.
775
776Listing \ref{lst:multi} restructures the deferred push-based spike
777transfer algorithm in \ref{lst:push2} as two kernels. It uses two
778blocking FIFOs, \verb!to_transfer! and \verb!to_update!, that the
779kernels can \verb!read! and \verb!write! to. When a neuron spikes, the
780update kernel sends its index to the spiking kernel which activates
781that neuron’s synapses. Hence, the neuron update and spike transfer
782phases run in parallel. When the update kernel has updated all neurons
783it sends a \verb!DONE! message and waits for a reply from the spike
784transfer kernel. When it receives one, it knows that the spike
785transfer kernel has transferred all spikes and it proceeds to the next
786time step. As inter-kernel communication is not well-supported in
787OpenCL, we implement it using the low-latency Intel-specific channel
788extension.
789
790\begin{figure*}
791 \begin{subfigure}{0.48\textwidth}
792 \begin{lstlisting}[style=pseudocode]
793for i in range(n_neurons):
794 I[i] += read(to_update)
795A = []
796for i in range(n_neurons):
797 {Update neuron state as before}
798 if spiked[i]:
799 A.append(i)
800write(to_transfer, A)\end{lstlisting}
801 \caption{Update kernel}
802 \end{subfigure}
803 \begin{subfigure}{0.48\textwidth}
804 \begin{lstlisting}[style=pseudocode]
805for i in range(n_neurons):
806 write(to_update, W[t % H, i])
807 W[t % H, i] = 0
808for i in range(D_MAX / H):
809 rt = (t - H*i - 1) % D_MAX
810 for n in enqueued_at(Q, rt):
811 syns = syns_from(n, rt + H)
812 parfor j, d, w in syns:
813 W[(d + t) % H, j] += w
814for n in read(to_transfer):
815 Q = enqueue(Q, t % D_MAX, n)\end{lstlisting}
816 \caption{Transfer kernel}
817 \end{subfigure}
818 \caption{Horizon-based spike transfer using two kernels}
819 \label{lst:multi-horiz}
820\end{figure*}
821
822Listing \ref{lst:multi-horiz} similarily restructures the
823horizon-based algorithm as two kernels. Since the \verb!W! array is
824stored in local memory, the multi-kernel approach requires an explicit
825synchronization step (lines 1 and 2, and 1 to 3) for sending
826presynaptic current from the transfer kernel to the update kernel over
827the \verb!to_update! channel. After synchronization, the transfer
828kernel can freely write synaptic current from queued neurons to
829\verb!W!. While updating neurons, the update kernel keeps track of
830which of them spiked and sends them to the transfer kernel. The
831transfer kernel adds them to its queue \verb!Q!.
832
833\subsection{Implementations}
834\label{sec:impl}
835\begin{figure}
836 \center
837 \includegraphics[width=\linewidth]{tree}
838 \caption{Taxonomy of our simulator implementations}
839 \label{fig:taxonomy}
840\end{figure}
841\begin{figure*}
842 \center
843 \includegraphics[width=\linewidth]{impl_gmem}
844 \caption{Schematic of the monolithic gmem implementations}
845 \label{fig:gmem}
846\end{figure*}
847\begin{figure*}
848 \center
849 \includegraphics[width=\linewidth]{impl_horizon}
850 \caption{Schematic of the multi-kernel horiz implementations}
851 \label{fig:gmem}
852\end{figure*}
853
854Given the algorithms and data structures presented, we have created
855twelve simulator families arranged into a taxonomy in
856\cref{fig:taxonomy}. The taxonomy’s second level groups the simulators
857on their spike transfer algorithm. It can either be the basic
858push-algorithm from listing \ref{lst:push} (gmem), the just-in-time
859algorithm from listing \ref{lst:jit} (jit), or the horizon-based
860algorithm from listing \ref{lst:horizon2} (horiz). These three
861algorithms differ in how eagerly they activate synapses. The gmem
862algorithm is maximally eager and activates all the spiking neuron’s
863synapses at once. Thus, it has to use global off-chip memory to transfer
864spikes, hence its name. The jit algorithm is maximally lazy and only
865activates synapses whose current should arrive at the next time
866step. The horizon algorithm is a mix of the two and activates as many
867synapses as its horizon allows. The taxonomy’s third level shows
868whether the simulator uses multiple communicating kernels (multi) or a
869single, monolithic kernel (mono). The last level specifies whether the
870simulator uses double (d) or single precision (s) floating point
871values for neuron state. This applies to the membrane potentials,
872presynaptic currents, and their related coefficients, but not to the
873spikes and spike transfer buffers which always are in
874single-precision. In the following, we use a notation based on their
875taxonomic grou** to refer to the simulators. For example,
876‘‘horiz/multi/s’’ refers to the horizon-based multi-kernel
877single-precision simulators.
878
879All simulators also have parameters for tuning their performance
880characteristics. Some are unique to certain families while others are
881shared. The \textit{update width} is the most
882fundamental one and controls how many neurons the simulator updates at
883once. We implement it by unrolling loops such as the loop on line 1 to
8844 of listing \ref{lst:jit}. More loop-unrolling is not always better
885as it causes the compiler to duplicate the hardware used to synthesize
886those loops. The gmem/mono and gmem/multi simulators have a
887\textit{synapse unroll} (SU) parameter which controls how many times
888their spike transfer loops are unrolled (line 6 and 7 of listing
889\ref{lst:push2} and lines 5 and 6 of listing \ref{lst:multi}). More
890unrolling allows more synapses to be fetched from memory
891simultaneously. The \textit{synapse classes} (SC) parameter determines
892the number of congruence classes (see \cref{sec:disjoint}). More
893classes allows more synapses to be activated in parallel, but also
894increases memory usage. The horiz simulators have a
895\textit{horizon length} (H) parameter which determines how many time
896steps worth of synapses the simulator activates at once. The jit
897simulators have a \textit{lane count} (LC) parameter that controls how
898many lanes they uses to transfer current. For convenience and
899efficiency, all parameters are positive powers of two.
900
901\section{Results}
902\label{sec:results}
903
904We evaluate our simulators on three axes -- correctness, speed, and
905energy consumption -- using the same experimental regimen for each. We
906synthesize every simulator for our FPGA with Intel FPGA SDK for OpenCL
907version 21.2 and use it to simulate ten seconds of biological time
908with a time step of 0.1 ms (i.e. 100 000 ticks) on ten different
909random microcircuit instances. From these 100 runs we compute mean
910values. We run all simulators on the full version (scale 1.0) of the
911microcircuit. Due to time constraints, we only synthesize a handful of
912simulators for every family and parameter combination. Some simulators
913could achieve better performance with a more exhaustive search.
914
915\subsection{Correctness}
916\begin{figure*}
917 \center
918 \includegraphics[width=\linewidth]{scatter}
919 \caption{Spike plot of the first 1000 ms of simulation. The
920 rhythmic nature of the microcircuit’s spiking pattern is apparent.}
921 \label{fig:scatter}
922\end{figure*}
923\begin{figure*}
924 \begin{subfigure}{.33\textwidth}
925 \centering
926 \includegraphics[width=0.9\linewidth]{counts}
927 \caption{Spiking rate}
928 \end{subfigure}
929 \begin{subfigure}{.33\textwidth}
930 \centering
931 \includegraphics[width=0.9\linewidth]{cvisi}
932 \caption{CV ISI}
933 \end{subfigure}
934 \begin{subfigure}{.33\textwidth}
935 \centering
936 \includegraphics[width=0.9\linewidth]{pearsonccs}
937 \caption{Pearson corr. coeff.}
938 \end{subfigure}
939 \caption{Kernel density estimates of \textbf{a)} spikes per second,
940 \textbf{b)} covariance of interspike intervals, and \textbf{c)}
941 Pearson correlation coefficient between binned spike trains for
942 neuron samples. Blue lines for NEST, gold for gmem/mono/s, and
943 green for jit/mono/s.}
944 \label{fig:stats}
945\end{figure*}
946\begin{figure}
947 \center
948 \includegraphics[width=\linewidth]{kldiv}
949 \caption{Kullback-Leibler (KL) divergence between distributions of
950 single-neuron firing rates, CV ISIs, and Pearson correlation
951 coefficients. Blue bars show KL divergences between NEST
952 simulations run on different random seeds, gold and green bars
953 between NEST and gmem/mono/s and jit/mono/s run on the same random
954 seeds.}
955 \label{fig:kldiv}
956\end{figure}
957
958We verify our simulators’ correctness using the same methods
959\cite{knight2018}; \cite{vanalbada2018}; \citep{potjans2014}; and
960others used. I.e., we simulate the microcircuit initialized with the
961same random seed with our simulators and with grid-based
962(double-precision) NEST which we treat as our reference. We discard
963spikes during the first second and compute the following statistics
964over the remaining nine seconds over the eight populations; rate of
965spikes per second, covariant of interspike intervals, and Pearson
966correlation coefficient over binned spike trains. We smooth each
967distribution with Gaussian kernel density estimation with bandwidth
968selected using Scott’s Rule.
969
970\Cref{fig:stats} shows the distributions plotted for NEST in blue, the
971gmem/mono/s family in gold, and the jit/multi/s family in green. The
972plots indicate that the simulators produce distributions that are very
973similar to NEST’s which implies that the accuracy loss caused by the
974reduced numerical precision is negligible. We do not plot the
975distributions for our other families of single-precision simulators as
976they are even closer to NEST’s distributions. Neither do we plot
977distributions for our double-precision simulators as they produce
978results that are spike-for-spike identical to NEST. The
979Kullback-Leibler (KL) divergence between the distributions, shown in
980\cref{fig:kldiv}, quantifies the apparent similarities. The figure’s
981blue, gold, and green bars show the KL divergences between two NEST
982simulations initialized with different random seeds, between NEST and
983gmem/mono/s, and between NEST and jit/mono/s. The latter two
984initialized with identical seeds. The divergence between two random
985seeds are much larger than between NEST and our simulators, indicating
986that they are accurate.
987
988The non-associativity of IEEE 754 floating-point is the main reason
989for the small differences. The order presynaptic current is added
990depends on the spike transfer algorithm, which affects rounding. The
991differences are stochastic and do not bias the result.
992
993\subsection{Simulation speed}
994
995%% \begin{table*}
996%% \centering
997%% \footnotesize
998%% \begin{tabular}{crrrrrrrrrr}
999%% \toprule
1000%% \textbf{Simulator} & \textbf{UW} & \textbf{SU} & \textbf{H} & \textbf{SC} & \textbf{RTF} & \textbf{Fmax} & \textbf{ALUT} & \textbf{FF} & \textbf{RAM} & \textbf{DSP}\\
1001%% \midrule
1002%% gmem/mono/s & 1 & 2 & n/a & n/a & 5.84 & 608 & 5% & 3% & 13% & 0%\\
1003%% ’’ & 2 & 2 & n/a & n/a & 5.24 & 608 & 5% & 3% & 14% & 0%\\
1004%% ’’ & 4 & 2 & n/a & n/a & 4.88 & 608 & 5% & 3% & 14% & 1%\\
1005%% ’’ & 8 & 2 & n/a & n/a & 4.78 & 609 & 6% & 3% & 13% & 1%\\
1006%% ’’ & 8 & 4 & n/a & n/a & 4.82 & 607 & 7% & 4% & 15% & 1%\\
1007%% ’’ & 8 & 8 & n/a & n/a & 5.07 & 572 & 8% & 5% & 17% & 1%\\
1008%% ’’ & 16 & 2 & n/a & n/a & 4.63 & 607 & 7% & 4% & 14% & 1%\\
1009%% ’’ & 16 & 4 & n/a & n/a & 4.72 & 607 & 8% & 4% & 15% & 2%\\
1010%% ’’ & 32 & 2 & n/a & n/a & 4.66 & 600 & 9% & 4% & 14% & 3%\\
1011%% \specialrule{0.25pt}{1pt}{1pt}
1012%% gmem/mono/d & 8 & 2 & n/a & n/a & 4.75 & 612 & 16% & 7% & 19% & 2%\\
1013%% ’’ & 16 & 2 & n/a & n/a & 4.65 & 600 & 27% & 11% & 19% & 4%\\
1014%% ’’ & 32 & 2 & n/a & n/a & 4.64 & 602 & 49% & 20% & 21% & 9%\\
1015%% \specialrule{0.25pt}{1pt}{1pt}
1016%% gmem/multi/s & 8 & 2 & n/a & n/a & 4.77 & 611 & 5% & 3% & 13% & 1%\\
1017%% ’’ & 16 & 2 & n/a & n/a & 4.92 & 608 & 6% & 3% & 12% & 1%\\
1018%% \specialrule{0.25pt}{1pt}{1pt}
1019%% gmem/multi/d & 8 & 2 & n/a & n/a & 4.85 & 609 & 15% & 7% & 18% & 2%\\
1020%% ’’ & 32 & 2 & n/a & n/a & 4.89 & 609 & 47% & 19% & 20% & 9%\\
1021%% \specialrule{0.25pt}{1pt}{1pt}
1022%% jit/mono/s & 2 & n/a & n/a & 2 & 5.07 & 602 & 5% & 3% & 12% & 0%\\
1023%% ’’ & 4 & n/a & n/a & 4 & 4.64 & 611 & 5% & 3% & 12% & 0%\\
1024%% ’’ & 8 & n/a & n/a & 8 & 4.41 & 596 & 6% & 3% & 12% & 1%\\
1025%% ’’ & 16 & n/a & n/a & 16 & 4.58 & 614 & 7% & 3% & 12% & 2%\\
1026%% \specialrule{0.25pt}{1pt}{1pt}
1027%% jit/mono/d & 8 & n/a & n/a & 8 & 4.73 & 612 & 16% & 6% & 19% & 2%\\
1028%% \specialrule{0.25pt}{1pt}{1pt}
1029%% jit/multi/s & 8 & n/a & n/a & 8 & 4.67 & 610 & 7% & 3% & 15% & 1%\\
1030%% \specialrule{0.25pt}{1pt}{1pt}
1031%% jit/multi/d & 4 & n/a & n/a & 4 & 5.27 & 604 & 11% & 5% & 24% & 1%\\
1032%% ’’ & 8 & n/a & n/a & 8 & 5.17 & 604 & 17% & 7% & 24% & 2%\\
1033%% \specialrule{0.25pt}{1pt}{1pt}
1034%% horiz/mono/s & 32 & n/a & 16 & 32 & 0.81 & 605 & 10% & 4% & 70% & 4%\\
1035%% ’’ & 64 & n/a & 16 & 64 & 0.87 & 577 & 14% & 6% & 72% & 7%\\
1036%% \specialrule{0.25pt}{1pt}{1pt}
1037%% horiz/mono/d & 16 & n/a & 16 & 16 & 0.84 & 602 & 27% & 11% & 77% & 5%\\
1038%% ’’ & 32 & n/a & 16 & 32 & 0.81 & 597 & 48% & 19% & 76% & 9%\\
1039%% \specialrule{0.25pt}{1pt}{1pt}
1040%% horiz/multi/s & 32 & n/a & 16 & 32 & 1.30 & 609 & 10% & 5% & 72% & 4%\\
1041%% \specialrule{0.25pt}{1pt}{1pt}
1042%% horiz/multi/d & 16 & n/a & 16 & 32 & 0.81 & 575 & xx% & xx% & xx% & x%\\
1043%% \bottomrule
1044%% \end{tabular}
1045%% \caption{Performance of some configurations of our simulators (Old make new!)}
1046%% \label{tbl:perf2}
1047%% \end{table*}
1048
1049\begin{table*}
1050 \centering
1051 \footnotesize
1052 \begin{tabular}{crrrrrrrrrrrr}
1053 \toprule
1054 \textbf{Simulator} & \textbf{UW} & \textbf{SU} & \textbf{H} & \textbf{SC} & \textbf{LC} & \textbf{RTF} & \textbf{Freq.} & \textbf{ALUT} & \textbf{Reg.} & \textbf{ALM} & \textbf{M20K} & \textbf{DSP}\\
1055 \midrule
1056 gmem/mono/s & 8 & 2 & n/a & n/a & n/a & 4.79 & 601 & 126k & 313k & 21% & 23% & 1%\\
1057 ’’ & 64 & 2 & n/a & n/a & n/a & 4.93 & 585 & 155k & 391k & 26% & 24% & 2%\\
1058 \specialrule{0.25pt}{1pt}{1pt}
1059 gmem/mono/d & 8 & 1 & n/a & n/a & n/a & 5.38 & 608 & 141k & 371k & 24% & 27% & 2%\\
1060 ’’ & 8 & 2 & n/a & n/a & n/a & 4.71 & 608 & 147k & 394k & 25% & 28% & 2%\\
1061 ’’ & 16 & 2 & n/a & n/a & n/a & 4.61 & 605 & 182k & 477k & 30% & 28% & 4%\\
1062 \specialrule{0.25pt}{1pt}{1pt}
1063 gmem/multi/s & 4 & 4 & n/a & n/a & n/a & 4.91 & 600 & 122k & 294k & 20% & 22% & 0%\\
1064 ’’ & 64 & 1 & n/a & n/a & n/a & 5.65 & 585 & 128k & 340k & 22% & 22% & 6%\\
1065 \specialrule{0.25pt}{1pt}{1pt}
1066 gmem/multi/d & 8 & 1 & n/a & n/a & n/a & 5.52 & 609 & 139k & 365k & 23% & 26% & 2%\\
1067 ’’ & 8 & 2 & n/a & n/a & n/a & 4.86 & 605 & 145k & 360k & 24% & 27% & 2%\\
1068 \specialrule{0.25pt}{1pt}{1pt}
1069 horiz/mono/s & 32 & n/a & 16 & 32 & n/a & 0.81 & 608 & 127k & 329k & 22% & 79% & 4%\\
1070 ’’ & 64 & n/a & 16 & 16 & n/a & 0.85 & 604 & 143k & 375k & 25% & 80% & 7%\\
1071 \specialrule{0.25pt}{1pt}{1pt}
1072 horiz/mono/d & 4 & n/a & 8 & 4 & n/a & 1.74 & 609 & 122k & 333k & 21% & 55% & 1%\\
1073 ’’ & 32 & n/a & 16 & 32 & n/a & 0.82 & 601 & 258k & 653k & 42% & 85% & 9%\\
1074 \specialrule{0.25pt}{1pt}{1pt}
1075 horiz/multi/s & 4 & n/a & 8 & 4 & n/a & 1.73 & 610 & 108k & 287k & 19% & 51% & 1%\\
1076 ’’ & 16 & n/a & 16 & 16 & n/a & 0.81 & 605 & 123k & 330k & 22% & 80% & 2%\\
1077 ’’ & 32 & n/a & 16 & 32 & n/a & 0.79 & 601 & 133k & 333k & 23% & 81% & 4%\\
1078 \specialrule{0.25pt}{1pt}{1pt}
1079 horiz/multi/d & 4 & n/a & 8 & 4 & n/a & 1.79 & 583 & 126k & 319k & 22% & 57% & 1%\\
1080 ’’ & 16 & n/a & 16 & 16 & n/a & 0.80 & 607 & 187k & 492k & 32% & 85% & 5%\\
1081 ’’ & 32 & n/a & 16 & 32 & n/a & 0.79 & 600 & 265k & 664k & 43% & 86% & 9%\\
1082 \specialrule{0.25pt}{1pt}{1pt}
1083 jit/mono/s & 16 & n/a & n/a & 16 & 16 & 1.47 & 590 & 116k & 319k & 22% & 79% & 7%\\
1084 \specialrule{0.25pt}{1pt}{1pt}
1085 jit/mono/d & 4 & n/a & n/a & 4 & 8 & 2.23 & 611 & 122k & 317k & 21% & 56% & 2%\\
1086 ’’ & 16 & n/a & n/a & 16 & 16 & 1.44 & 597 & 182k & 478k & 32% & 85% & 10%\\
1087 ’’ & 32 & n/a & n/a & 32 & 16 & 1.50 & 584 & 268k & 704k & 44% & 85% & 20%\\
1088 \specialrule{0.25pt}{1pt}{1pt}
1089 jit/multi/s & 16 & n/a & n/a & 16 & 16 & 1.43 & 593 & 120k & 325k & 22% & 79% & 7%\\
1090 \specialrule{0.25pt}{1pt}{1pt}
1091 jit/multi/d & 4 & n/a & n/a & 4 & 8 & 2.21 & 600 & 125k & 317k & 22% & 55% & 2%\\
1092 ’’ & 8 & n/a & n/a & 8 & 16 & 1.56 & 602 & 147k & 393k & 26% & 84% & 5%\\
1093 ’’ & 16 & n/a & n/a & 16 & 16 & 1.34 & 606 & 186k & 492k & 33% & 85% & 10%\\
1094 \bottomrule
1095 \end{tabular}
1096 \caption{Speed and resource usage for some simulators. The table
1097 includes each family’s fastest simulator and some
1098 others for comparison purposes.}
1099 \label{tbl:perf3}
1100\end{table*}
1101
1102\begin{figure}
1103 \center
1104 \includegraphics[width=\linewidth]{update_widths}
1105 \caption{RTF as a function of update width for the fastest gmem/mono/s
1106 (blue), gmem/mono/d (gold), gmem/multi/s (green), and gmem/multi/d
1107 (red) simulators. Update widths larger than four does not decrease
1108 RTF.}
1109 \label{fig:uwidth}
1110\end{figure}
1111
1112\Cref{tbl:perf3} presents the performance and resource usage of our
1113fastest simulator configuraions. The first column shows the
1114simulator’s family in slash-notation (see \cref{sec:impl}). The next
1115five its parameters; \textit{update width} (UW), \textit{synapse
1116 unroll} (SU), \textit{horizon length} (H), \textit{synapse classes}
1117(SC), and \textit{lane count} (LC). The following column shows its
1118real-time factor (RTF), defined as the time taken to run the
1119simulation -- wall-clock time -- divided by the duration of the
1120simulated biological time (10 seconds). We measure the wall-clock time
1121as the time from the first \verb!clEnqueueNDRangeKernel! call until the
1122final simulation result can be read from the FPGA’s memory. As the RTF
1123does not vary beyond two decimal places even on runs on networks
1124initialized with different random seeds, we just report its mean. The
1125next columns shows the simulator’s operating frequency in MHz and FPGA
1126resource usage. \Cref{fig:rtf} plots the RTF of the fastest
1127simulators from every family as a bar chart.
1128
1129A startling finding is that there is no significant performance
1130difference between double and single precision neuron state and
1131between multi- and single-kernel implementations. The fastest
1132horiz/mono/s and horiz/multi/d simulators RTF is 0.81 and 0.79 which
1133is very close to each other. Presumably, the spike transfer phase is
1134much more expensive than the neuron update phase so making the latter
1135run faster, either by using single-precision or by overlap** the
1136update and transfer phases, does not improve performance. For the same
1137reason, increasing the gmem simulator’s \textit{update width} is
1138ineffective, as \cref{fig:uwidth} shows. Doubling the \textit{update
1139 width} roughly halves the number of cycles spent updating neuron
1140state, but even that is insignificant. The performance even detoriates
1141for \textit{update width} larger than 16. Likely because the
1142FPGA’s DDR interface is 512 bits wide.
1143
1144Neither the multi- nor single-kernel gmem simulators benefit from
1145more \textit{synapse unroll}. The reason could be because
1146the unrolled versions of the loop contains a false memory
1147dependency. There is no way of letting the OpenCL compiler know
1148that two iterations of the loop body -- \verb!W[t + d, j] += w! --
1149writes to distinct memory locations so the compiler refuses to
1150schedule multiple writes per clock cycle.
1151
1152The jit and horiz simulators are markedly faster than the gmem
1153simulators because they transfer spikes in on-chip memory. The best
1154gmem simulator has an RTF of 4.61 while it is 1.50 for the best jit
1155simulator and 0.79 for the best horiz simulator. The latter simulators
1156also uses up to 90% of the on-chip memory for storing the horizon and
1157lane buffers. The results suggest that the larger these buffers are
1158the better the performance. Unfortunately, our FPGA can not fit
1159horizons longer than 16 time steps or more than 16 lanes. The horiz
1160simulators’ performance edge over the jit simulators is due to them
1161running the spike transfer loop fewer times. The jit simulators run it
1162$d_\mathrm{max}$ times (i.e. 64) for every spiking neuron, while the
1163horizon simulators only runs it $d_\mathrm{max}/h$ times (i.e.,
116464/16=4 times). The horiz simulators also does not have to sum the
1165incoming current from multiple lanes when updating the neurons state.
1166
1167\begin{figure*}
1168 \center
1169 \includegraphics[width=\linewidth]{synapse_classes}
1170 \caption{The horizon and JIT algorithms’ RTF as a function of the
1171 number of \textit{synapse classes}. The \textit{horizon length}
1172 and \textit{lane count} parameters are both 16. The four colors
1173 represent the four implementation styles.}
1174 \label{fig:synapse_classes}
1175\end{figure*}
1176
1177\Cref{fig:synapse_classes} plots RTF as a function of the number of
1178\textit{synapse classes} for the horiz and jit simulators. For both,
1179performance improves until the parameter reaches 32. The more classes,
1180the more synapses can be activated in parallel which, clearly, is
1181important for performance. The drawback of increasing the number of
1182classes is memory waste. Especially for the jit simulators which
1183iterate many fewer times per spike transfer loop. With 16 classes the
1184average occupancy is only about 74%. It decreases to 47% with 32
1185classes, meaning that the simulator accesses more than twice as much
1186data than it needs. Interestingly, there is no performance
1187penalty. The benefits of handling many synapses in parallel is worth
1188lots of extra off-chip memory reads. Perhaps since we store the synapses
1189in contiguous memory reading them is quite cheap.
1190
1191All simulators run at around 600 MHz, considered very good for HLS. In
1192our experience, operating frequencies much lower than that were
1193caused by inefficiently banked on-chip memory, unaccounted for
1194loop-carried dependencies, or similar issues.
1195
1196%% The fastest horiz and jit simulators have a \textit{horizon length}
1197%% and \textit{lane count} of 16.
1198
1199%% Both the horiz and jit simulators transfer spikes through local memory
1200%% and have a large performance advantage compared to the gmem
1201%% simulators. These simulators use between 80 to 90% of local memory
1202%% and would surely benefit from having even more available. A \textit{horizon length
1203
1204%% Doubling the
1205%% \textit{horizon length} would require twice as much local memory, but
1206%% would cut the number of times the neuron’s synapses had to be accessed
1207%% in half.
1208
1209%% The performance of the gmem simulators is comparable to that of the
1210%% lmem/jit simulators, even though the former transmit spikes through
1211%% global memory. As they activate all of the spiking neuron’s synapses
1212%% in one go -- rather than spreading out the activations over multiple
1213%% time steps -- they are simpler and require less bookkee** than their
1214%% local-memory-using counterparts. Likely, this is what makes them
1215%% competitive. Neither the monolithic or multi-kernel versions of the
1216%% simulators benefit from having their spike propagation loops unrolled
1217%% more than four times because the unrolled version of the loop contains
1218%% a false memory dependency. As there is no way of letting the OpenCL
1219%% compiler know that two iterations of the loop body --
1220%% \verb!W[t + d, j] += w! -- always writes to distinct memory addresses
1221%% the compiler refuses to schedule multiple writes per cycle.
1222
1223
1224
1225%% The results also show that the multi-kernel implementations are less
1226%% efficient than the single-kernel ones. While channels are low-overhead
1227%% on Intel FPGAs, the overhead is still present and offsets whatever
1228%% benefit one gets from running the algorithm’s phases in
1229%% parallel. Presumably, because spike propagation is by far the most
1230%% expensive phase. This is in concordance with another observation; the
1231%% speed difference between single and double precision neuronal state is
1232%% negligible. Performance is not limited by compute resources, but by
1233%% memory bandwidth. However, double precision still carries a penalty in
1234%% that we have to use twice as much block memory to store the neuronal
1235%% state. The number of DSPs required is also larger than for
1236%% single-precision implementations.
1237
1238%% The lmem/horiz simulators are the fastest by a wide margin. As the
1239%% performance difference between the mono and multi variants indicate,
1240%% they benefit greatly from running the update and propagation phases in
1241%% parallel. This is because they use local memory to transmit spikes so
1242%% the propagation phase is, relatively speaking, cheaper and the neuron
1243%% update phase costlier. Thus, the benefit of running the phases in
1244%% parallel is bigger. Moreover, the usage of channels means that kernels
1245%% do not have to produced buffers of spiking neurons that are then
1246%% consumed in a future phase. Instead, the kernels can send spiking
1247%% neurons to the consuming kernel directly. All kernels of this family
1248%% benefit from larger horizons. Larger horizons mean that more synapses
1249%% can be activated simultaneously which is good for parallelization. It
1250%% also meshes well with larger \textit{synapse classes} -- the more
1251%% synapses per class the less memory waste. Unfortunately, the board’s
1252%% block ram can not fit horizons larger than 16.
1253
1254\begin{figure}
1255 \center
1256 \includegraphics[width=\linewidth]{rtf_bars}
1257 \caption{RTF of the fastest simulator of each type. The four colors
1258 represent the four implementation styles.}
1259 \label{fig:rtf}
1260\end{figure}
1261
1262
1263\subsection{Energy Usage}
1264
1265We use the Terasic Dashboard GUI to measure the energy usage of our
1266fastest simulator -- horiz/multi/s with the parameters H=16, UW=32,
1267and SC=32. The dashboard connects to the Agilex 7 via a MAX 10 device
1268that continuously monitors the voltage and current rails going into
1269the board and the FPGA itself \citep{dkug}, allowing us to measure
1270both a ‘‘pessimistic’’ power consumption for the whole board (which
1271includes unused peripherals that draw power) and an ‘‘optimistic’’ one
1272for only the FPGA fabric. Due to the device’s low sampling rate we
1273compute the total energy usage as the product of the maximum power
1274draw and the simulation time. According to our measurements, the
1275simulator pessimistically requires 44.9 W $\cdot$ 8.13 s = 101 mWh and
1276optimistically 16.3 W $\cdot$ 8.13 s = 37 mWh to simulate 10 seconds
1277of biological time. The pessimistic energy per synaptic event (metric
1278defined in \cite{vanalbada2018}) is 21 nJ and the optimistic one 9
1279nJ. These values are upper bounds and the actual energy usage may be
1280lower.
1281
1282\section{Discussion}
1283\label{sec:disc}
1284
1285Our main contribution is the presentation and analysis of methods for
1286creating FPGA-based simulators competitive in speed and energy usage
1287with the state-of-the-art in SNN simulation. We hope that others will
1288create even faster simulators by adopting and refining our algorithms
1289and implementation techniques for their hardware. Some methods are
1290endogenous to our hardware. For example, by running multiple kernels
1291in parallel that communicate with each other via channels, we overlap
1292different phases of the simulation algorithm. This technique has no
1293direct GPU-equivalent as different kernel types cannot communicate. It
1294also relies on the Intel-specific channel extension and may not work
1295well even on other vendors’ FPGAs. On the other hand, the technique
1296for interleaving synapses to improve banking and reduce conflicts
1297(\cref{sec:disjoint}) is adaptable to non-FPGA hardware. The same goes
1298for the horizon-based transfer algorithm which should suit any device
1299with enough fast local memory.
1300
1301Push-based SNN simulation mandates irregular memory accesses and has
1302low computational intensity; qualities that it shares with many other
1303graph processing problems. This is evidenced by the fact that most of
1304our simulators use less than 10% of the FPGA’s DSP resources, while
1305many consume over 80% of its on-chip memory. More or even faster
1306arithmetic resources would be useless. However, more on-chip
1307memory would be very beneficial because we could use the basic spike
1308transfer algorithm and would not need to bother with the more complex
1309horizon algorithm. And higher bandwidth and lower latency memories
1310would allow the simulators to transmit spikes faster. This finding
1311implies that simulators would probably achieve excellent performance
1312with neuron models more complex than LIF -- as long as the
1313interactions between neurons remain the same and as long as their
1314memory consumption does not grow.
1315
1316\subsection{FPGAs for HPC}
1317
1318We believe that FPGAs have unique advantages for HPC; they can be
1319tailored for the problem at hand thanks to their
1320malleability. However, FPGAs and particularily \textit{tools for
1321 designing for FPGAs} have many glaring weaknesses. Unlike compilers
1322for software, which emit the same machine code for the same source
1323code, synthesizers use optimization algorithms based on randomness so
1324good performance is contingent on choosing lucky random seeds. Even if
1325the differences between lucky and unlucky seeds are small, the
1326randomness makes squeezing out the last few percent of performance
1327frustrating. An algorithmic change or parameter tuning causing a small
1328performance improvement may be due to chance. Moreover, a single
1329synthesis can take several hours even on top-of-the-line hardware,
1330compounding this problem. Writing performance-critical code is an
1331experimental process, wherein one needs to test hundreds or thousands
1332of ideas to see what yields the best performance. Long turn-around
1333times slows the process down. Simulators do not help as their
1334performance does not reflect the performance of real hardware. For
1335these reasons, FPGAs are decidedly more complex to develop for than
1336GPUs.
1337
1338What is a good distribution of FPGA resources for HPC? For us, our
1339board’s distribution is far from perfect. Our fastest designs used
1340almost all on-chip memory, with plenty of ALUTs, FFs, and DSPs to
1341spare. If our use-case and designs are close to the norm then it would
1342be wise for FPGA vendors to trade-off logic resources for more on-chip
1343memory. And trends in the HPC field indicate that memory -- not
1344arithmetic -- very much is a limiting factor for many
1345problems. However, inevitably, whatever resource distribution the
1346vendor chooses, it will not be ideal for some designs. Optimizing an
1347FPGA so that as many designs as possible can take advantage of as much
1348of its resources as possible seems exceedingly difficult.
1349
1350\subsection{HLS for HPC}
1351
1352In this work we choose HLS over traditional design methods because of
1353its purported productivity advantages. How much performance did HLS
1354cost us and how much longer would it have taken us to implement the
1355simulators in an HDL? The question is an instance of the classic
1356dichotomy between performance and productivity that appears in many
1357corners of computer science. The literature suggests that in general
1358the productivity gains of HLS are large and the performance losses are
1359either non-existent or low \citep{lahti2018, pelcat2016}. However, one
1360can ask whether this holds true for performance-critical design?
1361
1362HLS definitely allowed us to explore algorithmic ideas at a rapid
1363pace. In particular, scheduling stallable loops and interfacing with
1364DDR would have been at least an order of magnitude more work to
1365implement (and debug!) in an HDL. It also helped that most -- but not
1366all -- of the OpenCL code we wrote were runnable verbatim on non-FPGA
1367targets. However, for performance judicious use of compiler
1368directivies is essential, something we struggled with. For example, adding
1369\verb!#pragma disable_loop_pipelining! on a loop in a gmem simulator
1370increased its operating frequency by almost 200 MHz since the compiler
1371could deduce that a memory port would not be shared across loop
1372iterations. On the other hand, removing the directive on a loop in a
1373horiz simulator more than doubled its performance! On several
1374occassions, misplaced \verb!#pragma ivdep! directives caused bugs that
1375were difficult to troubleshoot. Having to ‘‘nudge’’ the compiler via
1376directives to make the right decisions made us feel like we were not
1377fully in control and was at times frustrating.
1378
1379As we have no baseline to compare with, estimating the performance
1380cost of HLS is difficult. Our designs run at over 600 MHz which --
1381while a fair bit lower than the theoretical limit -- is in the upper
1382range of what typical Agilex 7 HDL designs runs at. If we are correct
1383in that performance is mostly bounded by memory resources, then
1384non-optimal performance is due to algorithmic choices and not due to
1385the choice of implementation language.
1386
1387\subsection{Future Research}
1388
1389\begin{figure}
1390 \center
1391 \includegraphics[width=\linewidth]{delays}
1392 \caption{Synaptic delay distributions for excitatory (blue) and
1393 inhibitory synapses (gold). Almost all probability mass lies to
1394 the left of 32 (dashed line).}
1395 \label{fig:delays}
1396\end{figure}
1397
1398Time constraints forced us to leave many ideas for better performance
1399unexplored. We list some of them here.
1400
1401Our results demonstrate that single-precision floating-point is
1402sufficient. Half-precision or some 16-bit fixed-point representation
1403could also be sufficient. If so, four bytes would be enough to
1404represent synapses which would -- more or less -- cut our simulator’s
1405memory demand in half.\footnote{Assuming we store some bits of the
1406delay and destination neuron index implicity in the index.} And, as the
1407synapses current is sampled from Gaussians with known means, the
1408representation’s size could perhaps be reduced further by storing the
1409current as \textit{deviations from the mean}. Presumably, fewer bits are
1410needed to represent the deviations accurately since they are rather
1411small.
1412
1413Along the same lines, one could exploit the fact that, as
1414\cref{fig:delays} shows, synaptic delays are not uniformly
1415distributed. Something our algorithms are oblivious to. It means, for
1416example, that the gmem simulators needto use off-chip memory for the
1417spike transfer array. One could perhaps retain the spike transfer
1418array for ‘‘fast’’ synapses with delays shorter than, say, 32 and
1419another more space-efficient format for ‘‘slow’’ synapses with longer
1420delays.
1421
1422As we have emphasized, we cannot safely transfer spikes from two
1423neurons simultaneoulsy as they may write to the same memory
1424locations. But this is only true if they have synapses with the same
1425destination neuron and delay. We can apply the idea from
1426\cref{sec:disjoint} on the ‘‘neuron level’’ and partition the
1427neurons into disjoint classes so that synapses of neurons of
1428different classes never trigger writes to the same memory locations.
1429
1430Our FPGA has four DDR memories each equipped with a 512 bit read/write
1431port. We are likely not utilizing all bandwidth properly since we use
1432the compiler’s default off-chip memory organization. Plausibly, we
1433could see a substantial performance improvement by carefully laying
1434out off-chip memory manually. For example, by distributing the
1435synaptic data (by far the largest data structure) over the four
1436memories.
1437
1438Another line of research could be to deploy a cluster of FPGAs for SNN
1439simulation. As the communication between the devices likely becomes
1440the bottleneck, it is a much different problem from using one device.
1441
1442\subsection{State-of-the-art}
1443\label{sec:sota}
1444\begin{table*}
1445 \centering
1446 \footnotesize
1447 \begin{tabular}{lllrrr}
1448 \toprule
1449 \textbf{Work} & \textbf{Simulator} & \textbf{Hardware} & \textbf{Node} & \textbf{RTF} & \textbf{Syn.Ev En.}\\
1450 \midrule
1451 \textbf{This work} & \textbf{horiz/mono/s} & \textbf{1 Agilex 7 FPGA} & \textbf{10} & \textbf{0.81} & \textbf{25}\\
1452 \cite{heittmann2022} & IBM INC-3000 & 432 Xilinx XC Z7045 SoC & 28 & 0.25 & 783\\
1453 \cite{kauth2023b} & neuroAIx & 35 NetFPGA SUME & 28 & 0.05 & 48\\
1454 \cite{golosio2021} & NeuronGPU & 1 GeForce RTX 2080 Ti & 12 & 1.06 & 180\\
1455 \cite{golosio2021} & NeuronGPU & 1 Tesla V100 & 12 & 1.64 & -\\
1456 \cite{knight2018} & GeNN & 1 GeForce RTX 2080 Ti & 12 & 1.40 & -\\
1457 \cite{knight2018} & GeNN & 1 Tesla V100 & 12 & 2.16 & 470\\
1458 \cite{vanalbada2018} & SpiNNaker & 217 ASIC & 130 & 20.00 & 5900\\
1459 \cite{rhodes2020} & SpiNNaker & 318 ASIC & 130 & 1.00 & 600\\
1460 \cite{kurth2022} & NEST & 2 AMD EPYC Rome & 14 & 0.53 & 480\\
1461 \bottomrule
1462 \end{tabular}
1463 \caption{State-of-the-art in hardware-accelerated microcircuit
1464 simulation. Node refers to the technology node in nanometers, RTF
1465 how much slower than realtime the simulator runs (lower is
1466 better), and Syn.Ev En. estimated energy consumption per synaptic
1467 event in nano-Joule.}
1468 \label{tab:power}
1469\end{table*}
1470\begin{figure}
1471 \center
1472 \includegraphics[width=\linewidth]{competition}
1473 \caption{RTFs of some SNN simulators}
1474 \label{fig-rtf}
1475\end{figure}
1476\begin{figure}
1477 \center
1478 \includegraphics[width=\linewidth]{energy}
1479 \caption{Energy per synaptic event of some SNN simulators}
1480 \label{fig-energy}
1481\end{figure}
1482
1483While we are far from the first to have investigated FPGA-based SNN
1484simulation, our investigation is qualitatively different from most
1485others. First, our focus is accurate SNN simulation and not solving a
1486specific
1487task. \cite{gupta2020,han2020,li2021,zheng2021,carpegna2022,liu2023,carpegna2024}
1488all implement layered SNNs that excel at classifying images from the
1489MNIST dataset. For them classification accuracy is much more important
1490than simulation speed or energy consumption. Second, many FPGA
1491simulators simulate SNNs much smaller than the Potjans-Diesmann
1492microcircuit or uses non-LIF neurons, making their results
1493incomparable to ours. For example, \cite{pani2017} simulates up to 1,440
1494Izhikevich neurons in real-time and \cite{shama2020} simulates 150
1495Hodgkin-Huxley neurons on a Virtex-2 FPGA. In contrast, NeuroFlow by
1496\cite{cheung2016} simulates up to 600,000 neurons on a cluster of six
1497FPGAs four times slower than real-time. However, it uses a
14981 ms timestemp (ten times larger than the \textit{de facto} standard
14990.1 ms), organizes neurons into two-dimensional grids, and spatially
1500constrains synapses. That is, most synapses connect to the neuron’s
1501nearest neighbours. \cite{trensch2022} proposes a simulator for a
1502Zynq-7000 SoC and measure its performance on a network of 800
1503excitatory and 200 inhibitory Izhikevich neurons. Their architecture
1504divides the FPGA into 16 identical processing blocks, each supporting
150564 neurons or 1,024 neurons in total per chip. They argue that if a
1506cluster of their devices were deployed, it could simulate the
1507Potjans-Diesmann microcircuit seven times faster than
1508real-time. Though they did not test their claim.
1509
1510The simulators we think are the most similar to ours are: IBM INC-3000,
1511neruoAIx, NeuronGPU, GeNN, SpiNNaker, and NEST. IBM-INC 3000 and
1512neuroAIx are both FPGA clusters, consisting of 432 Xilinx XC boards
1513and 35 NetFPGA SUME boards respectively. IBM-INC 3000 is four
1514times faster than real-time and neuroAIx twenty time faster than
1515real-time \citep{kauth2023,heittmann2022}. GeNN and NeuronGPU are both
1516GPU-based simulators written in CUDA and runs on NVIDIA’s line of GPUs
1517\citep{knight2018,golosio2021}. On a Tesla V100 GPU, GeNN simulate
1518the microcircuit at about half the speed of real-time and NeuronGPU
15195% slower than real-time. SpiNNaker is an older dedicated
1520neuromorphic hardware system built out of hundreds of ASICs
1521\citep{furber2013}. NEST is a CPU-based simulator which runs
1522twice the speed of real-time on two AMD EPYC Rome nodes
1523\citep{kurth2022}.
1524
1525\Cref{tab:power,fig-rtf,fig-energy} show how our best simulator stack
1526up against the competition. The energy usage of our fastest
1527simulator compares favorably to other results. Likely
1528partially due to to the Agilex 7’s smaller technology node and to us
1529confining our implementation to one FPGA. \cite{kauth2023b} report a
1530much lower RTF than us, but use 35 boards, each drawing 26.54 W on
1531average, resulting in a higher total energy usage for the same amount
1532of biological time. They, like us, report an ‘‘all-inclusive’’ value
1533for the energy usage so a dedicated platform -- without any unused
1534peripherals -- could consume much less energy. It goes without
1535saying that fairly comparing performance of systems implemented on
1536different architectures, with different design trade-offs, and
1537accuracy constraints is tricky. One system with excellent performance
1538may be inadequate in other regards. For example, less configurability
1539may improve a system’s performance, but make it unusable for certain
1540applications. It should be noted that most simulators, unlike ours,
1541replace thalamic spikes with DC input which limits their
1542applicability.
1543
1544
1545
1546
1547
1548
1549
1550
1551%% Fairly comparing performance of systems implemented on different
1552%% architectures, with different design trade-offs, and accuracy
1553%% constraints is tricky. One system with excellent performance may be
1554%% inadequate in other regards. For example, less configurability may
1555%% improve a system’s performance, but make it unusable for certain
1556%% applications. Nevertheless, \cref{tab:power,fig-rtf,fig-energy}
1557%% compare our best simulator’s RTF and energy per synaptic event
1558%% (defined in \cref{sec:results}) with that of some well-known
1559%% simulators from the literature. GeNN and NeuronGPU are two GPU-based
1560%% simulators written in CUDA and runs on NVIDIA’s line of GPUs
1561%% \citep{knight2018, golosio2021}. On a Tesla V100 GPU, GeNN simulate
1562%% the microcircuit at about half the speed of real-time and NeuronGPU
1563%% 5% slower than real-time. IBM INC-3000 and neuroAIx are two FPGA
1564%% clusters that run four times faster than real-time and twenty times
1565%% faster than real-time \citep{kauth2023, heittmann2022}. They consist
1566%% of 432 Xilinx XC SoC boards and 35 NetFPGA SUME boards,
1567%% respectively. SpiNNaker is an older dedicated neuromorphic hardware
1568%% system built out of hundreds of ASICs \citep{furber2013}. It should be
1569%% noted that, unlike ours, most of these simulators replace thalamic
1570%% spikes with DC input.
1571
1572%% As \cref{fig-energy} indicates, the energy usage of our best simulator
1573%% compares favorably to published results. This is likely partially due
1574%% to to the Agilex 7’s smaller technology node and to us confining our
1575%% implementation to one FPGA. \cite{kauth2023b} report a much lower RTF
1576%% than us, but uses 35 boards, each of which on average draws 26.54 W
1577%% which results in more energy use for the same amount of biological
1578%% simulation time. They, like us, report an ‘‘all-inclusive’’ value for
1579%% the energy usage so a dedicated platform -- without any unused
1580%% peripherals -- would conceivably use much less energy.
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590%% In addition to the above presented simulators, a number of FPGA
1591%% implementations simulate relatively small networks, which we contend
1592%% is materially different from simulating large networks. The modular
1593%% and efficient FPGA design \cite{pani2017} created simulates networks
1594%% of up to 1,440 Izhikevich neurons in real-time with a time-step of 0.1
1595%% ms. However, their design assumed a fuly-connected connectome and did
1596%% not model multapses nor spike transmission over synapses. The
1597%% simulator created by \cite{shama2020} simulated a 150-neuron
1598%% Hodgkin-Huxley (HH) SNN on a Virtex-2 FPGA. While the HH neuronal
1599%% model is more complicated than the LIF model, the authors, like
1600%% \cite{pani2017}, did not account for spike transmission delays in
1601%% their simulator.
1602
1603%% Other implementations are not for simulation per se, but for executing
1604%% a specific task. Those include the layered FPGA-based
1605%% SNN-implementations presented in \cite{gupta2020, han2020, li2021,
1606%% zheng2021, carpagena2022, liu2023, carpegna2024} which classify
1607%% images from the MNIST dataset. They all use layered SNNs which are
1608%% materially different from disorganized ones.
1609
1610%% \section{State-of-the-art}
1611
1612%% \begin{table*}[t]
1613%% \centering
1614%% \footnotesize
1615%% \begin{tabular}{llr}
1616%% \toprule
1617%% \textbf{Article} & \textbf{Board} & \textbf{MNIST accuracy}\\
1618%% \midrule
1619%% \cite{zheng2021} & Xilinx ZCU102 & 90.53%\\
1620%% \bottomrule
1621%% \end{tabular}
1622%% \caption{SNNs for classification on one FPGA}
1623%% \end{table*}
1624
1625%% \begin{table*}
1626%% \centering
1627%% \footnotesize
1628%% \begin{tabular}{lllrr}
1629%% \toprule
1630%% \textbf{Article} & \textbf{Board} & \textbf{Neuron model} & \textbf{Neuron cnt.} & \textbf{Time step}\\
1631%% \midrule
1632%% \cite{pani2017} & Xilinx XC 6VLX240T & Izhikevich & 1,440 & 0.1 ms\\
1633%% \bottomrule
1634%% \end{tabular}
1635%% \caption{SNNs for simulation of small networks on one FPGA}
1636%% \end{table*}
1637
1638
1639%% \begin{table*}[t]
1640%% \centering
1641%% \footnotesize
1642%% \begin{tabular}{llllrrr}
1643%% \toprule
1644%% \textbf{Article} & \textbf{Hardware} & \textbf{Neuron} & \textbf{Connectome} & \textbf{Neurons} & \textbf{Synapses} & \textbf{Sim. Speed}\\
1645%% \midrule
1646%% \cite{han2020} & Single FPGA & LIF & ?? & ?? & ?? & ?\\
1647%% \cite{trensch2022} & Single FPGA & IZ & ?? & 1,000 & ?? & ??\\
1648%% \cite{carpegna2024} & Single FPGA & LIF & Layered & & & n/a\\
1649%% \bottomrule
1650%% \end{tabular}
1651%% \caption{Overview of simulators for disorganized SNNs}
1652%% \end{table*}
1653
1654%% In this section we gauge the state-of-the-art for SNN-simulation on
1655%% FPGAs and explain how it relates to the implementation we
1656%% propose. Characterizing the state-of-the-art is challenging due to
1657%% differences in hardware, network’s intended purpose, implementation
1658%% techniques, correctness constraints, simulation resolution,
1659%% etc. Performance is unlikely to scale linearly with network size; a
1660%% result from a small network can’t easily be compared with a result
1661%% from a larger network. For instance, memory bottlenecks can cause
1662%% catastrophic degradations in performance as memory demand
1663%% increases. And, as discussed in \cref{sec:topo}, topology is extremely
1664%% important so implementations for layered SNNs cannot be fruitfully
1665%% compared with simulators for more brain-like, disorganized SNNs even
1666%% when targeting the same hardware.
1667
1668%% Some simulators are designed for FPGA clusters, or other custom
1669%% large-scale architetures, which for obvious reasons makes it difficult
1670%% to compare their perfromance with simulators utilizing a single board
1671%% (such as ours). An example of an early simulator for FPGA clusters is
1672%% NeuroFlow \citep{cheung2016}. It simulated up to 600,000 neurons on
1673%% six FPGAs only about four times slower than real time. However, the
1674%% time step was set to 1 ms (ten times larger than the \textit{de facto}
1675%% standard 0.1 ms used in many other simulators), neurons were organized
1676%% into two-dimensional grids, and synapses were spatially
1677%% constrained. That is, most synapses connected a neuron with its
1678%% immediate north, south, east, or west neighbour. neuroAIx is a more
1679%% recent FPGA cluster-based simulator \citep{kauth2023}. It uses 35
1680%% NetFPGA SUME (SoC based on Virtex 7) nodes and simulates the
1681%% Potjans-Diesmann microcircuit 20 times faster than biological
1682%% realtime. Like many of its predecessors, it approximates thalamic
1683%% input as direct current. Another large-scale architecture is the
1684%% neural supercomputer IBM INC-3000 \citep{heittmann2022}. It uses a
1685%% whop** 432 Xilinx XC Z7045 SoC and simulates Potjans-Diesmann’s
1686%% microcircuit four times faster than realtime.
1687
1688
1689
1690%% For completeness sake, we also survey some GPU-based
1691%% SNN-simulators. GeNN presented in \cite{knight2018} simulates the
1692%% Potjans-Diesmann microcircuit at half the speed of real-time on a
1693%% Tesla V100 GPU. The GPU library NeuronGPU, presented in
1694%% \cite{golosio2021}, is capable of simulating the same network about
1695%% 5% slower than realtime.
1696
1697%% \cite{han2020} implemented an LIF SNN on an Xilinx ZC706 FPGA for
1698%% classifying images from the MNIST dataset. They found that the network
1699%% could classify 161 frames per second with an accuracy of 97%, only
1700%% drawing 0.477W of power. For updating neurons they used a novel hybrid
1701%% algorithm that was neither fully synchronous nor fully
1702%% asynchronous. They benchmarked their FPGA implementation against an
1703%% SNN implemented in PyTorch running on an NVIDIA P100 GPU. Since
1704%% asynchronous simulation is not suitable for GPUs, it used a
1705%% synchronous update algorithm instead. While the GPU implementation was
1706%% faster than the FPGA implementation it used almost ten times as much
1707%% power per image. As the focus of their research was classification
1708%% accuracy rather than \textit{simulation accuracy} it is not directly
1709%% comparable to ours. Furthermore, their network was an order of
1710%% magnitude smaller than ours, all their synapses were non-recurrent,
1711%% and their neurons organized into four fully-connected layers.
1712
1713%% \cite{trensch2022} designed a simulator for a Zynq-7000 SoC
1714%% device. They measured its performance on a two-population
1715%% Izhikevich-type neurons containing 800 excitatory and 200 inhibitory
1716%% neurons. Their architecture divides the available FPGA into 16
1717%% identical processing blocks which each supports 64 neurons. Thus, in
1718%% total it supports up to 1,024 neurons per chip. The authors argued
1719%% that if a cluster of their devices were deployed, it could simulate
1720%% the Potjans-Diesmann microcircuit seven times faster than
1721%% realtime. Though this claim was not tested. Unlike our simulator,
1722%% theirs do not account for thalamic input.
1723
1724
1725\section*{Author Contributions}
1726BAL and AP designed the study. BAL implemented the SNN framework and
1727performed the experiments. BAL and AP analyzed the results and
1728co-wrote the paper.
1729
1730\section*{Funding}
1731This work was supported by the Swedish Research Council through the
1732project ‘‘Building Digital Brains’’ (grant reference 2021-04579).
1733
1734\section*{Acknowledgments}
1735TODO
1736
1737
1738\section*{Supplemental Data}
1739 \href{http://home.frontiersin.org/about/author-guidelines#SupplementaryMaterial}{Supplementary Material} should be uploaded separately on submission, if there are Supplementary Figures, please include the caption in the same file as the figure. LaTeX Supplementary Material templates can be found in the Frontiers LaTeX folder.
1740
1741\section*{Data Availability Statement}
1742
1743All source code will eventually be available under a permissible Open
1744Source license.
1745
1746
1747\bibliographystyle{Frontiers-Harvard}
1748\bibliography{frontiers}
1749
1750\end{document}