-
The integrated low-level trigger and readout system of the CERN NA62 experiment
Authors:
R. Ammendola,
B. Angelucci,
M. Barbanera,
A. Biagioni,
V. Cerny,
B. Checcucci,
R. Fantechi,
F. Gonnella,
M. Koval,
M. Krivda,
G. Lamanna,
M. Lupi,
A. Lonardo,
A. Papi,
C. Parkinson,
E. Pedreschi,
P. Petrov,
R. Piandani,
J. Pinzino,
L. Pontisso,
M. Raggi,
D. Soldi,
M. S. Sozzi,
F. Spinella,
S. Venditti
, et al. (1 additional authors not shown)
Abstract:
The integrated low-level trigger and data acquisition (TDAQ) system of the NA62 experiment at CERN is described. The requirements of a large and fast data reduction in a high-rate environment for a medium-scale, distributed ensemble of many different sub-detectors led to the concept of a fully digital integrated system with good scaling capabilities. The NA62 TDAQ system is rather unique in offering full flexibility on this scale, allowing in principle any information available from the detector to be used for triggering. The design concept, implementation and performance from the first years of running are illustrated.
Submitted 25 March, 2019;
originally announced March 2019.
-
Real-time cortical simulations: energy and interconnect scaling on distributed systems
Authors:
Francesco Simula,
Elena Pastorelli,
Pier Stanislao Paolucci,
Michele Martinelli,
Alessandro Lonardo,
Andrea Biagioni,
Cristiano Capone,
Fabrizio Capuani,
Paolo Cretaro,
Giulia De Bonis,
Francesca Lo Cicero,
Luca Pontisso,
Piero Vicini,
Roberto Ammendola
Abstract:
We profile the impact of computation and inter-processor communication on the energy consumption and on the scaling of cortical simulations approaching the real-time regime on distributed computing platforms. Also, the speed and energy consumption of processor architectures typical of standard HPC and embedded platforms are compared. We demonstrate the importance of the design of low-latency interconnect for speed and energy consumption. The cost of cortical simulations is quantified using the Joule per synaptic event metric on both architectures. Reaching efficient real-time on large scale cortical simulations is of increasing relevance for both future bio-inspired artificial intelligence applications and for understanding the cognitive functions of the brain, a scientific quest that will require embedding large scale simulations into highly complex virtual or real worlds. This work stands at the crossroads between the WaveScalES experiment in the Human Brain Project (HBP), which includes the objective of large scale thalamo-cortical simulations of brain states and their transitions, and the ExaNeSt and EuroExa projects, which investigate the design of an ARM-based, low-power High Performance Computing (HPC) architecture with a dedicated interconnect scalable to millions of cores; simulations of deep sleep Slow Wave Activity (SWA) and Asynchronous aWake (AW) regimes expressed by thalamo-cortical models are among their benchmarks.
Submitted 26 November, 2019; v1 submitted 12 December, 2018;
originally announced December 2018.
-
Search for $K^{+}\rightarrow\pi^{+}\nu\overline{\nu}$ at NA62
Authors:
NA62 Collaboration,
G. Aglieri Rinella,
R. Aliberti,
F. Ambrosino,
R. Ammendola,
B. Angelucci,
A. Antonelli,
G. Anzivino,
R. Arcidiacono,
I. Azhinenko,
S. Balev,
M. Barbanera,
J. Bendotti,
A. Biagioni,
L. Bician,
C. Biino,
A. Bizzeti,
T. Blazek,
A. Blik,
B. Bloch-Devaux,
V. Bolotov,
V. Bonaiuto,
M. Boretto,
M. Bragadireanu,
D. Britton
, et al. (227 additional authors not shown)
Abstract:
$K^{+}\rightarrow\pi^{+}\nu\overline{\nu}$ is one of the theoretically cleanest meson decays in which to look for indirect effects of new physics complementary to LHC searches. The NA62 experiment at CERN SPS is designed to measure the branching ratio of this decay with 10\% precision. NA62 took data in pilot runs in 2014 and 2015 reaching the final designed beam intensity. The quality of 2015 data acquired, in view of the final measurement, will be presented.
Submitted 24 July, 2018;
originally announced July 2018.
-
Large Scale Low Power Computing System - Status of Network Design in ExaNeSt and EuroExa Projects
Authors:
Roberto Ammendola,
Andrea Biagioni,
Fabrizio Capuani,
Paolo Cretaro,
Giulia De Bonis,
Francesca Lo Cicero,
Alessandro Lonardo,
Michele Martinelli,
Pier Stanislao Paolucci,
Elena Pastorelli,
Luca Pontisso,
Francesco Simula,
Piero Vicini
Abstract:
The deployment of the next generation computing platform at ExaFlops scale requires solving new technological challenges mainly related to the impressive number (up to 10^6) of compute elements required. This impacts on system power consumption, in terms of feasibility and costs, and on system scalability and computing efficiency. In this perspective, analysis, exploration and evaluation of technologies characterized by low power, high efficiency and high degree of customization are strongly needed. Among the various European initiatives targeting the design of ExaFlops systems, ExaNeSt and EuroExa are EU-H2020 funded initiatives leveraging high end MPSoC FPGAs. Last generation MPSoC FPGAs can be seen as non-mainstream but powerful HPC Exascale enabling components thanks to the integration of embedded multi-core, ARM-based low power CPUs and a huge number of hardware resources usable to co-design application oriented accelerators and to develop a low latency high bandwidth network architecture. In this paper we introduce ExaNet, the FPGA-based, scalable, direct network architecture of the ExaNeSt system. ExaNet allows us to explore different interconnection topologies, to evaluate advanced routing functions for congestion control and fault tolerance and to design specific hardware components for acceleration of collective operations. After a brief introduction of the motivations and goals of the ExaNeSt and EuroExa projects, we will report on the status of network architecture design and its hardware/software testbed, adding preliminary bandwidth and latency achievements.
Submitted 11 April, 2018;
originally announced April 2018.
-
The Brain on Low Power Architectures - Efficient Simulation of Cortical Slow Waves and Asynchronous States
Authors:
Roberto Ammendola,
Andrea Biagioni,
Fabrizio Capuani,
Paolo Cretaro,
Giulia De Bonis,
Francesca Lo Cicero,
Alessandro Lonardo,
Michele Martinelli,
Pier Stanislao Paolucci,
Elena Pastorelli,
Luca Pontisso,
Francesco Simula,
Piero Vicini
Abstract:
Efficient brain simulation is a scientific grand challenge, a parallel/distributed coding challenge and a source of requirements and suggestions for future computing architectures. Indeed, the human brain includes about 10^15 synapses and 10^11 neurons activated at a mean rate of several Hz. Full brain simulation poses Exascale challenges even if simulated at the highest abstraction level. The WaveScalES experiment in the Human Brain Project (HBP) has the goal of matching experimental measures and simulations of slow waves during deep-sleep and anesthesia and the transition to other brain states. The focus is the development of dedicated large-scale parallel/distributed simulation technologies. The ExaNeSt project designs an ARM-based, low-power HPC architecture scalable to millions of cores, developing a dedicated scalable interconnect system, and SWA/AW simulations are included among the driving benchmarks. At the joint between both projects is the INFN proprietary Distributed and Plastic Spiking Neural Networks (DPSNN) simulation engine. DPSNN can be configured to stress either the networking or the computation features available on the execution platforms. The simulation stresses the networking component when the neural net, composed of a relatively low number of neurons, each one projecting thousands of synapses, is distributed over a large number of hardware cores. When growing the number of neurons per core, the computation starts to be the dominating component for short range connections. This paper reports preliminary performance results obtained on an ARM-based HPC prototype developed in the framework of the ExaNeSt project. Furthermore, a comparison is given of instantaneous power, total energy consumption, execution time and energetic cost per synaptic event of SWA/AW DPSNN simulations when executed on either ARM- or Intel-based server platforms.
Submitted 10 April, 2018;
originally announced April 2018.
-
Gaussian and exponential lateral connectivity on distributed spiking neural network simulation
Authors:
Elena Pastorelli,
Pier Stanislao Paolucci,
Francesco Simula,
Andrea Biagioni,
Fabrizio Capuani,
Paolo Cretaro,
Giulia De Bonis,
Francesca Lo Cicero,
Alessandro Lonardo,
Michele Martinelli,
Luca Pontisso,
Piero Vicini,
Roberto Ammendola
Abstract:
We measured the impact of long-range exponentially decaying intra-areal lateral connectivity on the scaling and memory occupation of a distributed spiking neural network simulator compared to that of short-range Gaussian decays. While previous studies adopted short-range connectivity, recent experimental neuroscience studies are pointing out the role of longer-range intra-areal connectivity with implications on neural simulation platforms. Two-dimensional grids of cortical columns composed of up to 11 M point-like spiking neurons with spike frequency adaptation were connected by up to 30 G synapses using short- and long-range connectivity models. The MPI processes composing the distributed simulator were run on up to 1024 hardware cores, hosted on a 64-node server platform. The hardware platform was a cluster of IBM NX360 M5 16-core compute nodes, each one containing two Intel Xeon Haswell 8-core E5-2630 v3 processors, with a clock of 2.40 GHz, interconnected through an InfiniBand network, equipped with 4x QDR switches.
Submitted 19 February, 2019; v1 submitted 23 March, 2018;
originally announced March 2018.
-
GPU-based Real-time Triggering in the NA62 Experiment
Authors:
R. Ammendola,
A. Biagioni,
P. Cretaro,
S. Di Lorenzo,
R. Fantechi,
M. Fiorini,
O. Frezza,
G. Lamanna,
F. Lo Cicero,
A. Lonardo,
M. Martinelli,
I. Neri,
P. S. Paolucci,
E. Pastorelli,
R. Piandani,
L. Pontisso,
D. Rossetti,
F. Simula,
M. Sozzi,
P. Vicini
Abstract:
Over the last few years the GPGPU (General-Purpose computing on Graphics Processing Units) paradigm represented a remarkable development in the world of computing. Computing for High-Energy Physics is no exception: several works have demonstrated the effectiveness of the integration of GPU-based systems in high level trigger of different experiments. On the other hand the use of GPUs in the low level trigger systems, characterized by stringent real-time constraints, such as tight time budget and high throughput, poses several challenges. In this paper we focus on the low level trigger in the CERN NA62 experiment, investigating the use of real-time computing on GPUs in this synchronous system. Our approach aimed at harvesting the GPU computing power to build in real-time refined physics-related trigger primitives for the RICH detector, as the knowledge of Cerenkov rings parameters allows building stringent conditions for data selection at trigger level. Latencies of all components of the trigger chain have been analyzed, pointing out that networking is the most critical one. To keep the latency of data transfer task under control, we devised NaNet, an FPGA-based PCIe Network Interface Card (NIC) with GPUDirect capabilities. For the processing task, we developed specific multiple ring trigger algorithms to leverage the parallel architecture of GPUs and increase the processing throughput to keep up with the high event rate. Results obtained during the first months of 2016 NA62 run are presented and discussed.
Submitted 13 June, 2016;
originally announced June 2016.
-
NaNet: a Low-Latency, Real-Time, Multi-Standard Network Interface Card with GPUDirect Features
Authors:
A. Lonardo,
F. Ameli,
R. Ammendola,
A. Biagioni,
O. Frezza,
G. Lamanna,
F. Lo Cicero,
M. Martinelli,
P. S. Paolucci,
E. Pastorelli,
L. Pontisso,
D. Rossetti,
F. Simeone,
F. Simula,
M. Sozzi,
L. Tosoratto,
P. Vicini
Abstract:
While the GPGPU paradigm is widely recognized as an effective approach to high performance computing, its adoption in low-latency, real-time systems is still in its early stages.
Although GPUs typically show deterministic behaviour in terms of latency in executing computational kernels as soon as data is available in their internal memories, assessment of real-time features of a standard GPGPU system needs careful characterization of all subsystems along data stream path.
The networking subsystem turns out to be the most critical one in terms of absolute value and fluctuations of its response latency.
Our envisioned solution to this issue is NaNet, an FPGA-based PCIe Network Interface Card (NIC) design featuring a configurable and extensible set of network channels with direct access through GPUDirect to NVIDIA Fermi/Kepler GPU memories.
NaNet design currently supports both standard (GbE (1000BASE-T) and 10GbE (10GBASE-R)) and custom (34 Gbps APElink and 2.5 Gbps deterministic latency KM3link) channels, but its modularity allows for a straightforward inclusion of other link technologies.
To avoid host OS intervention on data stream and remove a possible source of jitter, the design includes a network/transport layer offload module with cycle-accurate, upper-bound latency, supporting UDP, KM3link Time Division Multiplexing and APElink protocols.
After NaNet architecture description and its latency/bandwidth characterization for all supported links, two real world use cases will be presented: the GPU-based low level trigger for the RICH detector in the NA62 experiment at CERN and the on-/off-shore data link for KM3 underwater neutrino telescope.
Submitted 13 June, 2014;
originally announced June 2014.
-
NaNet: a low-latency NIC enabling GPU-based, real-time low level trigger systems
Authors:
Roberto Ammendola,
Andrea Biagioni,
Riccardo Fantechi,
Ottorino Frezza,
Gianluca Lamanna,
Francesca Lo Cicero,
Alessandro Lonardo,
Pier Stanislao Paolucci,
Felice Pantaleo,
Roberto Piandani,
Luca Pontisso,
Davide Rossetti,
Francesco Simula,
Marco Sozzi,
Laura Tosoratto,
Piero Vicini
Abstract:
We implemented the NaNet FPGA-based PCIe Gen2 GbE/APElink NIC, featuring GPUDirect RDMA capabilities and UDP protocol management offloading. NaNet is able to receive a UDP input data stream from its GbE interface and redirect it, without any intermediate buffering or CPU intervention, to the memory of a Fermi/Kepler GPU hosted on the same PCIe bus, provided that the two devices share the same upstream root complex. Synthetic benchmarks for latency and bandwidth are presented. We describe how NaNet can be employed in the prototype of the GPU-based RICH low-level trigger processor of the NA62 CERN experiment, to implement the data link between the TEL62 readout boards and the low level trigger processor. Results for the throughput and latency of the integrated system are presented and discussed.
Submitted 22 November, 2013; v1 submitted 5 November, 2013;
originally announced November 2013.