
doi 10.1145/3624062.3624231

Short reasons for long vectors in HPC CPUs: a study based on RISC-V

Authors: Pablo Vizcaino, Georgios Ieronymakis, Nikolaos Dimou, Vassilis Papaefstathiou, Jesus Labarta, Filippo Mantovani

Abstract: For years, SIMD/vector units have enhanced the capabilities of modern CPUs in High-Performance Computing (HPC) and mobile technology. Typical commercially-available SIMD units process up to 8 double-precision elements with one instruction. The optimal vector width and its impact on CPU throughput due to memory latency and bandwidth remain challenging research areas. This study examines the behavior of four computational kernels on a RISC-V core connected to a customizable vector unit, capable of operating on up to 256 double-precision elements per instruction. The four codes have been purposefully selected to represent non-dense workloads: SpMV, BFS, PageRank, FFT. The experimental setup allows us to measure their performance while varying the vector length, the memory latency, and bandwidth. Our results not only show that larger vector lengths allow for better tolerance of limitations in the memory subsystem but also offer hope to code developers beyond dense linear algebra.

Submitted 11 November, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis Denver CO USA November 12 - 17, 2023

arXiv:2307.09371 [pdf, other]

The ExaNeSt Prototype: Evaluation of Efficient HPC Communication Hardware in an ARM-based Multi-FPGA Rack

Authors: Manolis Ploumidis, Fabien Chaix, Nikolaos Chrysos, Marios Assiminakis, Vassilis Flouris, Nikolaos Kallimanis, Nikolaos Kossifidis, Michael Nikoloudakis, Polydoros Petrakis, Nikolaos Dimou, Michael Gianioudis, George Ieronymakis, Aggelos Ioannou, George Kalokerinos, Pantelis Xirouchakis, George Ailamakis, Astrinos Damianakis, Michael Ligerakis, Ioannis Makris, Theocharis Vavouris, Manolis Katevenis, Vassilis Papaefstathiou, Manolis Marazakis, Iakovos Mavroidis

Abstract: We present and evaluate the ExaNeSt Prototype, a liquid-cooled rack prototype consisting of 256 Xilinx ZU9EG MPSoCs, 4 TBytes of DRAM, 16 TBytes of SSD, and configurable interconnection 10-Gbps hardware. We developed this testbed in 2016-2019 to validate the flexibility of FPGAs for experimenting with efficient hardware support for HPC communication among tens of thousands of processors and accelerators in the quest towards Exascale systems and beyond. We present our key design choices regarding overall system architecture, PCBs and runtime software, and summarize insights resulting from measurement and analysis. Of particular note, our custom interconnect includes a low-cost low-latency network interface, offering user-level zero-copy RDMA, which we have tightly coupled with the ARMv8 processors in the MPSoCs. We have developed a system software runtime on top of these features, and have been able to run MPI. We have evaluated our testbed through MPI microbenchmarks, mini, and full MPI applications. Single hop, one way latency is 1.3 μs; approximately 0.47 μs out of these are attributed to network interface and the user-space library that exposes its functionality to the runtime. Latency over longer paths increases as expected, reaching 2.55 μs for a five-hop path. Bandwidth tests show that, for a single hop, link utilization reaches 82% of the theoretical capacity. Microbenchmarks based on MPI collectives reveal that broadcast latency scales as expected when the number of participating ranks increases. We also implemented a custom Allreduce accelerator in the network interface, which reduces the latency of such collectives by up to 88%. We assess performance scaling through weak and strong scaling tests for HPCG, LAMMPS, and the miniFE mini application; for all these tests, parallelization efficiency is at least 69%.

Submitted 18 July, 2023; originally announced July 2023.

Comments: 45 pages, 23 figures

Report number: TR-488

arXiv:2306.01797 [pdf, other]

Software Development Vehicles to enable extended and early co-design: a RISC-V and HPC case of study

Authors: Filippo Mantovani, Pablo Vizcaino, Fabio Banchelli, Marta Garcia-Gasulla, Roger Ferrer, Giorgos Ieronymakis, Nikos Dimou, Vassilis Papaefstathiou, Jesus Labarta

Abstract: Prototyping HPC systems with low-to-mid technology readiness level (TRL) systems is critical for providing feedback to hardware designers, the system software team (e.g., compiler developers), and early adopters from the scientific community. The typical approach to hardware design and HPC system prototyping often limits feedback or only allows it at a late stage. In this paper, we present a set of tools for co-designing HPC systems, called software development vehicles (SDV). We use an innovative RISC-V design as a demonstrator, which includes a scalar CPU and a vector processing unit capable of operating large vectors up to 16 kbits. We provide an incremental methodology and early tangible evidence of the co-design process that provide feedback to improve both architecture and system software at a very early stage of system development.

Submitted 1 June, 2023; originally announced June 2023.

Comments: Presented at the "First International workshop on RISC-V for HPC" co-located with ISC23 in Hamburg

arXiv:2302.03066 [pdf, ps, other]

On the equivalence between the minimax theorem and strong duality of conic linear programming

Authors: Nikos Dimou

Abstract: We prove the almost equivalence between two-player zero-sum games and conic linear programming problems in reflexive Banach spaces. The previous fundamental results of von Neumann, Dantzig, Adler, and von Stengel, regarding the equivalence between finite games with strategy sets defined over $\mathbb{R}^n$, and linear programming, are therefore generalized to the infinite-dimensional case. In fact, we show that for every two-player zero-sum game with a bilinear function of the form $u(x,y)=\langle y,Ax\rangle$, for some linear operator $A$, and strategy sets that represent bases of convex cones, the minimax theorem holds, and its game value and Nash equilibria can be computed by solving a primal-dual pair of conic linear problems. Conversely, the minimax theorem for the same class of games "almost always" implies strong duality of conic linear programming. The main results are applied to a number of infinite zero-sum games, whose classes include those of semi-infinite, semidefinite, time-continuous, quantum, polynomial, and homogeneous separable games.

Submitted 5 June, 2024; v1 submitted 6 February, 2023; originally announced February 2023.

Comments: Replaced a very early version which was uploaded while the author was still an undergraduate student; 28 pages

MSC Class: 90C25; 91A05 (Primary); 90C47; 91A70 (Secondary)

arXiv:2003.03283 [pdf, other]

Performance and energy footprint assessment of FPGAs and GPUs on HPC systems using Astrophysics application

Authors: David Goz, Georgios Ieronymakis, Vassilis Papaefstathiou, Nikolaos Dimou, Sara Bertocco, Francesco Simula, Antonio Ragagnin, Luca Tornatore, Igor Coretti, Giuliano Taffoni

Abstract: New challenges in Astronomy and Astrophysics (AA) are urging the need for a large number of exceptionally computationally intensive simulations. "Exascale" (and beyond) computational facilities are mandatory to address the size of theoretical problems and data coming from the new generation of observational facilities in AA. Currently, the High Performance Computing (HPC) sector is undergoing a profound phase of innovation, in which the primary challenge to the achievement of the "Exascale" is the power-consumption. The goal of this work is to give some insights about performance and energy footprint of contemporary architectures for a real astrophysical application in an HPC context. We use a state-of-the-art N-body application that we re-engineered and optimized to exploit the heterogeneous underlying hardware fully. We quantitatively evaluate the impact of computation on energy consumption when running on four different platforms. Two of them represent the current HPC systems (Intel-based and equipped with NVIDIA GPUs), one is a micro-cluster based on ARM-MPSoC, and one is a "prototype towards Exascale" equipped with ARM-MPSoCs tightly coupled with FPGAs. We investigate the behavior of the different devices where the high-end GPUs excel in terms of time-to-solution while MPSoC-FPGA systems outperform GPUs in power consumption. Our experience reveals that considering FPGAs for computationally intensive application seems very promising, as their performance is improving to meet the requirements of scientific applications. This work can be a reference for future platforms development for astrophysics applications where computationally intensive calculations are required.

Submitted 10 April, 2020; v1 submitted 6 March, 2020; originally announced March 2020.

Comments: 15 pages, 4 figures, 3 tables; Preprint (V2) submitted to MDPI (Special Issue: Energy-Efficient Computing on Parallel Architectures)

arXiv:1910.14496 [pdf, other]

Direct N-body application on low-power and energy-efficient parallel architectures

Authors: D. Goz, G. Ieronymakis, V. Papaefstathiou, N. Dimou, S. Bertocco, A. Ragagnin, L. Tornatore, G. Taffoni, I. Coretti

Abstract: The aim of this work is to quantitatively evaluate the impact of computation on the energy consumption on ARM MPSoC platforms, exploiting CPUs, embedded GPUs and FPGAs. One of them possibly represents the future of High Performance Computing systems: a prototype of an Exascale supercomputer. Performance and energy measurements are made using a state-of-the-art direct $N$-body code from the astrophysical domain. We provide a comparison of the time-to-solution and energy delay product metrics, for different software configurations. We have shown that FPGA technologies can be used for application kernel acceleration and are emerging as a promising alternative to "traditional" technologies for HPC, which purely focus on peak-performance rather than on power-efficiency.

Submitted 31 October, 2019; originally announced October 2019.

Comments: 10 pages, 5 figures, 2 tables; The final publication will be available at IOS Press

Showing 1–6 of 6 results for author: Dimou, N