Search | arXiv e-print repository

A Mess of Memory System Benchmarking, Simulation and Application Profiling

Authors: Pouya Esmaili-Dokht, Francesco Sgherzi, Valeria Soldera Girelli, Isaac Boixaderas, Mariana Carmin, Alireza Momeni, Adria Armejach, Estanislao Mercadal, German Llort, Petar Radojkovic, Miquel Moreto, Judit Gimenez, Xavier Martorell, Eduard Ayguade, Jesus Labarta, Emanuele Confalonieri, Rishabh Dubey, Jason Adlard

Abstract: The Memory stress (Mess) framework provides a unified view of the memory system benchmarking, simulation and application profiling. The Mess benchmark provides a holistic and detailed memory system characterization. It is based on hundreds of measurements that are represented as a family of bandwidth--latency curves. The benchmark increases the coverage of all the previous tools and leads to new f… ▽ More The Memory stress (Mess) framework provides a unified view of the memory system benchmarking, simulation and application profiling. The Mess benchmark provides a holistic and detailed memory system characterization. It is based on hundreds of measurements that are represented as a family of bandwidth--latency curves. The benchmark increases the coverage of all the previous tools and leads to new findings in the behavior of the actual and simulated memory systems. We deploy the Mess benchmark to characterize Intel, AMD, IBM, Fujitsu, Amazon and NVIDIA servers with DDR4, DDR5, HBM2 and HBM2E memory. The Mess memory simulator uses bandwidth--latency concept for the memory performance simulation. We integrate Mess with widely-used CPUs simulators enabling modeling of all high-end memory technologies. The Mess simulator is fast, easy to integrate and it closely matches the actual system performance. By design, it enables a quick adoption of new memory technologies in hardware simulators. Finally, the Mess application profiling positions the application in the bandwidth--latency space of the target memory system. This information can be correlated with other application runtime activities and the source code, leading to a better overall understanding of the application's behavior. The current Mess benchmark release covers all major CPU and GPU ISAs, x86, ARM, Power, RISC-V, and NVIDIA's PTX. We also release as open source the ZSim, gem5 and OpenPiton Metro-MPI integrated with the Mess memory simulator for DDR4, DDR5, Optane, HBM2, HBM2E and CXL memory expanders. The Mess application profiling is already integrated into a suite of production HPC performance analysis tools. △ Less

Submitted 16 May, 2024; originally announced May 2024.

Comments: 17 pages

arXiv:2401.17984 [pdf, other]

doi 10.1145/3642921.3642928

Makinote: An FPGA-Based HW/SW Platform for Pre-Silicon Emulation of RISC-V Designs

Authors: Elias Perdomo, Alexander Kropotov, Francelly Cano, Syed Zafar, Teresa Cervero, Xavier Martorell, Behzad Salami

Abstract: Emulating chip functionality before silicon production is crucial, especially with the increasing prevalence of RISC-V-based designs. FPGAs are promising candidates for such purposes due to their high-speed and reconfigurable architecture. In this paper, we introduce our Makinote, an FPGA-based Cluster platform, hosted at Barcelona Supercomputing Center (BSC-CNS), which is composed of a large numb… ▽ More Emulating chip functionality before silicon production is crucial, especially with the increasing prevalence of RISC-V-based designs. FPGAs are promising candidates for such purposes due to their high-speed and reconfigurable architecture. In this paper, we introduce our Makinote, an FPGA-based Cluster platform, hosted at Barcelona Supercomputing Center (BSC-CNS), which is composed of a large number of FPGAs (in total 96 AMD/Xilinx Alveo U55c) to emulate massive size RTL designs (up to 750M ASIC cells). In addition, we introduce our FPGA shell as a powerful tool to facilitate the utilization of such a large FPGA cluster with minimal effort needed by the designers. The proposed FPGA shell provides an easy-to-use interface for the RTL developers to rapidly port such design into several FPGAs by automatically connecting to the necessary ports, e.g., PCIe Gen4, DRAM (DDR4 and HBM), ETH10g/100g. Moreover, specific drivers for exploiting RISC-V based architectures are provided within the set of tools associated with the FPGA shell. We release the tool online for further extensions. We validate the efficiency of our hardware platform (i.e., FPGA cluster) and the software tool (i.e., FPGA Shell) by emulating a RISC-V processor and experimenting HPC Challenge application running on 32 FPGAs. Our results demonstrate that the performance improves by 8 times over the single-FPGA case. △ Less

Submitted 5 February, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

Comments: 7 pages, 5 figures, presented in Rapid Simulation and Performance Evaluation for Design 2024 (RAPIDO24) and published in ACM Proceedings of Rapid Simulation and Performance Evaluation for Design

arXiv:2107.12371 [pdf, other]

An FPGA cached sparse matrix vector product (SpMV) for unstructured computational fluid dynamics simulations

Authors: Guillermo Oyarzun, Daniel Peyrolon, Carlos Alvarez, Xavier Martorell

Abstract: Field Programmable Gate Arrays generate algorithmic specific architectures that improve the code's FLOP per watt ratio. Such devices are re-gaining interest due to the rise of new tools that facilitate their programming, such as OmpSs. The computational fluid dynamics community is always investigating new architectures that can improve its algorithm's performance. Commonly, those algorithms have a… ▽ More Field Programmable Gate Arrays generate algorithmic specific architectures that improve the code's FLOP per watt ratio. Such devices are re-gaining interest due to the rise of new tools that facilitate their programming, such as OmpSs. The computational fluid dynamics community is always investigating new architectures that can improve its algorithm's performance. Commonly, those algorithms have a low arithmetic intensity and only reach a small percentage of the peak performance. The sparse matrix-vector multiplication is one of the most time-consuming operations on unstructured simulations. The matrix's sparsity pattern determines the indirect memory accesses of the multiplying vector. This data path is hard to predict, making traditional implementations fail. In this work, we present an FPGA architecture that maximizes the vector's re-usability by introducing a cache-like architecture. The cache is implemented as a circular list that maintains the BRAM vector components while needed. Following this strategy, up to 16 times of acceleration is obtained compared to a naive implementation of the algorithm. △ Less

Submitted 24 July, 2021; originally announced July 2021.

arXiv:2106.12485 [pdf, other]

doi 10.1007/978-3-030-85665-6_30

Particle-In-Cell Simulation using Asynchronous Tasking

Authors: Nicolas Guidotti, Pedro Ceyrat, João Barreto, José Monteiro, Rodrigo Rodrigues, Ricardo Fonseca, Xavier Martorell, Antonio J. Peña

Abstract: Recently, task-based programming models have emerged as a prominent alternative among shared-memory parallel programming paradigms. Inherently asynchronous, these models provide native support for dynamic load balancing and incorporate data flow concepts to selectively synchronize the tasks. However, tasking models are yet to be widely adopted by the HPC community and their effective advantages wh… ▽ More Recently, task-based programming models have emerged as a prominent alternative among shared-memory parallel programming paradigms. Inherently asynchronous, these models provide native support for dynamic load balancing and incorporate data flow concepts to selectively synchronize the tasks. However, tasking models are yet to be widely adopted by the HPC community and their effective advantages when applied to non-trivial, real-world HPC applications are still not well comprehended. In this paper, we study the parallelization of a production electromagnetic particle-in-cell (EM-PIC) code for kinetic plasma simulations exploring different strategies using asynchronous task-based models. Our fully asynchronous implementation not only significantly outperforms a conventional, synchronous approach but also achieves near perfect scaling for 48 cores. △ Less

Submitted 29 August, 2021; v1 submitted 23 June, 2021; originally announced June 2021.

Comments: Published on the 27th European Conference on Parallel and Distributed Computing (Euro-Par 2021)

Journal ref: Euro-Par 2021: Parallel Processing. Lecture Notes in Computer Science, vol 12820, pp. 482-498

arXiv:2102.02582 [pdf, ps, other]

Parallelware Tools: An Experimental Evaluation on POWER Systems

Authors: Manuel Arenaz, Xavier Martorell

Abstract: Static code analysis tools are designed to aid software developers to build better quality software in less time, by detecting defects early in the software development life cycle. Even the most experienced developer regularly introduces coding defects. Identifying, mitigating and resolving defects is an essential part of the software development process, but frequently defects can go undetected.… ▽ More Static code analysis tools are designed to aid software developers to build better quality software in less time, by detecting defects early in the software development life cycle. Even the most experienced developer regularly introduces coding defects. Identifying, mitigating and resolving defects is an essential part of the software development process, but frequently defects can go undetected. One defect can lead to a minor malfunction or cause serious security and safety issues. This is magnified in the development of the complex parallel software required to exploit modern heterogeneous multicore hardware. Thus, there is an urgent need for new static code analysis tools to help in building better concurrent and parallel software. The paper reports preliminary results about the use of Appentra's Parallelware technology to address this problem from the following three perspectives: finding concurrency issues in the code, discovering new opportunities for parallelization in the code, and generating parallel-equivalent codes that enable tasks to run faster. The paper also presents experimental results using well-known scientific codes and POWER systems. △ Less

Submitted 4 February, 2021; originally announced February 2021.

arXiv:2009.03066 [pdf, other]

doi 10.1016/j.parco.2020.102664

Asynchronous Runtime with Distributed Manager for Task-based Programming Models

Authors: Jaume Bosch, Carlos Álvarez, Daniel Jiménez-González, Xavier Martorell, Eduard Ayguadé

Abstract: Parallel task-based programming models, like OpenMP, allow application developers to easily create a parallel version of their sequential codes. The standard OpenMP 4.0 introduced the possibility of describing a set of data dependences per task that the runtime uses to order the tasks execution. This order is calculated using shared graphs, which are updated by all threads in exclusive access usin… ▽ More Parallel task-based programming models, like OpenMP, allow application developers to easily create a parallel version of their sequential codes. The standard OpenMP 4.0 introduced the possibility of describing a set of data dependences per task that the runtime uses to order the tasks execution. This order is calculated using shared graphs, which are updated by all threads in exclusive access using synchronization mechanisms (locks) to ensure the dependence management correctness. The contention in the access to these structures becomes critical in many-core systems because several threads may be wasting computation resources waiting their turn. This paper proposes an asynchronous management of the runtime structures, like task dependence graphs, suitable for task-based programming model runtimes. In such organization, the threads request actions to the runtime instead of doing them directly. The requests are then handled by a distributed runtime manager (DDAST) which does not require dedicated resources. Instead, the manager uses the idle threads to modify the runtime structures. The paper also presents an implementation, analysis and performance evaluation of such runtime organization. The performance results show that the proposed asynchronous organization outperforms the speedup obtained by the original runtime for different benchmarks and different many-core architectures. △ Less

Submitted 8 September, 2020; v1 submitted 7 September, 2020; originally announced September 2020.

Comments: 2020 Parallel Computing

Journal ref: Parallel Computing, Volume 97, 2020

arXiv:1912.09844 [pdf, other]

Hurry-up: Scaling Web Search on Big/Little Multi-core Architectures

Authors: Rajiv Nishtala, Vinicius Petrucci, Paul Carpenter, Xavier Martorell

Abstract: Heterogeneous multi-core systems such as big/little architectures have been introduced as an attractive server design option with the potential to improve performance under power constraints in data centres. Since both big high-performing and little power-efficient cores can run on the same system sharing the workload processing, thread map**/scheduling turns out to be much more challenging. Thi… ▽ More Heterogeneous multi-core systems such as big/little architectures have been introduced as an attractive server design option with the potential to improve performance under power constraints in data centres. Since both big high-performing and little power-efficient cores can run on the same system sharing the workload processing, thread map**/scheduling turns out to be much more challenging. This is particularly hard when considering the different trade-offs shaped by the heterogeneous cores on the quality-of-service (expressed as tail latency) experienced by user-facing applications, such as Web Search. In this work, we present Hurry-up, a runtime thread map** solution designed to select individual requests to run on the most appropriate heterogeneous cores to improve tail latency. Hurry-up accelerates compute-intensive requests on big cores, while letting less intensive threads to execute on little cores. We implement and deploy Hurry-up on a real 64-bit big/little architecture (ARM Juno), and show that, compared to a conservative policy on Linux, Hurry-up reduces the server tail latency by 39.5% (mean). △ Less

Submitted 20 December, 2019; originally announced December 2019.

Comments: 7 pages

arXiv:1912.01563 [pdf, other]

LEGaTO: Low-Energy, Secure, and Resilient Toolset for Heterogeneous Computing

Authors: B. Salami, K. Parasyris, A. Cristal, O. Unsal, X. Martorell, P. Carpenter, R. De La Cruz, L. Bautista, D. Jimenez, C. Alvarez, S. Nabavi, S. Madonar, M. Pericas, P. Trancoso, M. Abduljabbar, J. Chen, P. N. Soomro, M Manivannan, M. Berge, S. Krupop, F. Klawonn, Al Mekhlafi, S. May, T. Becker, G. Gaydadjiev , et al. (20 additional authors not shown)

Abstract: The LEGaTO project leverages task-based programming models to provide a software ecosystem for Made in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC, balanced with the security and resilience challenges. LEGaTO is an ongoing three-year EU H2020 project started in… ▽ More The LEGaTO project leverages task-based programming models to provide a software ecosystem for Made in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC, balanced with the security and resilience challenges. LEGaTO is an ongoing three-year EU H2020 project started in December 2017. △ Less

Submitted 1 December, 2019; originally announced December 2019.

Comments: 6 pages, 9 figures

arXiv:1508.06830 [pdf]

Coarse-Grain Performance Estimator for Heterogeneous Parallel Computing Architectures like Zynq All-Programmable SoC

Authors: Daniel Jiménez-González, Carlos Álvarez, Antonio Filgueras, Xavier Martorell, Jan Langer, Juanjo Noguera, Kees Vissers

Abstract: Heterogeneous computing is emerging as a mandatory requirement for power-efficient system design. With this aim, modern heterogeneous platforms like Zynq All-Programmable SoC, that integrates ARM-based SMP and programmable logic, have been designed. However, those platforms introduce large design cycles consisting on hardware/software partitioning, decisions on granularity and number of hardware a… ▽ More Heterogeneous computing is emerging as a mandatory requirement for power-efficient system design. With this aim, modern heterogeneous platforms like Zynq All-Programmable SoC, that integrates ARM-based SMP and programmable logic, have been designed. However, those platforms introduce large design cycles consisting on hardware/software partitioning, decisions on granularity and number of hardware accelerators, hardware/software integration, bitstream generation, etc. This paper presents a performance parallel heterogeneous estimation for systems where hardware/software co-design and run-time heterogeneous task scheduling are key. The results show that the programmer can quickly decide, based only on her/his OmpSs (OpenMP + extensions) application, which is the co-design that achieves nearly optimal heterogeneous parallel performance, based on the methodology presented and considering only synthesis estimation results. The methodology presented reduces the programmer co-design decision from hours to minutes and shows high potential on hardware/software heterogeneous parallel performance estimation on the Zynq All-Programmable SoC. △ Less

Submitted 27 August, 2015; originally announced August 2015.

Comments: Presented at Second International Workshop on FPGAs for Software Programmers (FSP 2015) (arXiv:1508.06320)

Report number: FSP/2015/07

Showing 1–9 of 9 results for author: Martorell, X