-
LLAMA: The Low-Level Abstraction For Memory Access
Authors:
Bernhard Manfred Gruber,
Guilherme Amadio,
Jakob Blomer,
Alexander Matthes,
René Widera,
Michael Bussmann
Abstract:
The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. This can be accomplished v…
▽ More
The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. This can be accomplished via a zero-runtime-overhead abstraction layer, underneath which memory layouts can be freely exchanged.
We present the Low-Level Abstraction of Memory Access (LLAMA), a C++ library that provides such a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++ compliant methods for defining and switching custom memory layouts for user-defined data types. The library is extensible with third-party allocators.
Providing two close-to-life examples, we show that the LLAMA-generated AoS (Array of Structs) and SoA (Struct of Arrays) layouts produce identical code with the same performance characteristics as manually written data structures. Integrations into the SPEC CPU\textsuperscript{\textregistered} lbm benchmark and the particle-in-cell simulation PIConGPU demonstrate LLAMA's abilities in real-world applications. LLAMA's layout-aware copy routines can significantly speed up transfer and reshuffling of data between layouts compared with naive element-wise copying.
LLAMA provides a novel tool for the development of high-performance C++ applications in a heterogeneous environment.
△ Less
Submitted 9 March, 2022; v1 submitted 8 June, 2021;
originally announced June 2021.
-
Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library
Authors:
Alexander Matthes,
René Widera,
Erik Zenker,
Benjamin Worpitz,
Axel Huebl,
Michael Bussmann
Abstract:
We present an analysis on optimizing performance of a single C++11 source code using the Alpaka hardware abstraction library. For this we use the general matrix multiplication (GEMM) algorithm in order to show that compilers can optimize Alpaka code effectively when tuning key parameters of the algorithm. We do not intend to rival existing, highly optimized DGEMM versions, but merely choose this e…
▽ More
We present an analysis on optimizing performance of a single C++11 source code using the Alpaka hardware abstraction library. For this we use the general matrix multiplication (GEMM) algorithm in order to show that compilers can optimize Alpaka code effectively when tuning key parameters of the algorithm. We do not intend to rival existing, highly optimized DGEMM versions, but merely choose this example to prove that Alpaka allows for platform-specific tuning with a single source code. In addition we analyze the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka. We specifically test the code for bleeding edge architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL) and Haswell architecture as well as IBM's Power8 system. On some of these we are able to reach almost 50\% of the peak floating point operation performance using the aforementioned means. When adding compiler-specific #pragmas we are able to reach 5 TFLOPS/s on a P100 and over 1 TFLOPS/s on a KNL system.
△ Less
Submitted 30 June, 2017;
originally announced June 2017.
-
On the Scalability of Data Reduction Techniques in Current and Upcoming HPC Systems from an Application Perspective
Authors:
Axel Huebl,
Rene Widera,
Felix Schmitt,
Alexander Matthes,
Norbert Podhorszki,
Jong Youl Choi,
Scott Klasky,
Michael Bussmann
Abstract:
We implement and benchmark parallel I/O methods for the fully-manycore driven particle-in-cell code PIConGPU. Identifying throughput and overall I/O size as a major challenge for applications on today's and future HPC systems, we present a scaling law characterizing performance bottlenecks in state-of-the-art approaches for data reduction. Consequently, we propose, implement and verify multi-threa…
▽ More
We implement and benchmark parallel I/O methods for the fully-manycore driven particle-in-cell code PIConGPU. Identifying throughput and overall I/O size as a major challenge for applications on today's and future HPC systems, we present a scaling law characterizing performance bottlenecks in state-of-the-art approaches for data reduction. Consequently, we propose, implement and verify multi-threaded data-transformations for the I/O library ADIOS as a feasible way to trade underutilized host-side compute potential on heterogeneous systems for reduced I/O latency.
△ Less
Submitted 1 June, 2017;
originally announced June 2017.
-
In situ, steerable, hardware-independent and data-structure agnostic visualization with ISAAC
Authors:
Alexander Matthes,
Axel Huebl,
René Widera,
Sebastian Grottel,
Stefan Gumhold,
Michael Bussmann
Abstract:
The computation power of supercomputers grows faster than the bandwidth of their storage and network. Especially applications using hardware accelerators like Nvidia GPUs cannot save enough data to be analyzed in a later step. There is a high risk of loosing important scientific information. We introduce the in situ template library ISAAC which enables arbitrary applications like scientific simula…
▽ More
The computation power of supercomputers grows faster than the bandwidth of their storage and network. Especially applications using hardware accelerators like Nvidia GPUs cannot save enough data to be analyzed in a later step. There is a high risk of loosing important scientific information. We introduce the in situ template library ISAAC which enables arbitrary applications like scientific simulations to live visualize their data without the need of deep copy operations or data transformation using the very same compute node and hardware accelerator the data is already residing on. Arbitrary meta data can be added to the renderings and user defined steering commands can be asynchronously sent back to the running application. Using an aggregating server, ISAAC streams the interactive visualization video and enables user to access their applications from everywhere.
△ Less
Submitted 28 November, 2016;
originally announced November 2016.