-
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads
Authors:
Jens Domke,
Emil Vatai,
Balazs Gerofi,
Yuetsu Kodama,
Mohamed Wahib,
Artur Podobas,
Sparsh Mittal,
Miquel Pericàs,
Lingqi Zhang,
Peng Chen,
Aleksandr Drozd,
Satoshi Matsuoka
Abstract:
Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method oblivious to the memory subsystem to gauge the upper-bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of experiments involving a broad set of proxy-applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56x for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.
Submitted 16 October, 2023; v1 submitted 5 April, 2022; originally announced April 2022.
-
Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?
Authors:
Jens Domke,
Emil Vatai,
Aleksandr Drozd,
Peng Chen,
Yosuke Oyama,
Lingqi Zhang,
Shweta Salaria,
Daichi Mukunoki,
Artur Podobas,
Mohamed Wahib,
Satoshi Matsuoka
Abstract:
Matrix engines or units, in different forms and affinities, are becoming a reality in modern processors; CPUs and otherwise. The current and dominant algorithmic approach to Deep Learning merits the commercial investments in these units, and deduced from the No.1 benchmark in supercomputing, namely High Performance Linpack, one would expect an awakened enthusiasm by the HPC community, too.
Hence, our goal is to identify the practical added benefits for HPC and machine learning applications by having access to matrix engines. For this purpose, we perform an in-depth survey of software stacks, proxy applications and benchmarks, and historical batch job records. We provide a cost-benefit analysis of matrix engines, both asymptotically and in conjunction with state-of-the-art processors. While our empirical data will temper the enthusiasm, we also outline opportunities to misuse these dense matrix-multiplication engines if they come for free.
Submitted 27 February, 2021; v1 submitted 27 October, 2020; originally announced October 2020.
-
Cache optimized linear sieve
Authors:
A. Járai,
E. Vatai
Abstract:
Sieving is essential in different number theoretical algorithms. Sieving with large primes violates locality of memory access, thus degrading performance. Our suggestion on how to tackle this problem is to use cyclic data structures in combination with in-place bucket-sort. We present our results on the implementation of the sieve of Eratosthenes, using these ideas, which show that this approach is more robust and less affected by slow memory.
Submitted 14 November, 2011; originally announced November 2011.