Showing 1–2 of 2 results for author: Hornich, J

Search v0.5.6 released 2020-02-24

arXiv:1906.08138

cs.PF

doi: 10.14529/jsfi190301

Collecting and Presenting Reproducible Intranode Stencil Performance: INSPECT

Authors: Julian Hornich, Julian Hammer, Georg Hager, Thomas Gruber, Gerhard Wellein

Abstract: Stencil algorithms have been receiving considerable interest in HPC research for decades. The techniques used to approach multi-core stencil performance modeling and engineering span basic runtime measurements, elaborate performance models, detailed hardware counter analysis, and thorough scaling behavior evaluation. Due to the plurality of approaches and stencil patterns, we set out to develop a generalizable methodology for reproducible measurements accompanied by state-of-the-art performance models. Our open-source toolchain and collected results are publicly available in the "Intranode Stencil Performance Evaluation Collection" (INSPECT). We present the underlying methodologies, models, and tools involved in gathering and documenting the performance behavior of a collection of typical stencil patterns across multiple architectures and hardware configuration options. Our aim is to endow performance-aware application developers with reproducible baseline performance data and validated models to initiate a well-defined process of performance assessment and optimization.

Submitted 2 July, 2019; v1 submitted 19 June, 2019; originally announced June 2019.

arXiv:1510.05218

cs.CE cs.DC cs.PF

Optimization of an electromagnetics code with multicore wavefront diamond blocking and multi-dimensional intra-tile parallelization

Authors: Tareq M. Malas, Julian Hornich, Georg Hager, Hatem Ltaief, Christoph Pflaum, David E. Keyes

Abstract: Understanding and optimizing the properties of solar cells is becoming a key issue in the search for alternatives to nuclear and fossil energy sources. A theoretical analysis via numerical simulations involves solving Maxwell's equations in discretized form and typically requires substantial computing effort. We start from a hybrid-parallel (MPI+OpenMP) production code that implements the Time Harmonic Inverse Iteration Method (THIIM) with Finite-Difference Frequency Domain (FDFD) discretization. Although this algorithm has the characteristics of a strongly bandwidth-bound stencil update scheme, it is significantly different from the popular stencil types that have been exhaustively studied in the high-performance computing literature to date. We apply a recently developed stencil optimization technique, multicore wavefront diamond tiling with multi-dimensional cache block sharing, and describe in detail the peculiarities that need to be considered due to the special stencil structure. Concurrency in updating the components of the electric and magnetic fields provides an additional level of parallelism. The dependence of the cache size requirement of the optimized code on the blocking parameters is modeled accurately, and an auto-tuner searches for optimal configurations in the remaining parameter space. We were able to completely decouple the execution from the memory bandwidth bottleneck, accelerating the implementation by a factor of three to four compared to an optimal implementation with pure spatial blocking on an 18-core Intel Haswell CPU.

Submitted 18 October, 2015; originally announced October 2015.
