
Showing 1–15 of 15 results for author: Treibig, J

  1. Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model

    Authors: Holger Stengel, Jan Treibig, Georg Hager, Gerhard Wellein

    Abstract: Stencil algorithms on regular lattices appear in many fields of computational science, and much effort has been put into optimized implementations. Such activities are usually not guided by performance models that provide estimates of expected speedup. Understanding the performance properties and bottlenecks by performance modeling enables a clear view on promising optimization opportunities. In t…

    Submitted 17 January, 2015; v1 submitted 18 October, 2014; originally announced October 2014.

    Comments: 10 pages, 8 figures. Added Roofline comparison and other minor improvements

  2. Comparing the Performance of Different x86 SIMD Instruction Sets for a Medical Imaging Application on Modern Multi- and Manycore Chips

    Authors: Johannes Hofmann, Jan Treibig, Georg Hager, Gerhard Wellein

    Abstract: Single Instruction, Multiple Data (SIMD) vectorization is a major driver of performance in current architectures, and is mandatory for achieving good performance with codes that are limited by instruction throughput. We investigate the efficiency of different SIMD-vectorized implementations of the RabbitCT benchmark. RabbitCT performs 3D image reconstruction by back projection, a vital operation i…

    Submitted 29 January, 2014; originally announced January 2014.

    Comments: arXiv admin note: text overlap with arXiv:1401.3615

  3. arXiv:1401.3615  [pdf, other]

    cs.DC cs.CV cs.PF

    Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator

    Authors: Johannes Hofmann, Jan Treibig, Georg Hager, Gerhard Wellein

    Abstract: We examine the Xeon Phi, which is based on Intel's Many Integrated Cores architecture, for its suitability to run the FDK algorithm--the most commonly used algorithm to perform the 3D image reconstruction in cone-beam computed tomography. We study the challenges of efficiently parallelizing the application and means to enable sensible data sharing between threads despite the lack of a shared last…

    Submitted 17 December, 2013; originally announced January 2014.

  4. arXiv:1304.7664  [pdf, other]

    cs.PF cs.DC

    Chip-level and multi-node analysis of energy-optimized lattice-Boltzmann CFD simulations

    Authors: Markus Wittmann, Georg Hager, Thomas Zeiser, Jan Treibig, Gerhard Wellein

    Abstract: Memory-bound algorithms show complex performance and energy consumption behavior on multicore processors. We choose the lattice-Boltzmann method (LBM) on an Intel Sandy Bridge cluster as a prototype scenario to investigate if and how single-chip performance and power characteristics can be generalized to the highly parallel case. First we perform an analysis of a sparse-lattice LBM implementation…

    Submitted 22 May, 2015; v1 submitted 29 April, 2013; originally announced April 2013.

    Comments: 23 pages, 13 figures; post-peer-review version

  5. arXiv:1303.4538  [pdf, other]

    cs.PF cs.DC

    Optimization of FASTEST-3D for Modern Multicore Systems

    Authors: Christoph Scheit, Georg Hager, Jan Treibig, Stefan Becker, Gerhard Wellein

    Abstract: FASTEST-3D is an MPI-parallel finite-volume flow solver based on block-structured meshes that has been developed at the University of Erlangen-Nuremberg since the early 1990s. It can be used to solve the laminar or turbulent incompressible Navier-Stokes equations. Up to now its scalability was strongly limited by a rather rigid communication infrastructure, which led to a dominance of MPI time alr…

    Submitted 19 March, 2013; originally announced March 2013.

    Comments: 10 pages, 15 figures

  6. arXiv:1208.2908  [pdf, ps, other]

    cs.PF cs.DC

    Exploring performance and power properties of modern multicore chips via simple machine models

    Authors: Georg Hager, Jan Treibig, Johannes Habich, Gerhard Wellein

    Abstract: Modern multicore chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model t…

    Submitted 19 March, 2014; v1 submitted 14 August, 2012; originally announced August 2012.

    Comments: 23 pages, 10 figures. Typos corrected, DOI added

  7. Best practices for HPM-assisted performance engineering on modern multicore processors

    Authors: Jan Treibig, Georg Hager, Gerhard Wellein

    Abstract: Many tools and libraries employ hardware performance monitoring (HPM) on modern processors, and using this data for performance assessment and as a starting point for code optimizations is very popular. However, such data is only useful if it is interpreted with care, and if the right metrics are chosen for the right purpose. We demonstrate the sensible use of hardware performance counters in the…

    Submitted 17 June, 2012; originally announced June 2012.

    Comments: 10 pages, 2 figures

  8. Pushing the limits for medical image reconstruction on recent standard multicore processors

    Authors: Jan Treibig, Georg Hager, Hannes G. Hofmann, Joachim Hornegger, Gerhard Wellein

    Abstract: Volume reconstruction by backprojection is the computational bottleneck in many interventional clinical computed tomography (CT) applications. Today vendors in this field replace special purpose hardware accelerators by standard hardware like multicore chips and GPGPUs. Medical imaging algorithms are on the verge of employing High Performance Computing (HPC) technology, and are therefore an intere…

    Submitted 20 September, 2011; v1 submitted 27 April, 2011; originally announced April 2011.

    Comments: 13 pages, 9 figures. Revised and extended version

  9. LIKWID: Lightweight Performance Tools

    Authors: Jan Treibig, Georg Hager, Gerhard Wellein

    Abstract: Exploiting the performance of today's microprocessors requires intimate knowledge of the microarchitecture as well as an awareness of the ever-growing complexity in thread and cache topology. LIKWID is a set of command line utilities that addresses four key problems: Probing the thread and cache topology of a shared-memory node, enforcing thread-core affinity on a program, measuring performance co…

    Submitted 7 January, 2013; v1 submitted 26 April, 2011; originally announced April 2011.

    Comments: 12 pages

  10. arXiv:1104.1729  [pdf, ps, other]

    cs.PF cs.PL

    Expression Templates Revisited: A Performance Analysis of the Current ET Methodology

    Authors: Klaus Iglberger, Georg Hager, Jan Treibig, Ulrich Ruede

    Abstract: In the last decade, Expression Templates (ET) have gained a reputation as an efficient performance optimization tool for C++ codes. This reputation builds on several ET-based linear algebra frameworks focused on combining both elegant and high-performance C++ code. However, on closer examination the assumption that ETs are a performance optimization technique cannot be maintained. In this paper we…

    Submitted 9 April, 2011; originally announced April 2011.

    Comments: 16 pages, 7 figures

    Journal ref: SIAM Journal on Scientific Computing 34(2), C42-C69 (2012)

  11. arXiv:1006.3148  [pdf, other]

    cs.DC cs.PF

    Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters

    Authors: Markus Wittmann, Georg Hager, Jan Treibig, Gerhard Wellein

    Abstract: Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. Benchmark results are presented for three c…

    Submitted 16 June, 2010; originally announced June 2010.

    Comments: 16 pages, 10 figures

  12. arXiv:1004.4431  [pdf, ps, other]

    cs.DC cs.PF

    LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments

    Authors: Jan Treibig, Georg Hager, Gerhard Wellein

    Abstract: Exploiting the performance of today's processors requires intimate knowledge of the microarchitecture as well as an awareness of the ever-growing complexity in thread and cache topology. LIKWID is a set of command-line utilities that addresses four key problems: Probing the thread and cache topology of a shared-memory node, enforcing thread-core affinity on a program, measuring performance counter…

    Submitted 30 June, 2010; v1 submitted 26 April, 2010; originally announced April 2010.

    Comments: 10 pages, 11 figures. Some clarifications and corrections

  13. Efficient multicore-aware parallelization strategies for iterative stencil computations

    Authors: Jan Treibig, Gerhard Wellein, Georg Hager

    Abstract: Stencil computations consume a major part of runtime in many scientific simulation codes. As prototypes for this class of algorithms we consider the iterative Jacobi and Gauss-Seidel smoothers and aim at highly efficient parallel implementations for cache-based multicore architectures. Temporal cache blocking is a known advanced optimization technique, which can reduce the pressure on the memory b…

    Submitted 10 April, 2010; originally announced April 2010.

    Comments: 15 pages, 10 figures

    Journal ref: Journal of Computational Science 2, 130-137 (2011)

  14. arXiv:0910.4865  [pdf, other]

    cs.PF cs.AR

    Multi-core architectures: Complexities of performance prediction and the impact of cache topology

    Authors: Jan Treibig, Georg Hager, Gerhard Wellein

    Abstract: The balance metric is a simple approach to estimate the performance of bandwidth-limited loop kernels. However, applying the method to in-cache situations and modern multi-core architectures yields unsatisfactory results. This paper analyzes the influence of cache hierarchy design on performance predictions for bandwidth-limited loop kernels on current mainstream processors. We present a diagnost…

    Submitted 26 October, 2009; originally announced October 2009.

    Comments: 18 pages

  15. arXiv:0905.0792  [pdf, other]

    cs.PF cs.AR

    Introducing a Performance Model for Bandwidth-Limited Loop Kernels

    Authors: Jan Treibig, Georg Hager

    Abstract: We present a performance model for bandwidth limited loop kernels which is founded on the analysis of modern cache based microarchitectures. This model allows an accurate performance prediction and evaluation for existing instruction codes. It provides an in-depth understanding of how performance for different memory hierarchy levels is made up. The performance of raw memory load, store and copy…

    Submitted 6 May, 2009; originally announced May 2009.

    Comments: 8 pages