arXiv:2104.13242 [pdf, other]

Autotuning PolyBench Benchmarks with LLVM Clang/Polly Loop Optimization Pragmas Using Bayesian Optimization (extended version)

Authors: Xingfu Wu, Michael Kruse, Prasanna Balaprakash, Hal Finkel, Paul Hovland, Valerie Taylor, Mary Hall

Abstract: In this paper, we develop a ytopt autotuning framework that leverages Bayesian optimization to explore the parameter space search and compare four different supervised learning methods within Bayesian optimization and evaluate their effectiveness. We select six of the most complex PolyBench benchmarks and apply the newly developed LLVM Clang/Polly loop optimization pragmas to the benchmarks to optimize them. We then use the autotuning framework to optimize the pragma parameters to improve their performance. The experimental results show that our autotuning approach outperforms the other compiling methods to provide the smallest execution time for the benchmarks syr2k, 3mm, heat-3d, lu, and covariance with two large datasets in 200 code evaluations for effectively searching the parameter spaces with up to 170,368 different configurations. We find that the Floyd-Warshall benchmark did not benefit from autotuning because Polly uses heuristics to optimize the benchmark to make it run much slower. To cope with this issue, we provide some compiler option solutions to improve the performance. Then we present loop autotuning without a user's knowledge using a simple mctree autotuning framework to further improve the performance of the Floyd-Warshall benchmark. We also extend the ytopt autotuning framework to tune a deep learning application.

Submitted 27 April, 2021; originally announced April 2021.

Comments: Submitted to CCPE journal. arXiv admin note: substantial text overlap with arXiv:2010.08040

arXiv:2102.01687 [pdf, other]

Report of the Workshop on Program Synthesis for Scientific Computing

Authors: Hal Finkel, Ignacio Laguna

Abstract: Program synthesis is an active research field in academia, national labs, and industry. Yet, work directly applicable to scientific computing, while having some impressive successes, has been limited. This report reviews the relevant areas of program synthesis work for scientific computing, discusses successes to date, and outlines opportunities for future work. This report is the result of the Workshop on Program Synthesis for Scientific Computing, which was held virtually on August 4-5, 2020 (https://prog-synth-science.github.io/2020/).

Submitted 2 February, 2021; originally announced February 2021.

Comments: 29 pages, workshop website: https://prog-synth-science.github.io/2020/

arXiv:2010.08439 [pdf, other]

Really Embedding Domain-Specific Languages into C++

Authors: Hal Finkel, Alexander McCaskey, Tobi Popoola, Dmitry Lyakh, Johannes Doerfert

Abstract: Domain-specific languages (DSLs) are both pervasive and powerful, but remain difficult to integrate into large projects. As a result, while DSLs can bring distinct advantages in performance, reliability, and maintainability, their use often involves trading off other good software-engineering practices. In this paper, we describe an extension to the Clang C++ compiler to support syntax plugins, and we demonstrate how this mechanism allows making use of DSLs inside of a C++ code base without needing to separate the DSL source code from the surrounding C++ code.

Submitted 16 October, 2020; originally announced October 2020.

arXiv:2010.08040 [pdf, other]

Autotuning PolyBench Benchmarks with LLVM Clang/Polly Loop Optimization Pragmas Using Bayesian Optimization

Authors: Xingfu Wu, Michael Kruse, Prasanna Balaprakash, Hal Finkel, Paul Hovland, Valerie Taylor, Mary Hall

Abstract: Autotuning is an approach that explores a search space of possible implementations/configurations of a kernel or an application by selecting and evaluating a subset of implementations/configurations on a target platform and/or using models to identify a high-performance implementation/configuration. In this paper, we develop an autotuning framework that leverages Bayesian optimization to explore the parameter space search. We select six of the most complex benchmarks from the application domains of the PolyBench benchmarks (syr2k, 3mm, heat-3d, lu, covariance, and Floyd-Warshall) and apply the newly developed LLVM Clang/Polly loop optimization pragmas to the benchmarks to optimize them. We then use the autotuning framework to optimize the pragma parameters to improve their performance. The experimental results show that our autotuning approach outperforms the other compiling methods to provide the smallest execution time for the benchmarks syr2k, 3mm, heat-3d, lu, and covariance with two large datasets in 200 code evaluations for effectively searching the parameter spaces with up to 170,368 different configurations. We compare four different supervised learning methods within Bayesian optimization and evaluate their effectiveness. We find that the Floyd-Warshall benchmark did not benefit from autotuning because Polly uses heuristics to optimize the benchmark to make it run much slower. To cope with this issue, we provide some compiler option solutions to improve the performance.

Submitted 15 October, 2020; originally announced October 2020.

Comments: to be published in the 11th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS20)

arXiv:2010.06521 [pdf, other]

Autotuning Search Space for Loop Transformations

Authors: Michael Kruse, Hal Finkel, Xingfu Wu

Abstract: One of the challenges for optimizing compilers is to predict whether applying an optimization will improve its execution speed. Programmers may override the compiler's profitability heuristic using optimization directives such as pragmas in the source code. Machine learning in the form of autotuning can assist users in finding the best optimizations for each platform. In this paper we propose a loop transformation search space that takes the form of a tree, in contrast to previous approaches that usually use vector spaces to represent loop optimization configurations. We implemented a simple autotuner exploring the search space and applied it to a selected set of PolyBench kernels. While the autotuner is capable of representing every possible sequence of loop transformations and their relations, the results motivate the use of better search strategies such as Monte Carlo tree search to find sophisticated loop transformations such as multilevel tiling.

Submitted 13 October, 2020; originally announced October 2020.

Comments: LLVM-in-HPC 2020 preprint

arXiv:2010.03935 [pdf, other]

Extending C++ for Heterogeneous Quantum-Classical Computing

Authors: Thien Nguyen, Anthony Santana, Tyler Kharazi, Daniel Claudino, Hal Finkel, Alexander McCaskey

Abstract: We present qcor - a language extension to C++ and compiler implementation that enables heterogeneous quantum-classical programming, compilation, and execution in a single-source context. Our work provides a first-of-its-kind C++ compiler enabling high-level quantum kernel (function) expression in a quantum-language agnostic manner, as well as a hardware-agnostic, retargetable compiler workflow targeting a number of physical and virtual quantum computing backends. qcor leverages novel Clang plugin interfaces and builds upon the XACC system-level quantum programming framework to provide a state-of-the-art integration mechanism for quantum-classical compilation that leverages the best from the community at-large. qcor translates quantum kernels ultimately to the XACC intermediate representation, and provides user-extensible hooks for quantum compilation routines like circuit optimization, analysis, and placement. This work details the overall architecture and compiler workflow for qcor, and provides a number of illuminating programming examples demonstrating its utility for near-term variational tasks, quantum algorithm expression, and feed-forward error correction schemes.

Submitted 8 October, 2020; originally announced October 2020.

arXiv:1910.02375 [pdf, ps, other]

doi 10.1007/978-3-030-28596-8_9

Design and Use of Loop-Transformation Pragmas

Authors: Michael Kruse, Hal Finkel

Abstract: Adding a pragma directive into the source code is arguably easier than rewriting it, for instance for loop unrolling. Moreover, if the application is maintained for multiple platforms, their difference in performance characteristics may require different code transformations. Code transformation directives allow replacing the directives depending on the platform, i.e. separation of code semantics and its performance optimization. In this paper, we explore the design space (syntax and semantics) of adding such directive into a future OpenMP specification. Using a prototype implementation in Clang, we demonstrate the usefulness of such directives on a few benchmarks.

Submitted 6 October, 2019; originally announced October 2019.

Comments: IWOMP 2019, September 11-13, Auckland, preprint

arXiv:1904.08555 [pdf, other]

ClangJIT: Enhancing C++ with Just-in-Time Compilation

Authors: Hal Finkel, David Poliakoff, David F. Richards

Abstract: The C++ programming language is not only a keystone of the high-performance-computing ecosystem but has proven to be a successful base for portable parallel-programming frameworks. As is well known, C++ programmers use templates to specialize algorithms, thus allowing the compiler to generate highly-efficient code for specific parameters, data structures, and so on. This capability has been limited to those specializations that can be identified when the application is compiled, and in many critical cases, compiling all potentially-relevant specializations is not practical. ClangJIT provides a well-integrated C++ language extension allowing template-based specialization to occur during program execution. This capability has been implemented for use in large-scale applications, and we demonstrate that just-in-time-compilation-based dynamic specialization can be integrated into applications, often requiring minimal changes (or no changes) to the applications themselves, providing significant performance improvements, programmer-productivity improvements, and decreased compilation time.

Submitted 27 April, 2019; v1 submitted 17 April, 2019; originally announced April 2019.

Report number: LLNL-CONF-772305, APT-151745

arXiv:1811.05630 [pdf, other]

Memory-Efficient Quantum Circuit Simulation by Using Lossy Data Compression

Authors: Xin-Chuan Wu, Sheng Di, Franck Cappello, Hal Finkel, Yuri Alexeev, Frederic T. Chong

Abstract: In order to evaluate, validate, and refine the design of new quantum algorithms or quantum computers, researchers and developers need methods to assess their correctness and fidelity. This requires the capabilities of quantum circuit simulations. However, the number of quantum state amplitudes increases exponentially with the number of qubits, leading to the exponential growth of the memory requirement for the simulations. In this work, we present our memory-efficient quantum circuit simulation by using lossy data compression. Our empirical data shows that we reduce the memory requirement to 16.5% and 2.24E-06 of the original requirement for QFT and Grover's search, respectively. This finding further suggests that we can simulate deep quantum circuits up to 63 qubits with 0.8 petabytes memory.

Submitted 14 November, 2018; v1 submitted 13 November, 2018; originally announced November 2018.

Comments: 2 pages, 2 figures. The 3rd International Workshop on Post-Moore Era Supercomputing (PMES)

arXiv:1811.05140 [pdf, other]

Amplitude-Aware Lossy Compression for Quantum Circuit Simulation

Authors: Xin-Chuan Wu, Sheng Di, Franck Cappello, Hal Finkel, Yuri Alexeev, Frederic T. Chong

Abstract: Classical simulation of quantum circuits is crucial for evaluating and validating the design of new quantum algorithms. However, the number of quantum state amplitudes increases exponentially with the number of qubits, leading to the exponential growth of the memory requirement for the simulations. In this paper, we present a new data reduction technique to reduce the memory requirement of quantum circuit simulations. We apply our amplitude-aware lossy compression technique to the quantum state amplitude vector to trade the computation time and fidelity for memory space. The experimental results show that our simulator only needs 1/16 of the original memory requirement to simulate Quantum Fourier Transform circuits with 99.95% fidelity. The reduction amount of memory requirement suggests that we could increase 4 qubits in the quantum circuit simulation comparing to the simulation without our technique. Additionally, for some specific circuits, like Grover's search, we could increase the simulation size by 18 qubits.

Submitted 14 November, 2018; v1 submitted 13 November, 2018; originally announced November 2018.

Comments: 6 pages, 6 figures. The 4th International Workshop on Data Reduction for Big Scientific Data (DRBSD-4)

arXiv:1811.00632 [pdf, other]

Loop Optimization Framework

Authors: Michael Kruse, Hal Finkel

Abstract: The LLVM compiler framework supports a selection of loop transformations such as vectorization, distribution and unrolling. Each transformation is carried-out by specialized passes that have been developed independently. In this paper we propose an integrated approach to loop optimizations: A single dedicated pass that mutates a Loop Structure DAG. Each transformation can make use of a common infrastructure such as dependency analysis, transformation preconditions, etc.

Submitted 1 November, 2018; originally announced November 2018.

Comments: LCPC'18 preprint

arXiv:1811.00624 [pdf, other]

User-Directed Loop-Transformations in Clang

Authors: Michael Kruse, Hal Finkel

Abstract: Directives for the compiler such as pragmas can help programmers to separate an algorithm's semantics from its optimization. This keeps the code understandable and easier to optimize for different platforms. Simple transformations such as loop unrolling are already implemented in most mainstream compilers. We recently submitted a proposal to add generalized loop transformations to the OpenMP standard. We are also working on an implementation in LLVM/Clang/Polly to show its feasibility and usefulness. The current prototype allows applying patterns common to matrix-matrix multiplication optimizations.

Submitted 1 November, 2018; originally announced November 2018.

Comments: LLVM-HPC Workshop 2018 preprint

arXiv:1805.03374 [pdf, other]

doi 10.1007/978-3-319-98521-3_3

A Proposal for Loop-Transformation Pragmas

Authors: Michael Kruse, Hal Finkel

Abstract: Pragmas for loop transformations, such as unrolling, are implemented in most mainstream compilers. They are used by application programmers because of their ease of use compared to directly modifying the source code of the relevant loops. We propose additional pragmas for common loop transformations that go far beyond the transformations today's compilers provide and should make most source rewriting for the sake of loop optimization unnecessary. To encourage compilers to implement these pragmas, and to avoid a diversity of incompatible syntaxes, we would like to spark a discussion about an inclusion to the OpenMP standard.

Submitted 11 June, 2018; v1 submitted 9 May, 2018; originally announced May 2018.

Comments: IWOMP'18 preprint

arXiv:1610.02606 [pdf, other]

Doing Moore with Less -- Leapfrogging Moore's Law with Inexactness for Supercomputing

Authors: Sven Leyffer, Stefan M. Wild, Mike Fagan, Marc Snir, Krishna Palem, Kazutomo Yoshii, Hal Finkel

Abstract: Energy and power consumption are major limitations to continued scaling of computing systems. Inexactness, where the quality of the solution can be traded for energy savings, has been proposed as an approach to overcoming those limitations. In the past, however, inexactness necessitated the need for highly customized or specialized hardware. The current evolution of commercial off-the-shelf (COTS) processors facilitates the use of lower-precision arithmetic in ways that reduce energy consumption. We study these new opportunities in this paper, using the example of an inexact Newton algorithm for solving nonlinear equations. Moreover, we have begun developing a set of techniques we call reinvestment that, paradoxically, use reduced precision to improve the quality of the computed result: They do so by reinvesting the energy saved by reduced precision.

Submitted 12 October, 2016; v1 submitted 8 October, 2016; originally announced October 2016.

Comments: 9 pages, 12 figures, PDFLaTeX. 12 Oct 2016: Corrected author Hal Finkel's affiliation to show ALCF/Argonne

ACM Class: F.2.1; G.1.5

arXiv:1510.08545 [pdf, ps, other]

High Energy Physics Forum for Computational Excellence: Working Group Reports (I. Applications Software II. Software Libraries and Tools III. Systems)

Authors: Salman Habib, Robert Roser, Tom LeCompte, Zach Marshall, Anders Borgland, Brett Viren, Peter Nugent, Makoto Asai, Lothar Bauerdick, Hal Finkel, Steve Gottlieb, Stefan Hoeche, Paul Sheldon, Jean-Luc Vay, Peter Elmer, Michael Kirby, Simon Patton, Maxim Potekhin, Brian Yanny, Paolo Calafiura, Eli Dart, Oliver Gutsche, Taku Izubuchi, Adam Lyon, Don Petravick

Abstract: Computing plays an essential role in all aspects of high energy physics. As computational technology evolves rapidly in new directions, and data throughput and volume continue to follow a steep trend-line, it is important for the HEP community to develop an effective response to a series of expected challenges. In order to help shape the desired response, the HEP Forum for Computational Excellence (HEP-FCE) initiated a roadmap planning activity with two key overlapping drivers -- 1) software effectiveness, and 2) infrastructure and expertise advancement. The HEP-FCE formed three working groups, 1) Applications Software, 2) Software Libraries and Tools, and 3) Systems (including systems software), to provide an overview of the current status of HEP computing and to present findings and opportunities for the desired HEP computational roadmap. The final versions of the reports are combined in this document, and are presented along with introductory material.

Submitted 28 October, 2015; originally announced October 2015.

Comments: 72 pages

arXiv:1211.4864 [pdf, other]

The Universe at Extreme Scale: Multi-Petaflop Sky Simulation on the BG/Q

Authors: Salman Habib, Vitali Morozov, Hal Finkel, Adrian Pope, Katrin Heitmann, Kalyan Kumaran, Tom Peterka, Joe Insley, David Daniel, Patricia Fasel, Nicholas Frontiere, Zarija Lukic

Abstract: Remarkable observational advances have established a compelling cross-validated model of the Universe. Yet, two key pillars of this model -- dark matter and dark energy -- remain mysterious. Sky surveys that map billions of galaxies to explore the `Dark Universe', demand a corresponding extreme-scale simulation capability; the HACC (Hybrid/Hardware Accelerated Cosmology Code) framework has been designed to deliver this level of performance now, and into the future. With its novel algorithmic structure, HACC allows flexible tuning across diverse architectures, including accelerated and multi-core systems. On the IBM BG/Q, HACC attains unprecedented scalable performance -- currently 13.94 PFlops at 69.2% of peak and 90% parallel efficiency on 1,572,864 cores with an equal number of MPI ranks, and a concurrency of 6.3 million. This level of performance was achieved at extreme problem sizes, including a benchmark run with more than 3.6 trillion particles, significantly larger than any cosmological simulation yet performed.

Submitted 19 November, 2012; originally announced November 2012.

Comments: 11 pages, 11 figures, final version of paper for talk presented at SC12

Showing 1–16 of 16 results for author: Finkel, H