Search | arXiv e-print repository

A shared compilation stack for distributed-memory parallelism in stencil DSLs

Authors: George Bisbas, Anton Lydike, Emilien Bauer, Nick Brown, Mathieu Fehr, Lawrence Mitchell, Gabriel Rodriguez-Canal, Maurice Jamieson, Paul H. J. Kelly, Michel Steuwer, Tobias Grosser

Abstract: Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target current- and next-generation supercomputers. The convenience and performance of DSLs come with significant development and maintenance costs. The siloe… ▽ More Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target current- and next-generation supercomputers. The convenience and performance of DSLs come with significant development and maintenance costs. The siloed design of DSL compilers and the resulting inability to benefit from shared infrastructure cause uncertainties around longevity and the adoption of DSLs at scale. By tailoring the broadly-adopted MLIR compiler framework to HPC, we bring the same synergies that the machine learning community already exploits across their DSLs (e.g. Tensorflow, PyTorch) to the finite-difference stencil HPC community. We introduce new HPC-specific abstractions for message passing targeting distributed stencil computations. We demonstrate the sharing of common components across three distinct HPC stencil-DSL compilers: Devito, PSyclone, and the Open Earth Compiler, showing that our framework generates high-performance executables based upon a shared compiler ecosystem. △ Less

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2311.07422 [pdf, other]

Sidekick compilation with xDSL

Authors: Mathieu Fehr, Michel Weber, Christian Ulmann, Alexandre Lopoukhine, Martin Lücke, Théo Degioanni, Michel Steuwer, Tobias Grosser

Abstract: Traditionally, compiler researchers either conduct experiments within an existing production compiler or develop their own prototype compiler; both options come with trade-offs. On one hand, prototy** in a production compiler can be cumbersome, as they are often optimized for program compilation speed at the expense of software simplicity and development speed. On the other hand, the transition… ▽ More Traditionally, compiler researchers either conduct experiments within an existing production compiler or develop their own prototype compiler; both options come with trade-offs. On one hand, prototy** in a production compiler can be cumbersome, as they are often optimized for program compilation speed at the expense of software simplicity and development speed. On the other hand, the transition from a prototype compiler to production requires significant engineering work. To bridge this gap, we introduce the concept of sidekick compiler frameworks, an approach that uses multiple frameworks that interoperate with each other by leveraging textual interchange formats and declarative descriptions of abstractions. Each such compiler framework is specialized for specific use cases, such as performance or prototy**. Abstractions are by design shared across frameworks, simplifying the transition from prototy** to production. We demonstrate this idea with xDSL, a sidekick for MLIR focused on prototy** and teaching. xDSL interoperates with MLIR through a shared textual IR and the exchange of IRs through an IR Definition Language. The benefits of sidekick compiler frameworks are evaluated by showing on three use cases how xDSL impacts their development: teaching, DSL compilation, and rewrite system prototy**. We also investigate the trade-offs that xDSL offers, and demonstrate how we simplify the transition between frameworks using the IRDL dialect. With sidekick compilation, we envision a future in which engineers minimize the cost of development by choosing a framework built for their immediate needs, and later transitioning to production with minimal overhead. △ Less

Submitted 16 June, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: 14 pages, 15 figures; updated twice to include acknowledgements

arXiv:2305.03448 [pdf, other]

Descend: A Safe GPU Systems Programming Language

Authors: Bastian Köpcke, Sergei Gorlatch, Michel Steuwer

Abstract: Graphics Processing Units (GPU) offer tremendous computational power by following a throughput oriented computing paradigm where many thousand computational units operate in parallel. Programming this massively parallel hardware is challenging. Programmers must correctly and efficiently coordinate thousands of threads and their accesses to various shared memory spaces. Existing mainstream GPU prog… ▽ More Graphics Processing Units (GPU) offer tremendous computational power by following a throughput oriented computing paradigm where many thousand computational units operate in parallel. Programming this massively parallel hardware is challenging. Programmers must correctly and efficiently coordinate thousands of threads and their accesses to various shared memory spaces. Existing mainstream GPU programming languages, such as CUDA and OpenCL, are based on C/C++ inheriting their fundamentally unsafe ways to access memory via raw pointers. This facilitates easy to make, but hard to detect bugs such as data races and deadlocks. In this paper, we present Descend: a safe GPU systems programming language. In the spirit of Rust, Descend's type system enforces safe CPU and GPU memory management by tracking Ownership and Lifetimes. Descend introduces a new holistic GPU programming model where computations are hierarchically scheduled over the GPU's execution resources: grid, blocks, and threads. Descend's extended Borrow checking ensures that execution resources safely access memory regions without introducing data races. For this, we introduced views describing safe parallel access patterns of memory regions. We discuss the memory safety guarantees offered by Descend's type system and evaluate our implementation of Descend using a number of benchmarks, showing that no significant runtime overhead is introduced compared to manually written CUDA programs lacking Descend's safety guarantees. △ Less

Submitted 5 May, 2023; originally announced May 2023.

arXiv:2304.14154 [pdf, other]

Traced Types for Safe Strategic Rewriting

Authors: Rongxiao Fu, Ornela Dardha, Michel Steuwer

Abstract: Strategy languages enable programmers to compose rewrite rules into strategies and control their application. This is useful in programming languages, e.g., for describing program transformations compositionally, but also in automated theorem proving, where related ideas have been studies with tactics languages. Clearly, not all compositions of rewrites are correct, but how can we assist programme… ▽ More Strategy languages enable programmers to compose rewrite rules into strategies and control their application. This is useful in programming languages, e.g., for describing program transformations compositionally, but also in automated theorem proving, where related ideas have been studies with tactics languages. Clearly, not all compositions of rewrites are correct, but how can we assist programmers in writing correct strategies? In this paper, we present a static type system for strategy languages. We combine a structural type system capturing how rewrite strategies transform the shape of the rewritten syntax with a novel tracing system that keeps track of all possible legal strategy execution paths. Our type system raises warnings when parts of a composition are guaranteed to fail at runtime, and errors when no legal execution for a strategy is possible. We present a formalization of our strategy language and novel tracing type system, and formally prove its type soundness. We present formal results, showing that ill-traced strategies are guaranteed to fail at runtime and that well-traced strategy executions "can't go wrong", meaning that they are guaranteed to have a possible successful execution path. △ Less

Submitted 27 April, 2023; originally announced April 2023.

arXiv:2304.08267 [pdf, ps, other]

doi 10.1145/3622836

Structural Subty** as Parametric Polymorphism

Authors: Wenhao Tang, Daniel Hillerström, James McKinna, Michel Steuwer, Ornela Dardha, Rongxiao Fu, Sam Lindley

Abstract: Structural subty** and parametric polymorphism provide similar flexibility and reusability to programmers. For example, both features enable the programmer to provide a wider record as an argument to a function that expects a narrower one. However, the means by which they do so differs substantially, and the precise details of the relationship between them exists, at best, as folklore in literat… ▽ More Structural subty** and parametric polymorphism provide similar flexibility and reusability to programmers. For example, both features enable the programmer to provide a wider record as an argument to a function that expects a narrower one. However, the means by which they do so differs substantially, and the precise details of the relationship between them exists, at best, as folklore in literature. In this paper, we systematically study the relative expressive power of structural subty** and parametric polymorphism. We focus our investigation on establishing the extent to which parametric polymorphism, in the form of row and presence polymorphism, can encode structural subty** for variant and record types. We base our study on various Church-style $λ$-calculi extended with records and variants, different forms of structural subty**, and row and presence polymorphism. We characterise expressiveness by exhibiting compositional translations between calculi. For each translation we prove a type preservation and operational correspondence result. We also prove a number of non-existence results. By imposing restrictions on both source and target types, we reveal further subtleties in the expressiveness landscape, the restrictions enabling otherwise impossible translations to be defined. More specifically, we prove that full subty** cannot be encoded via polymorphism, but we show that several restricted forms of subty** can be encoded via particular forms of polymorphism. △ Less

Submitted 11 September, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

Comments: 47 pages, accepted by OOPSLA 2023

arXiv:2212.11142 [pdf, other]

BaCO: A Fast and Portable Bayesian Compiler Optimization Framework

Authors: Erik Hellsten, Artur Souza, Johannes Lenfers, Rubens Lacouture, Olivia Hsu, Adel Ejjeh, Fredrik Kjolstad, Michel Steuwer, Kunle Olukotun, Luigi Nardi

Abstract: We introduce the Bayesian Compiler Optimization framework (BaCO), a general purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. BaCO provides the flexibility needed to handle the requirements of modern autotuning tasks. Particularly, it deals with permutation, ordered, and continuous parameter types along with both known and unknown parameter constraints. To reason about these… ▽ More We introduce the Bayesian Compiler Optimization framework (BaCO), a general purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. BaCO provides the flexibility needed to handle the requirements of modern autotuning tasks. Particularly, it deals with permutation, ordered, and continuous parameter types along with both known and unknown parameter constraints. To reason about these parameter types and efficiently deliver high-quality code, BaCO uses Bayesian optimiza tion algorithms specialized towards the autotuning domain. We demonstrate BaCO's effectiveness on three modern compiler systems: TACO, RISE & ELEVATE, and HPVM2FPGA for CPUs, GPUs, and FPGAs respectively. For these domains, BaCO outperforms current state-of-the-art autotuners by delivering on average 1.36x-1.56x faster code with a tiny search budget, and BaCO is able to reach expert-level performance 2.9x-3.9x faster. △ Less

Submitted 11 April, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

arXiv:2205.09655 [pdf]

doi 10.22152/programming-journal.org/2023/7/11

Primrose: Selecting Container Data Types by Their Properties

Authors: Xueying Qin, Liam O'Connor, Michel Steuwer

Abstract: Context: Container data types are ubiquitous in computer programming, enabling developers to efficiently store and process collections of data with an easy-to-use programming interface. Many programming languages offer a variety of container implementations in their standard libraries based on data structures offering different capabilities and performance characteristics. Inquiry: Choosing the… ▽ More Context: Container data types are ubiquitous in computer programming, enabling developers to efficiently store and process collections of data with an easy-to-use programming interface. Many programming languages offer a variety of container implementations in their standard libraries based on data structures offering different capabilities and performance characteristics. Inquiry: Choosing the *best* container for an application is not always straightforward, as performance characteristics can change drastically in different scenarios, and as real-world performance is not always correlated to theoretical complexity. Approach: We present Primrose, a language-agnostic tool for selecting the best performing valid container implementation from a set of container data types that satisfy *properties* given by application developers. Primrose automatically selects the set of valid container implementations for which the *library specifications*, written by the developers of container libraries, satisfies the specified properties. Finally, Primrose ranks the valid library implementations based on their runtime performance. Knowledge: With Primrose, application developers can specify the expected behaviour of a container as a type refinement with *semantic properties*, e.g., if the container should only contain unique values (such as a `set`) or should satisfy the LIFO property of a `stack`. Semantic properties nicely complement *syntactic properties* (i.e., traits, interfaces, or type classes), together allowing developers to specify a container's programming interface *and* behaviour without committing to a concrete implementation. Grounding: We present our prototype implementation of Primrose that preprocesses annotated Rust code, selects valid container implementations and ranks them on their performance. The design of Primrose is, however, language-agnostic, and is easy to integrate into other programming languages that support container data types and traits, interfaces, or type classes. Our implementation encodes properties and library specifications into verification conditions in Rosette, an interface for SMT solvers, which determines the set of valid container implementations. We evaluate Primrose by specifying several container implementations, and measuring the time taken to select valid implementations for various combinations of properties with the solver. We automatically validate that container implementations conform to their library specifications via property-based testing. Importance: This work provides a novel approach to bring abstract modelling and specification of container types directly into the programmer's workflow. Instead of selecting concrete container implementations, application programmers can now work on the level of specification, merely stating the behaviours they require from their container types, and the best implementation can be selected automatically. △ Less

Submitted 20 February, 2023; v1 submitted 19 May, 2022; originally announced May 2022.

Journal ref: The Art, Science, and Engineering of Programming, 2023, Vol. 7, Issue 3, Article 11

arXiv:2201.03611 [pdf, other]

RISE & Shine: Language-Oriented Compiler Design

Authors: Michel Steuwer, Thomas Koehler, Bastian Köpcke, Federico Pizzuti

Abstract: The trend towards specialization of software and hardware - fuelled by the end of Moore's law and the still accelerating interest in domain-specific computing, such as machine learning - forces us to radically rethink our compiler designs. The era of a universal compiler framework built around a single one-size-fits-all intermediate representation (IR) is over. This realization has sparked the cre… ▽ More The trend towards specialization of software and hardware - fuelled by the end of Moore's law and the still accelerating interest in domain-specific computing, such as machine learning - forces us to radically rethink our compiler designs. The era of a universal compiler framework built around a single one-size-fits-all intermediate representation (IR) is over. This realization has sparked the creation of the MLIR compiler framework that empowers compiler engineers to design and integrate IRs capturing specific abstractions. MLIR provides a generic framework for SSA-based IRs, but it doesn't help us to decide how we should design IRs that are easy to develop, to work with and to combine into working compilers. To address the challenge of IR design, we advocate for a language-oriented compiler design that understands IRs as formal programming languages and enforces their correct use via an accompanying type system. We argue that programming language techniques directly guide extensible IR designs and provide a formal framework to reason about transforming between multiple IRs. In this paper, we discuss the design of the Shine compiler that compiles the high-level functional pattern-based data-parallel language RISE via a hybrid functional-imperative intermediate language to C, OpenCL, and OpenMP. We compare our work directly with the closely related pattern-based Lift IR and compiler. We demonstrate that our language-oriented compiler design results in a more robust and predictable compiler that is extensible at various abstraction levels. Our experimental evaluation shows that this compiler design is able to generate high-performance GPU code. △ Less

Submitted 10 January, 2022; originally announced January 2022.

arXiv:2111.13040 [pdf, other]

Sketch-Guided Equality Saturation: Scaling Equality Saturation to Complex Optimizations of Functional Programs

Authors: Thomas Koehler, Phil Trinder, Michel Steuwer

Abstract: Generating high-performance code for diverse hardware and application domains is challenging. Functional array programming languages with patterns like map and reduce have been successfully combined with term rewriting to define and explore optimization spaces. However, deciding what sequence of rewrites to apply is hard and has a huge impact on the performance of the rewritten program. Equality s… ▽ More Generating high-performance code for diverse hardware and application domains is challenging. Functional array programming languages with patterns like map and reduce have been successfully combined with term rewriting to define and explore optimization spaces. However, deciding what sequence of rewrites to apply is hard and has a huge impact on the performance of the rewritten program. Equality saturation avoids the issue by exploring many possible ways to apply rewrites, efficiently representing many equivalent programs in an e-graph data structure. Equality saturation has some limitations when rewriting functional language terms, as currently naive encodings of the lambda calculus are used. We present new techniques for encoding polymorphically typed lambda calculi, and show that the efficient encoding reduces the runtime and memory consumption of equality saturation by orders of magnitude. Moreover, equality saturation does not yet scale to complex compiler optimizations. These emerge from long rewrite sequences of thousands of rewrite steps, and may use pathological combinations of rewrite rules that cause the e-graph to quickly grow too large. This paper introduces \emph{sketch-guided equality saturation}, a semi-automatic technique that allows programmers to provide program sketches to guide rewriting. Sketch-guided equality saturation is evaluated for seven complex matrix multiplication optimizations, including loop blocking, vectorization, and multi-threading. Even with efficient lambda calculus encoding, unguided equality saturation can locate only the two simplest of these optimizations, the remaining five are undiscovered even with an hour of compilation time and 60GB of RAM. By specifying three or fewer sketch guides all seven optimizations are found in seconds of compilation time, using under 1GB of RAM, and generating high performance code. △ Less

Submitted 3 June, 2022; v1 submitted 25 November, 2021; originally announced November 2021.

Comments: 23 pages excluding references, submitted to OOPLSA 2022

arXiv:2103.13390 [pdf, ps, other]

Row-Polymorphic Types for Strategic Rewriting

Authors: Rongxiao Fu, Xueying Qin, Ornela Dardha, Michel Steuwer

Abstract: We present a type system for strategy languages that express program transformations as compositions of rewrite rules. Our row-polymorphic type system assists compiler engineers to write correct strategies by statically rejecting non meaningful compositions of rewrites that otherwise would fail during rewriting at runtime. Furthermore, our type system enables reasoning about how rewriting transfor… ▽ More We present a type system for strategy languages that express program transformations as compositions of rewrite rules. Our row-polymorphic type system assists compiler engineers to write correct strategies by statically rejecting non meaningful compositions of rewrites that otherwise would fail during rewriting at runtime. Furthermore, our type system enables reasoning about how rewriting transforms the shape of the computational program. We present a formalization of our language at its type system and demonstrate its practical use for expressing compiler optimization strategies. Our type system builds the foundation for many interesting future applications, including verifying the correctness of program transformations and synthesizing program transformations from specifications encoded as types. △ Less

Submitted 23 March, 2021; originally announced March 2021.

arXiv:2002.02268 [pdf, other]

A Language for Describing Optimization Strategies

Authors: Bastian Hagedorn, Johannes Lenfers, Thomas Koehler, Sergei Gorlatch, Michel Steuwer

Abstract: Optimizing programs to run efficiently on modern parallel hardware is hard but crucial for many applications. The predominantly used imperative languages - like C or OpenCL - force the programmer to intertwine the code describing functionality and optimizations. This results in a nightmare for portability which is particularly problematic given the accelerating trend towards specialized hardware d… ▽ More Optimizing programs to run efficiently on modern parallel hardware is hard but crucial for many applications. The predominantly used imperative languages - like C or OpenCL - force the programmer to intertwine the code describing functionality and optimizations. This results in a nightmare for portability which is particularly problematic given the accelerating trend towards specialized hardware devices to further increase efficiency. Many emerging DSLs used in performance demanding domains such as deep learning, automatic differentiation, or image processing attempt to simplify or even fully automate the optimization process. Using a high-level - often functional - language, programmers focus on describing functionality in a declarative way. In some systems such as Halide or TVM, a separate schedule specifies how the program should be optimized. Unfortunately, these schedules are not written in well-defined programming languages. Instead, they are implemented as a set of ad-hoc predefined APIs that the compiler writers have exposed. In this paper, we present Elevate: a functional language for describing optimization strategies. Elevate follows a tradition of prior systems used in different contexts that express optimization strategies as composition of rewrites. In contrast to systems with scheduling APIs, in Elevate programmers are not restricted to a set of built-in optimizations but define their own optimization strategies freely in a composable way. We show how user-defined optimization strategies in Elevate enable the effective optimization of programs expressed in a functional data-parallel language demonstrating competitive performance with Halide and TVM. △ Less

Submitted 6 February, 2020; originally announced February 2020.

Comments: https://elevate-lang.org/ https://github.com/elevate-lang

arXiv:1710.08332 [pdf, other]

Strategy Preserving Compilation for Parallel Functional Code

Authors: Robert Atkey, Michel Steuwer, Sam Lindley, Christophe Dubach

Abstract: Graphics Processing Units (GPUs) and other parallel devices are widely available and have the potential for accelerating a wide class of algorithms. However, expert programming skills are required to achieving maximum performance. hese devices expose low-level hardware details through imperative programming interfaces where programmers explicity encode device-specific optimisation strategies. This… ▽ More Graphics Processing Units (GPUs) and other parallel devices are widely available and have the potential for accelerating a wide class of algorithms. However, expert programming skills are required to achieving maximum performance. hese devices expose low-level hardware details through imperative programming interfaces where programmers explicity encode device-specific optimisation strategies. This inevitably results in non-performance-portable programs delivering suboptimal performance on other devices. Functional programming models have recently seen a renaissance in the systems community as they offer possible solutions for tackling the performance portability challenge. Recent work has shown how to automatically choose high-performance parallelisation strategies for a wide range of hardware architectures encoded in a functional representation. However, the translation of such functional representations to the imperative program expected by the hardware interface is typically performed ad hoc with no correctness guarantees and no guarantees to preserve the intended parallelisation strategy. In this paper, we present a formalised strategy-preserving translation from high-level functional code to low-level data race free parallel imperative code. This translation is formulated and proved correct within a language we call Data Parallel Idealised Algol (DPIA), a dialect of Reynolds' Idealised Algol. Performance results on GPUs and a multicore CPU show that the formalised translation process generates low-level code with performance on a par with code generated from ad hoc approaches. △ Less

Submitted 23 October, 2017; originally announced October 2017.

arXiv:1511.02490 [pdf, other]

Autotuning OpenCL Workgroup Size for Stencil Patterns

Authors: Chris Cummins, Pavlos Petoumenos, Michel Steuwer, Hugh Leather

Abstract: Selecting an appropriate workgroup size is critical for the performance of OpenCL kernels, and requires knowledge of the underlying hardware, the data being operated on, and the implementation of the kernel. This makes portable performance of OpenCL programs a challenging goal, since simple heuristics and statically chosen values fail to exploit the available performance. To address this, we propo… ▽ More Selecting an appropriate workgroup size is critical for the performance of OpenCL kernels, and requires knowledge of the underlying hardware, the data being operated on, and the implementation of the kernel. This makes portable performance of OpenCL programs a challenging goal, since simple heuristics and statically chosen values fail to exploit the available performance. To address this, we propose the use of machine learning-enabled autotuning to automatically predict workgroup sizes for stencil patterns on CPUs and multi-GPUs. We present three methodologies for predicting workgroup sizes. The first, using classifiers to select the optimal workgroup size. The second and third proposed methodologies employ the novel use of regressors for performing classification by predicting the runtime of kernels and the relative performance of different workgroup sizes, respectively. We evaluate the effectiveness of each technique in an empirical study of 429 combinations of architecture, kernel, and dataset, comparing an average of 629 different workgroup sizes for each. We find that autotuning provides a median 3.79x speedup over the best possible fixed workgroup size, achieving 94% of the maximum performance. △ Less

Submitted 6 January, 2016; v1 submitted 8 November, 2015; originally announced November 2015.

Comments: 8 pages, 6 figures, presented at the 6th International Workshop on Adaptive Self-tuning Computing Systems (ADAPT '16)

arXiv:1502.02389 [pdf, other]

Patterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)

Authors: Michel Steuwer, Christian Fensch, Christophe Dubach

Abstract: Computing systems have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort. This results in a tension between achieving performance and code portability. Code is either tuned using device-specific optimizations to achieve maximum performa… ▽ More Computing systems have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort. This results in a tension between achieving performance and code portability. Code is either tuned using device-specific optimizations to achieve maximum performance or is written in a high-level language to achieve portability at the expense of performance. We propose a novel approach that offers high-level programming, code portability and high-performance. It is based on algorithmic pattern composition coupled with a powerful, yet simple, set of rewrite rules. This enables systematic transformation and optimization of a high-level program into a low-level hardware specific representation which leads to high performance code. We test our design in practice by describing a subset of the OpenCL programming model with low-level patterns and by implementing a compiler which generates high performance OpenCL code. Our experiments show that we can systematically derive high-performance device-specific implementations from simple high-level algorithmic expressions. The performance of the generated OpenCL code is on par with highly tuned implementations for multicore CPUs and GPUs written by experts △ Less

Submitted 9 February, 2015; originally announced February 2015.

Comments: Technical Report

ACM Class: D.3.3; D.3.4

Showing 1–14 of 14 results for author: Steuwer, M