Search | arXiv e-print repository

Compiler generated feedback for Large Language Models

Authors: Dejan Grubisic, Chris Cummins, Volker Seeker, Hugh Leather

Abstract: We introduce a novel paradigm in compiler optimization powered by Large Language Models with compiler feedback to optimize the code size of LLVM assembly. The model takes unoptimized LLVM IR as input and produces optimized IR, the best optimization passes, and instruction counts of both unoptimized and optimized IRs. Then we compile the input with generated optimization passes and evaluate if the… ▽ More We introduce a novel paradigm in compiler optimization powered by Large Language Models with compiler feedback to optimize the code size of LLVM assembly. The model takes unoptimized LLVM IR as input and produces optimized IR, the best optimization passes, and instruction counts of both unoptimized and optimized IRs. Then we compile the input with generated optimization passes and evaluate if the predicted instruction count is correct, generated IR is compilable, and corresponds to compiled code. We provide this feedback back to LLM and give it another chance to optimize code. This approach adds an extra 0.53% improvement over -Oz to the original model. Even though, adding more information with feedback seems intuitive, simple sampling techniques achieve much higher performance given 10 or more samples. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.11522 [pdf]

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers

Authors: Massinissa Merouani, Khaled Afif Boudaoud, Iheb Nassim Aouadj, Nassim Tchoulak, Islem Kara Bernou, Hamza Benyamina, Fatima Benbouzid-Si Tayeb, Karima Benatchba, Hugh Leather, Riyadh Baghdadi

Abstract: While polyhedral compilers have shown success in implementing advanced code transformations, they still have challenges in selecting the most profitable transformations that lead to the best speedups. This has motivated the use of machine learning to build cost models to guide the search for polyhedral optimizations. State-of-the-art polyhedral compilers have demonstrated a viable proof-of-concept… ▽ More While polyhedral compilers have shown success in implementing advanced code transformations, they still have challenges in selecting the most profitable transformations that lead to the best speedups. This has motivated the use of machine learning to build cost models to guide the search for polyhedral optimizations. State-of-the-art polyhedral compilers have demonstrated a viable proof-of-concept of this approach. While such a proof-of-concept has shown promise, it still has significant limitations. State-of-the-art polyhedral compilers that use a deep-learning cost model only support a small subset of affine transformations, limiting their ability to apply complex code transformations. They also only support simple programs that have a single loop nest and a rectangular iteration domain, limiting their applicability to many programs. These limitations significantly impact the generality of such compilers and autoschedulers and put into question the whole approach. In this paper, we introduce LOOPer, the first polyhedral autoscheduler that uses a deep-learning based cost model and covers a large set of affine transformations and programs. It supports the exploration of a large set of affine transformations, allowing the application of complex sequences of polyhedral transformations. It also supports the optimization of programs with multiple loop nests and with rectangular and non-rectangular iteration domains, allowing the optimization of an extensive set of programs. We implement and evaluate LOOPer and show that it achieves speedups over the state-of-the-art. On the Polybench benchmark, LOOPer achieves a geometric mean speedup of 1.59x over Tiramisu. LOOPer also achieves competitive speedups with a geometric mean speedup of 1.34x over Pluto, a state-of-the-art polyhedral compiler that does not use a machine-learning based cost model. △ Less

Submitted 22 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

arXiv:2402.18734 [pdf, other]

Priority Sampling of Large Language Models for Compilers

Authors: Dejan Grubisic, Chris Cummins, Volker Seeker, Hugh Leather

Abstract: Large language models show great potential in generating and optimizing code. Widely used sampling methods such as Nucleus Sampling increase the diversity of generation but often produce repeated samples for low temperatures and incoherent samples for high temperatures. Furthermore, the temperature coefficient has to be tuned for each task, limiting its usability. We present Priority Sampling, a s… ▽ More Large language models show great potential in generating and optimizing code. Widely used sampling methods such as Nucleus Sampling increase the diversity of generation but often produce repeated samples for low temperatures and incoherent samples for high temperatures. Furthermore, the temperature coefficient has to be tuned for each task, limiting its usability. We present Priority Sampling, a simple and deterministic sampling technique that produces unique samples ordered by the model's confidence. Each new sample expands the unexpanded token with the highest probability in the augmented search tree. Additionally, Priority Sampling supports generation based on regular expression that provides a controllable and structured exploration process. Priority Sampling outperforms Nucleus Sampling for any number of samples, boosting the performance of the original model from 2.87% to 5% improvement over -Oz. Moreover, it outperforms the autotuner used for the generation of labels for the training of the original model in just 30 samples. △ Less

Submitted 28 February, 2024; originally announced February 2024.

arXiv:2401.03065 [pdf, other]

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Authors: Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang

Abstract: We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variation of the benchmark. Second… ▽ More We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variation of the benchmark. Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval do not show the same improvements on our benchmark. Third, we show that simple CoT and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with chain of thought (CoT), achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction, highlighting the gap between open and closed source models. As no model is close to acing CRUXEval, we provide examples of consistent GPT-4 failures on simple programs as a lens into its code reasoning capabilities and areas for improvement. △ Less

Submitted 5 January, 2024; originally announced January 2024.

Comments: 71 pages, 29 figures

arXiv:2310.02110 [pdf, other]

Sieve: Multimodal Dataset Pruning Using Image Captioning Models

Authors: Anas Mahmoud, Mostafa Elhoushi, Amro Abbas, Yu Yang, Newsha Ardalani, Hugh Leather, Ari Morcos

Abstract: Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets. This underscores the critical need for dataset pruning, as the quality of these datasets is strongly correlated with the performance of VLMs on downstream tasks. Using CLIPScore from a pretrained model to only train models using highly-aligned samples is one of the most successful methods for pruning. W… ▽ More Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets. This underscores the critical need for dataset pruning, as the quality of these datasets is strongly correlated with the performance of VLMs on downstream tasks. Using CLIPScore from a pretrained model to only train models using highly-aligned samples is one of the most successful methods for pruning. We argue that this approach suffers from multiple limitations including: false positives and negatives due to CLIP's pretraining on noisy labels. We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs to evaluate the alignment of noisy image-text pairs. To bridge the gap between the limited diversity of generated captions and the high diversity of alternative text (alt-text), we estimate the semantic textual similarity in the embedding space of a language model pretrained on unlabeled text corpus. Using DataComp, a multimodal dataset filtering benchmark, when evaluating on 38 downstream tasks, our pruning approach, surpasses CLIPScore by 2.6\% and 1.7\% on medium and large scale respectively. In addition, on retrieval tasks, Sieve leads to a significant improvement of 2.7% and 4.5% on medium and large scale respectively. △ Less

Submitted 10 March, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: Accepted in CVPR 2024

arXiv:2309.07062 [pdf, other]

Large Language Models for Compiler Optimization

Authors: Chris Cummins, Volker Seeker, Dejan Grubisic, Mostafa Elhoushi, Youwei Liang, Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Kim Hazelwood, Gabriel Synnaeve, Hugh Leather

Abstract: We explore the novel application of Large Language Models to code optimization. We present a 7B-parameter transformer model trained from scratch to optimize LLVM assembly for code size. The model takes as input unoptimized assembly and outputs a list of compiler options to best optimize the program. Crucially, during training, we ask the model to predict the instruction counts before and after opt… ▽ More We explore the novel application of Large Language Models to code optimization. We present a 7B-parameter transformer model trained from scratch to optimize LLVM assembly for code size. The model takes as input unoptimized assembly and outputs a list of compiler options to best optimize the program. Crucially, during training, we ask the model to predict the instruction counts before and after optimization, and the optimized code itself. These auxiliary learning tasks significantly improve the optimization performance of the model and improve the model's depth of understanding. We evaluate on a large suite of test programs. Our approach achieves a 3.0% improvement in reducing instruction counts over the compiler, outperforming two state-of-the-art baselines that require thousands of compilations. Furthermore, the model shows surprisingly strong code reasoning abilities, generating compilable code 91% of the time and perfectly emulating the output of the compiler 70% of the time. △ Less

Submitted 11 September, 2023; originally announced September 2023.

arXiv:2303.01557 [pdf, other]

BenchDirect: A Directed Language Model for Compiler Benchmarks

Authors: Foivos Tsimpourlas, Pavlos Petoumenos, Min Xu, Chris Cummins, Kim Hazelwood, Ajitha Rajan, Hugh Leather

Abstract: The exponential increase of hardware-software complexity has made it impossible for compiler engineers to find the right optimization heuristics manually. Predictive models have been shown to find near optimal heuristics with little human effort but they are limited by a severe lack of diverse benchmarks to train on. Generative AI has been used by researchers to synthesize benchmarks into existing… ▽ More The exponential increase of hardware-software complexity has made it impossible for compiler engineers to find the right optimization heuristics manually. Predictive models have been shown to find near optimal heuristics with little human effort but they are limited by a severe lack of diverse benchmarks to train on. Generative AI has been used by researchers to synthesize benchmarks into existing datasets. However, the synthetic programs are short, exceedingly simple and lacking diversity in their features. We develop BenchPress, the first ML compiler benchmark generator that can be directed within source code feature representations. BenchPress synthesizes executable functions by infilling code that conditions on the program's left and right context. BenchPress uses active learning to introduce new benchmarks with unseen features into the dataset of Grewe's et al. CPU vs GPU heuristic, improving its acquired performance by 50%. BenchPress targets features that has been impossible for other synthesizers to reach. In 3 feature spaces, we outperform human-written code from GitHub, CLgen, CLSmith and the SRCIROR mutator in targeting the features of Rodinia benchmarks. BenchPress steers generation with beam search over a feature-agnostic language model. We improve this with BenchDirect which utilizes a directed LM that infills programs by jointly observing source code context and the compiler features that are targeted. BenchDirect achieves up to 36% better accuracy in targeting the features of Rodinia benchmarks, it is 1.8x more likely to give an exact match and it speeds up execution time by up to 72% compared to BenchPress. Both our models produce code that is difficult to distinguish from human-written code. We conduct a Turing test which shows our models' synthetic benchmarks are labelled as 'human-written' as often as human-written code from GitHub. △ Less

Submitted 2 March, 2023; originally announced March 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2208.06555

arXiv:2301.05104 [pdf, other]

Learning Compiler Pass Orders using Coreset and Normalized Value Prediction

Authors: Youwei Liang, Kevin Stone, Ali Shameli, Chris Cummins, Mostafa Elhoushi, Jiadong Guo, Benoit Steiner, Xiaomeng Yang, Pengtao Xie, Hugh Leather, Yuandong Tian

Abstract: Finding the optimal pass sequence of compilation can lead to a significant reduction in program size and/or improvement in program efficiency. Prior works on compilation pass ordering have two major drawbacks. They either require an excessive budget (in terms of compilation steps) at compile time or fail to generalize to unseen programs. In this paper, for code-size reduction tasks, we propose a n… ▽ More Finding the optimal pass sequence of compilation can lead to a significant reduction in program size and/or improvement in program efficiency. Prior works on compilation pass ordering have two major drawbacks. They either require an excessive budget (in terms of compilation steps) at compile time or fail to generalize to unseen programs. In this paper, for code-size reduction tasks, we propose a novel pipeline to find program-dependent pass sequences within 45 compilation calls. It first identifies a coreset of 50 pass sequences via greedy optimization of a submodular function, and then learns a policy with Graph Neural Network (GNN) to pick the optimal sequence by predicting the normalized values of the pass sequences in the coreset. Despite its simplicity, our pipeline outperforms the default -Oz flag by an average of 4.7% over a large collection (4683) of unseen code repositories from diverse domains across 14 datasets. In comparison, previous approaches like reinforcement learning on the raw pass sequence space may take days to train due to sparse reward, and may not generalize well in held-out ones from different domains. Our results demonstrate that existing human-designed compiler flags can be improved with a simple yet effective technique that transforms the raw action space into a small one with denser rewards. △ Less

Submitted 27 January, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

arXiv:2212.11353 [pdf, other]

Contrastive Distillation Is a Sample-Efficient Self-Supervised Loss Policy for Transfer Learning

Authors: Chris Lengerich, Gabriel Synnaeve, Amy Zhang, Hugh Leather, Kurt Shuster, François Charton, Charysse Redwood

Abstract: Traditional approaches to RL have focused on learning decision policies directly from episodic decisions, while slowly and implicitly learning the semantics of compositional representations needed for generalization. While some approaches have been adopted to refine representations via auxiliary self-supervised losses while simultaneously learning decision policies, learning compositional represen… ▽ More Traditional approaches to RL have focused on learning decision policies directly from episodic decisions, while slowly and implicitly learning the semantics of compositional representations needed for generalization. While some approaches have been adopted to refine representations via auxiliary self-supervised losses while simultaneously learning decision policies, learning compositional representations from hand-designed and context-independent self-supervised losses (multi-view) still adapts relatively slowly to the real world, which contains many non-IID subspaces requiring rapid distribution shift in both time and spatial attention patterns at varying levels of abstraction. In contrast, supervised language model cascades have shown the flexibility to adapt to many diverse manifolds, and hints of self-learning needed for autonomous task transfer. However, to date, transfer methods for language models like few-shot learning and fine-tuning still require human supervision and transfer learning using self-learning methods has been underexplored. We propose a self-supervised loss policy called contrastive distillation which manifests latent variables with high mutual information with both source and target tasks from weights to tokens. We show how this outperforms common methods of transfer learning and suggests a useful design axis of trading off compute for generalizability for online transfer. Contrastive distillation is improved through sampling from memory and suggests a simple algorithm for more efficiently sampling negative examples for contrastive losses than random sampling. △ Less

Submitted 21 December, 2022; originally announced December 2022.

arXiv:2208.06555 [pdf, other]

BenchPress: A Deep Active Benchmark Generator

Authors: Foivos Tsimpourlas, Pavlos Petoumenos, Min Xu, Chris Cummins, Kim Hazelwood, Ajitha Rajan, Hugh Leather

Abstract: We develop BenchPress, the first ML benchmark generator for compilers that is steerable within feature space representations of source code. BenchPress synthesizes compiling functions by adding new code in any part of an empty or existing sequence by jointly observing its left and right context, achieving excellent compilation rate. BenchPress steers benchmark generation towards desired target fea… ▽ More We develop BenchPress, the first ML benchmark generator for compilers that is steerable within feature space representations of source code. BenchPress synthesizes compiling functions by adding new code in any part of an empty or existing sequence by jointly observing its left and right context, achieving excellent compilation rate. BenchPress steers benchmark generation towards desired target features that has been impossible for state of the art synthesizers (or indeed humans) to reach. It performs better in targeting the features of Rodinia benchmarks in 3 different feature spaces compared with (a) CLgen - a state of the art ML synthesizer, (b) CLSmith fuzzer, (c) SRCIROR mutator or even (d) human-written code from GitHub. BenchPress is the first generator to search the feature space with active learning in order to generate benchmarks that will improve a downstream task. We show how using BenchPress, Grewe's et al. CPU vs GPU heuristic model can obtain a higher speedup when trained on BenchPress's benchmarks compared to other techniques. BenchPress is a powerful code generator: Its generated samples compile at a rate of 86%, compared to CLgen's 2.33%. Starting from an empty fixed input, BenchPress produces 10x more unique, compiling OpenCL benchmarks than CLgen, which are significantly larger and more feature diverse. △ Less

Submitted 15 August, 2022; v1 submitted 12 August, 2022; originally announced August 2022.

Comments: To appear in PACT 2022

arXiv:2207.03578 [pdf, other]

Code Translation with Compiler Representations

Authors: Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton, Patrick Labatut, Gabriel Synnaeve

Abstract: In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looki… ▽ More In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation. However, they treat the code as sequences of text tokens, and still do not differentiate well enough between similar pieces of code which have different semantics in different languages. The consequence is low quality translation, reducing the practicality of NMT, and stressing the need for an approach significantly increasing its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and up to 79% for the Java -> Rust pair with greedy decoding. We extend previous test sets for code translation, by adding hundreds of Go and Rust functions. Additionally, we train models with high performance on the problem of IR decompilation, generating programming source code from IR, and study using IRs as intermediary pivot for translation. △ Less

Submitted 24 April, 2023; v1 submitted 30 June, 2022; originally announced July 2022.

Comments: 9 pages

arXiv:2206.03684 [pdf]

Progress Report: A Deep Learning Guided Exploration of Affine Unimodular Loop Transformations

Authors: Massinissa Merouani, Khaled Afif Boudaoud, Iheb Nassim Aouadj, Nassim Tchoulak, Fatima Benbouzid-Sitayeb, Karima Benatchba, Hugh Leather, Riyadh Baghdadi

Abstract: In this paper, we present a work in progress about a deep learning based approach for automatic code optimization in polyhedral compilers. The proposed technique explores combinations of affine and non-affine loop transformations to find the sequence of transformations that minimizes the execution time of a given program. This exploration is guided by a deep learning based cost model that evaluate… ▽ More In this paper, we present a work in progress about a deep learning based approach for automatic code optimization in polyhedral compilers. The proposed technique explores combinations of affine and non-affine loop transformations to find the sequence of transformations that minimizes the execution time of a given program. This exploration is guided by a deep learning based cost model that evaluates the speedup that each sequence of transformations would yield. Preliminary results show that the proposed techniques achieve a 2.35x geometric mean speedup over state of the art polyhedral compilers (Pluto). △ Less

Submitted 8 June, 2022; originally announced June 2022.

Journal ref: In Proceedings of 12th International Workshop on Polyhedral Compilation Techniques (IMPACT 2022)

arXiv:2205.00618 [pdf, other]

LoopStack: a Lightweight Tensor Algebra Compiler Stack

Authors: Bram Wasti, José Pablo Cambronero, Benoit Steiner, Hugh Leather, Aleksandar Zlateski

Abstract: We present LoopStack, a domain specific compiler stack for tensor operations, composed of a frontend, LoopTool, and an efficient optimizing code generator, LoopNest. This stack enables us to compile entire neural networks and generate code targeting the AVX2, AVX512, NEON, and NEONfp16 instruction sets while incorporating optimizations often missing from other machine learning compiler backends. W… ▽ More We present LoopStack, a domain specific compiler stack for tensor operations, composed of a frontend, LoopTool, and an efficient optimizing code generator, LoopNest. This stack enables us to compile entire neural networks and generate code targeting the AVX2, AVX512, NEON, and NEONfp16 instruction sets while incorporating optimizations often missing from other machine learning compiler backends. We evaluate our stack on a collection of full neural networks and commonly used network blocks as well as individual operators, and show that LoopStack generates machine code that matches and frequently exceeds the performance of in state-of-the-art machine learning frameworks in both cases. We also show that for a large collection of schedules LoopNest's compilation is orders of magnitude faster than LLVM, while resulting in equal or improved run time performance. Additionally, LoopStack has a very small memory footprint - a binary size of 245KB, and under 30K lines of effective code makes it ideal for use on mobile and embedded devices. △ Less

Submitted 1 May, 2022; originally announced May 2022.

arXiv:2111.12116 [pdf, other]

Caviar: An E-graph Based TRS for Automatic Code Optimization

Authors: Smail Kourta, Adel Namani, Fatima Benbouzid-Si Tayeb, Kim Hazelwood, Chris Cummins, Hugh Leather, Riyadh Baghdadi

Abstract: Term Rewriting Systems (TRSs) are used in compilers to simplify and prove expressions. State-of-the-art TRSs in compilers use a greedy algorithm that applies a set of rewriting rules in a predefined order (where some of the rules are not axiomatic). This leads to a loss of the ability to simplify certain expressions. E-graphs and equality saturation sidestep this issue by representing the differen… ▽ More Term Rewriting Systems (TRSs) are used in compilers to simplify and prove expressions. State-of-the-art TRSs in compilers use a greedy algorithm that applies a set of rewriting rules in a predefined order (where some of the rules are not axiomatic). This leads to a loss of the ability to simplify certain expressions. E-graphs and equality saturation sidestep this issue by representing the different equivalent expressions in a compact manner from which the optimal expression can be extracted. While an e-graph-based TRS can be more powerful than a TRS that uses a greedy algorithm, it is slower because expressions may have a large or sometimes infinite number of equivalent expressions. Accelerating e-graph construction is crucial for making the use of e-graphs practical in compilers. In this paper, we present Caviar, an e-graph-based TRS for proving expressions within compilers. The main advantage of Caviar is its speed. It can prove expressions much faster than base e-graph TRSs. It relies on three techniques: 1) a technique that stops e-graphs from growing when the goal is reached, called Iteration Level Check; 2) a mechanism that balances exploration and exploitation in the equality saturation algorithm, called Pulsing Caviar; 3) a technique to stop e-graph construction before reaching saturation when a non-provable pattern is detected, called Non-Provable Patterns Detection (NPPD). We evaluate caviar on Halide, an optimizing compiler that relies on a greedy-algorithm-based TRS to simplify and prove its expressions. The proposed techniques allow Caviar to accelerate e-graph expansion for the task of proving expressions. They also allow Caviar to prove expressions that Halide's TRS cannot prove while being only 0.68x slower. △ Less

Submitted 27 February, 2022; v1 submitted 23 November, 2021; originally announced November 2021.

Comments: Accepted in the 31st Conference on Compiler Construction (CC 2022)

arXiv:2109.08267 [pdf, other]

CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research

Authors: Chris Cummins, Bram Wasti, Jiadong Guo, Brandon Cui, Jason Ansel, Sahir Gomez, Somya Jain, Jia Liu, Olivier Teytaud, Benoit Steiner, Yuandong Tian, Hugh Leather

Abstract: Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What is needed is a… ▽ More Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What is needed is an easy, reusable experimental infrastructure for real world compiler optimization tasks that can serve as a common benchmark for comparing techniques, and as a platform to accelerate progress in the field. We introduce CompilerGym, a set of environments for real world compiler optimization tasks, and a toolkit for exposing new optimization tasks to compiler researchers. CompilerGym enables anyone to experiment on production compiler optimization problems through an easy-to-use package, regardless of their experience with compilers. We build upon the popular OpenAI Gym interface enabling researchers to interact with compilers using Python and a familiar API. We describe the CompilerGym architecture and implementation, characterize the optimization spaces and computational efficiencies of three included compiler environments, and provide extensive empirical evaluations. Compared to prior works, CompilerGym offers larger datasets and optimization spaces, is 27x more computationally efficient, is fault-tolerant, and capable of detecting reproducibility bugs in the underlying compilers. In making it easy for anyone to experiment with compilers - irrespective of their background - we aim to accelerate progress in the AI and compiler research domains. △ Less

Submitted 22 December, 2021; v1 submitted 16 September, 2021; originally announced September 2021.

Comments: 12 pages. Source code available at https://github.com/facebookresearch/CompilerGym

arXiv:2108.12489 [pdf, ps, other]

Using Graph Neural Networks to model the performance of Deep Neural Networks

Authors: Shikhar Singh, Benoit Steiner, James Hegarty, Hugh Leather

Abstract: With the unprecedented proliferation of machine learning software, there is an ever-increasing need to generate efficient code for such applications. State-of-the-art deep-learning compilers like TVM and Halide incorporate a learning-based performance model to search the space of valid implementations of a given deep learning algorithm. For a given application, the model generates a performance me… ▽ More With the unprecedented proliferation of machine learning software, there is an ever-increasing need to generate efficient code for such applications. State-of-the-art deep-learning compilers like TVM and Halide incorporate a learning-based performance model to search the space of valid implementations of a given deep learning algorithm. For a given application, the model generates a performance metric such as the run time without executing the application on hardware. Such models speed up the compilation process by obviating the need to benchmark an enormous number of candidate implementations, referred to as schedules, on hardware. Existing performance models employ feed-forward networks, recurrent networks, or decision tree ensembles to estimate the performance of different implementations of a neural network. Graphs present a natural and intuitive way to model deep-learning networks where each node represents a computational stage or operation. Incorporating the inherent graph structure of these workloads in the performance model can enable a better representation and learning of inter-stage interactions. The accuracy of a performance model has direct implications on the efficiency of the search strategy, making it a crucial component of this class of deep-learning compilers. In this work, we develop a novel performance model that adopts a graph representation. In our model, each stage of computation represents a node characterized by features that capture the operations performed by the stage. The interaction between nodes is achieved using graph convolutions. Experimental evaluation shows a 7:75x and 12x reduction in prediction error compared to the Halide and TVM models, respectively. △ Less

Submitted 27 August, 2021; originally announced August 2021.

arXiv:2012.01470 [pdf, other]

Deep Data Flow Analysis

Authors: Chris Cummins, Hugh Leather, Zacharias Fisches, Tal Ben-Nun, Torsten Hoefler, Michael O'Boyle

Abstract: Compiler architects increasingly look to machine learning when building heuristics for compiler optimization. The promise of automatic heuristic design, freeing the compiler engineer from the complex interactions of program, architecture, and other optimizations, is alluring. However, most machine learning methods cannot replicate even the simplest of the abstract interpretations of data flow anal… ▽ More Compiler architects increasingly look to machine learning when building heuristics for compiler optimization. The promise of automatic heuristic design, freeing the compiler engineer from the complex interactions of program, architecture, and other optimizations, is alluring. However, most machine learning methods cannot replicate even the simplest of the abstract interpretations of data flow analysis that are critical to making good optimization decisions. This must change for machine learning to become the dominant technology in compiler heuristics. To this end, we propose ProGraML - Program Graphs for Machine Learning - a language-independent, portable representation of whole-program semantics for deep learning. To benchmark current and future learning techniques for compiler analyses we introduce an open dataset of 461k Intermediate Representation (IR) files for LLVM, covering five source programming languages, and 15.4M corresponding data flow results. We formulate data flow analysis as an MPNN and show that, using ProGraML, standard analyses can be learned, yielding improved performance on downstream compiler optimization tasks. △ Less

Submitted 20 November, 2020; originally announced December 2020.

Comments: 9 pages, plus appendices. arXiv admin note: text overlap with arXiv:2003.10536

arXiv:2011.14486 [pdf, other]

Value Function Based Performance Optimization of Deep Learning Workloads

Authors: Benoit Steiner, Chris Cummins, Horace He, Hugh Leather

Abstract: As machine learning techniques become ubiquitous, the efficiency of neural network implementations is becoming correspondingly paramount. Frameworks, such as Halide and TVM, separate out the algorithmic representation of the network from the schedule that determines its implementation. Finding good schedules, however, remains extremely challenging. We model this scheduling problem as a sequence of… ▽ More As machine learning techniques become ubiquitous, the efficiency of neural network implementations is becoming correspondingly paramount. Frameworks, such as Halide and TVM, separate out the algorithmic representation of the network from the schedule that determines its implementation. Finding good schedules, however, remains extremely challenging. We model this scheduling problem as a sequence of optimization choices, and present a new technique to accurately predict the expected performance of a partial schedule. By leveraging these predictions we can make these optimization decisions greedily and rapidly identify an efficient schedule. This enables us to find schedules that improve the throughput of deep neural networks by 2.6x over Halide and 1.5x over TVM. Moreover, our technique is two to three orders of magnitude faster than that of these tools, and completes in seconds instead of hours. △ Less

Submitted 29 November, 2020; originally announced November 2020.

arXiv:2003.10536 [pdf, other]

ProGraML: Graph-based Deep Learning for Program Optimization and Analysis

Authors: Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Hugh Leather

Abstract: The increasing complexity of computing systems places a tremendous burden on optimizing compilers, requiring ever more accurate and aggressive optimizations. Machine learning offers significant benefits for constructing optimization heuristics but there remains a gap between what state-of-the-art methods achieve and the performance of an optimal heuristic. Closing this gap requires improvements in… ▽ More The increasing complexity of computing systems places a tremendous burden on optimizing compilers, requiring ever more accurate and aggressive optimizations. Machine learning offers significant benefits for constructing optimization heuristics but there remains a gap between what state-of-the-art methods achieve and the performance of an optimal heuristic. Closing this gap requires improvements in two key areas: a representation that accurately captures the semantics of programs, and a model architecture with sufficient expressiveness to reason about this representation. We introduce ProGraML - Program Graphs for Machine Learning - a novel graph-based program representation using a low level, language agnostic, and portable format; and machine learning models capable of performing complex downstream tasks over these graphs. The ProGraML representation is a directed attributed multigraph that captures control, data, and call relations, and summarizes instruction and operand types and ordering. Message Passing Neural Networks propagate information through this structured representation, enabling whole-program or per-vertex classification tasks. ProGraML provides a general-purpose program representation that equips learnable models to perform the types of program analysis that are fundamental to optimization. To this end, we evaluate the performance of our approach first on a suite of traditional compiler analysis tasks: control flow reachability, dominator trees, data dependencies, variable liveness, and common subexpression detection. On a benchmark dataset of 250k LLVM-IR files covering six source programming languages, ProGraML achieves an average 94.0 F1 score, significantly outperforming the state-of-the-art approaches. We then apply our approach to two high-level tasks - heterogeneous device map** and program classification - setting new state-of-the-art performance in both. △ Less

Submitted 23 March, 2020; originally announced March 2020.

Comments: 20 pages, author preprint

arXiv:1511.02603 [pdf, other]

Iterative compilation on mobile devices

Authors: Paschalis Mpeis, Pavlos Petoumenos, Hugh Leather

Abstract: The abundance of poorly optimized mobile applications coupled with their increasing centrality in our digital lives make a framework for mobile app optimization an imperative. While tuning strategies for desktop and server applications have a long history, it is difficult to adapt them for use on mobile phones. Reference inputs which trigger behavior similar to a mobile application's typical are… ▽ More The abundance of poorly optimized mobile applications coupled with their increasing centrality in our digital lives make a framework for mobile app optimization an imperative. While tuning strategies for desktop and server applications have a long history, it is difficult to adapt them for use on mobile phones. Reference inputs which trigger behavior similar to a mobile application's typical are hard to construct. For many classes of applications the very concept of typical behavior is nonexistent, each user interacting with the application in very different ways. In contexts like this, optimization strategies need to evaluate their effectiveness against real user input, but doing so online runs the risk of user dissatisfaction when suboptimal optimizations are evaluated. In this paper we present an iterative compiler which employs a novel capture and replay technique in order to collect real user input and use it later to evaluate different transformations offline. The proposed mechanism identifies and stores only the set of memory pages needed to replay the most heavily used functions of the application. At idle periods, this minimal state is combined with different binaries of the application, each one build with different optimizations enabled. Replaying the targeted functions allows us to evaluate the effectiveness of each set of optimizations for the actual way the user interacts with the application. For the BEEBS benchmark suite, our approach was able to improve performance by up to 57%, while kee** the slowdown experienced by the user on average at 0.8%. By focusing only on heavily used functions, we are able to conserve storage space by between two and three orders of magnitude compared to typical capture and replay implementations. △ Less

Submitted 6 January, 2016; v1 submitted 9 November, 2015; originally announced November 2015.

Comments: 8 pages, 8 figures

arXiv:1511.02490 [pdf, other]

Autotuning OpenCL Workgroup Size for Stencil Patterns

Authors: Chris Cummins, Pavlos Petoumenos, Michel Steuwer, Hugh Leather

Abstract: Selecting an appropriate workgroup size is critical for the performance of OpenCL kernels, and requires knowledge of the underlying hardware, the data being operated on, and the implementation of the kernel. This makes portable performance of OpenCL programs a challenging goal, since simple heuristics and statically chosen values fail to exploit the available performance. To address this, we propo… ▽ More Selecting an appropriate workgroup size is critical for the performance of OpenCL kernels, and requires knowledge of the underlying hardware, the data being operated on, and the implementation of the kernel. This makes portable performance of OpenCL programs a challenging goal, since simple heuristics and statically chosen values fail to exploit the available performance. To address this, we propose the use of machine learning-enabled autotuning to automatically predict workgroup sizes for stencil patterns on CPUs and multi-GPUs. We present three methodologies for predicting workgroup sizes. The first, using classifiers to select the optimal workgroup size. The second and third proposed methodologies employ the novel use of regressors for performing classification by predicting the runtime of kernels and the relative performance of different workgroup sizes, respectively. We evaluate the effectiveness of each technique in an empirical study of 429 combinations of architecture, kernel, and dataset, comparing an average of 629 different workgroup sizes for each. We find that autotuning provides a median 3.79x speedup over the best possible fixed workgroup size, achieving 94% of the maximum performance. △ Less

Submitted 6 January, 2016; v1 submitted 8 November, 2015; originally announced November 2015.

Comments: 8 pages, 6 figures, presented at the 6th International Workshop on Adaptive Self-tuning Computing Systems (ADAPT '16)

Showing 1–21 of 21 results for author: Leather, H