Search | arXiv e-print repository

Automatic Hardware Pragma Insertion in High-Level Synthesis: A Non-Linear Programming Approach

Authors: Stéphane Pouget, Louis-Noël Pouchet, Jason Cong

Abstract: High-Level Synthesis enables the rapid prototy** of hardware accelerators, by combining a high-level description of the functional behavior of a kernel with a set of micro-architecture optimizations as inputs. Such optimizations can be described by inserting pragmas e.g. pipelining and replication of units, or even higher level transformations for HLS such as automatic data caching using the AMD… ▽ More High-Level Synthesis enables the rapid prototy** of hardware accelerators, by combining a high-level description of the functional behavior of a kernel with a set of micro-architecture optimizations as inputs. Such optimizations can be described by inserting pragmas e.g. pipelining and replication of units, or even higher level transformations for HLS such as automatic data caching using the AMD/Xilinx Merlin compiler. Selecting the best combination of pragmas, even within a restricted set, remains particularly challenging and the typical state-of-practice uses design-space exploration to navigate this space. But due to the highly irregular performance distribution of pragma configurations, typical DSE approaches are either extremely time consuming, or operating on a severely restricted search space. This work proposes a framework to automatically insert HLS pragmas in regular loop-based programs, supporting pipelining, unit replication, and data caching. We develop an analytical performance and resource model as a function of the input program properties and pragmas inserted, using non-linear constraints and objectives. We prove this model provides a lower bound on the actual performance after HLS. We then encode this model as a Non-Linear Program, by making the pragma configuration unknowns of the system, which is computed optimally by solving this NLP. This approach can also be used during DSE, to quickly prune points with a (possibly partial) pragma configuration, driven by lower bounds on achievable latency. We extensively evaluate our end-to-end, fully implemented system, showing it can effectively manipulate spaces of billions of designs in seconds to minutes for the kernels evaluated. △ Less

Submitted 30 June, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.03058 [pdf, other]

Enhancing High-Level Synthesis with Automated Pragma Insertion and Code Transformation Framework

Authors: Stéphane Pouget, Louis-Noël Pouchet, Jason Cong

Abstract: High-level synthesis, source-to-source compilers, and various Design Space Exploration techniques for pragma insertion have significantly improved the Quality of Results of generated designs. These tools offer benefits such as reduced development time and enhanced performance. However, achieving high-quality results often requires additional manual code transformations and tiling selections, which… ▽ More High-level synthesis, source-to-source compilers, and various Design Space Exploration techniques for pragma insertion have significantly improved the Quality of Results of generated designs. These tools offer benefits such as reduced development time and enhanced performance. However, achieving high-quality results often requires additional manual code transformations and tiling selections, which are typically performed separately or as pre-processing steps. Although DSE techniques enable code transformation upfront, the vastness of the search space often limits the exploration of all possible code transformations, making it challenging to determine which transformations are necessary. Additionally, ensuring correctness remains challenging, especially for complex transformations and optimizations. To tackle this obstacle, we first propose a comprehensive framework leveraging HLS compilers. Our system streamlines code transformation, pragma insertion, and tiles size selection for on-chip data caching through a unified optimization problem, aiming to enhance parallelization, particularly beneficial for computation-bound kernels. Them employing a novel Non-Linear Programming (NLP) approach, we simultaneously ascertain transformations, pragmas, and tile sizes, focusing on regular loop-based kernels. Our evaluation demonstrates that our framework adeptly identifies the appropriate transformations, including scenarios where no transformation is necessary, and inserts pragmas to achieve a favorable Quality of Results. △ Less

Submitted 21 May, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

arXiv:2109.10476 [pdf, other]

doi 10.1109/TSE.2023.3271065

Self-Supervised Learning to Prove Equivalence Between Straight-Line Programs via Rewrite Rules

Authors: Steve Kommrusch, Martin Monperrus, Louis-Noël Pouchet

Abstract: We target the problem of automatically synthesizing proofs of semantic equivalence between two programs made of sequences of statements. We represent programs using abstract syntax trees (AST), where a given set of semantics-preserving rewrite rules can be applied on a specific AST pattern to generate a transformed and semantically equivalent program. In our system, two programs are equivalent if… ▽ More We target the problem of automatically synthesizing proofs of semantic equivalence between two programs made of sequences of statements. We represent programs using abstract syntax trees (AST), where a given set of semantics-preserving rewrite rules can be applied on a specific AST pattern to generate a transformed and semantically equivalent program. In our system, two programs are equivalent if there exists a sequence of application of these rewrite rules that leads to rewriting one program into the other. We propose a neural network architecture based on a transformer model to generate proofs of equivalence between program pairs. The system outputs a sequence of rewrites, and the validity of the sequence is simply checked by verifying it can be applied. If no valid sequence is produced by the neural network, the system reports the programs as non-equivalent, ensuring by design no programs may be incorrectly reported as equivalent. Our system is fully implemented for one single grammar which can represent straight-line programs with function calls and multiple types. To efficiently train the system to generate such sequences, we develop an original incremental training technique, named self-supervised sample selection. We extensively study the effectiveness of this novel training approach on proofs of increasing complexity and length. Our system, S4Eq, achieves 97% proof success on a curated dataset of 10,000 pairs of equivalent programs. △ Less

Submitted 8 July, 2023; v1 submitted 21 September, 2021; originally announced September 2021.

Comments: 30 pages including appendix

Journal ref: IEEE Transactions on Software Engineering, 2023

arXiv:2106.02452 [pdf, other]

Proving Equivalence Between Complex Expressions Using Graph-to-Sequence Neural Models

Authors: Steve Kommrusch, Théo Barollet, Louis-Noël Pouchet

Abstract: We target the problem of provably computing the equivalence between two complex expression trees. To this end, we formalize the problem of equivalence between two such programs as finding a set of semantics-preserving rewrite rules from one into the other, such that after the rewrite the two programs are structurally identical, and therefore trivially equivalent.We then develop a graph-to-sequence… ▽ More We target the problem of provably computing the equivalence between two complex expression trees. To this end, we formalize the problem of equivalence between two such programs as finding a set of semantics-preserving rewrite rules from one into the other, such that after the rewrite the two programs are structurally identical, and therefore trivially equivalent.We then develop a graph-to-sequence neural network system for program equivalence, trained to produce such rewrite sequences from a carefully crafted automatic example generation algorithm. We extensively evaluate our system on a rich multi-type linear algebra expression language, using arbitrary combinations of 100+ graph-rewriting axioms of equivalence. Our machine learning system guarantees correctness for all true negatives, and ensures 0 false positive by design. It outputs via inference a valid proof of equivalence for 93% of the 10,000 equivalent expression pairs isolated for testing, using up to 50-term expressions. In all cases, the validity of the sequence produced and therefore the provable assertion of program equivalence is always computable, in negligible time. △ Less

Submitted 8 June, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

Comments: 10 pages (24 including references and appendices), 8 figures, 17 tables. arXiv admin note: substantial text overlap with arXiv:2002.06799. Updated to include funding acknowledgement

arXiv:2012.11473 [pdf, other]

PALMED: Throughput Characterization for Superscalar Architectures -- Extended Version

Authors: Nicolas Derumigny, Fabian Gruber, Théophile Bastian, Guillaume Iooss, Christophe Guillon, Louis-Noël Pouchet, Fabrice Rastello

Abstract: In a super-scalar architecture, the scheduler dynamically assigns micro-operations ($μ$OPs) to execution ports. The port map** of an architecture describes how an instruction decomposes into $μ$OPs and lists for each $μ$OP the set of ports it can be mapped to. It is used by compilers and performance debugging tools to characterize the performance throughput of a sequence of instructions repeated… ▽ More In a super-scalar architecture, the scheduler dynamically assigns micro-operations ($μ$OPs) to execution ports. The port map** of an architecture describes how an instruction decomposes into $μ$OPs and lists for each $μ$OP the set of ports it can be mapped to. It is used by compilers and performance debugging tools to characterize the performance throughput of a sequence of instructions repeatedly executed as the core component of a loop. This paper introduces a dual equivalent representation: The resource map** of an architecture is an abstract model where, to be executed, an instruction must use a set of abstract resources, themselves representing combinations of execution ports. For a given architecture, finding a port map** is an important but difficult problem. Building a resource map** is a more tractable problem and provides a simpler and equivalent model. This paper describes Palmed, a tool that automatically builds a resource map** for pipelined, super-scalar, out-of-order CPU architectures. Palmed does not require hardware performance counters, and relies solely on runtime measurements. We evaluate the pertinence of our dual representation for throughput modeling by extracting a representative set of basic-blocks from the compiled binaries of the SPEC CPU 2017 benchmarks. We compared the throughput predicted by existing machine models to that produced by Palmed, and found comparable accuracy to state-of-the art tools, achieving sub-10 % mean square error rate on this workload on Intel's Skylake microarchitecture. △ Less

Submitted 18 January, 2022; v1 submitted 21 December, 2020; originally announced December 2020.

arXiv:2011.05422 [pdf, other]

Coherence Traffic in Manycore Processors with Opaque Distributed Directories

Authors: Steve Kommrusch, Marcos Horro, Louis-Noël Pouchet, Gabriel Rodríguez, Juan Touriño

Abstract: Manycore processors feature a high number of general-purpose cores designed to work in a multithreaded fashion. Recent manycore processors are kept coherent using scalable distributed directories. A paramount example is the Intel Mesh interconnect, which consists of a network-on-chip interconnecting "tiles", each of which contains computation cores, local caches, and coherence masters. The distrib… ▽ More Manycore processors feature a high number of general-purpose cores designed to work in a multithreaded fashion. Recent manycore processors are kept coherent using scalable distributed directories. A paramount example is the Intel Mesh interconnect, which consists of a network-on-chip interconnecting "tiles", each of which contains computation cores, local caches, and coherence masters. The distributed coherence subsystem must be queried for every out-of-tile access, imposing an overhead on memory latency. This paper studies the physical layout of an Intel Knights Landing processor, with a particular focus on the coherence subsystem, and uncovers the pseudo-random map** function of physical memory blocks across the pieces of the distributed directory. Leveraging this knowledge, candidate optimizations to improve memory latency through the minimization of coherence traffic are studied. Although these optimizations do improve memory throughput, ultimately this does not translate into performance gains due to inherent overheads stemming from the computational complexity of the map** functions. △ Less

Submitted 10 November, 2020; originally announced November 2020.

Comments: 17 pages, 13 figures, submitted to IEEE for possible publication

arXiv:2002.06799 [pdf, other]

Equivalence of Dataflow Graphs via Rewrite Rules Using a Graph-to-Sequence Neural Model

Authors: Steve Kommrusch, Théo Barollet, Louis-Noël Pouchet

Abstract: In this work we target the problem of provably computing the equivalence between two programs represented as dataflow graphs. To this end, we formalize the problem of equivalence between two programs as finding a set of semantics-preserving rewrite rules from one into the other, such that after the rewrite the two programs are structurally identical, and therefore trivially equivalent. We then dev… ▽ More In this work we target the problem of provably computing the equivalence between two programs represented as dataflow graphs. To this end, we formalize the problem of equivalence between two programs as finding a set of semantics-preserving rewrite rules from one into the other, such that after the rewrite the two programs are structurally identical, and therefore trivially equivalent. We then develop the first graph-to-sequence neural network system for program equivalence, trained to produce such rewrite sequences from a carefully crafted automatic example generation algorithm. We extensively evaluate our system on a rich multi-type linear algebra expression language, using arbitrary combinations of 100+ graph-rewriting axioms of equivalence. Our system outputs via inference a correct rewrite sequence for 96% of the 10,000 program pairs isolated for testing, using 30-term programs. And in all cases, the validity of the sequence produced and therefore the provable assertion of program equivalence is computable, in negligible time. △ Less

Submitted 3 June, 2021; v1 submitted 17 February, 2020; originally announced February 2020.

Comments: 20 pages including references and appendices, 10 figures, updated to include acknowledgement

arXiv:1911.06664 [pdf, other]

Automated Derivation of Parametric Data Movement Lower Bounds for Affine Programs

Authors: Auguste Olivry, Julien Langou, Louis-Noël Pouchet, P. Sadayappan, Fabrice Rastello

Abstract: For most relevant computation, the energy and time needed for data movement dominates that for performing arithmetic operations on all computing systems today. Hence it is of critical importance to understand the minimal total data movement achievable during the execution of an algorithm. The achieved total data movement for different schedules of an algorithm can vary widely depending on how effi… ▽ More For most relevant computation, the energy and time needed for data movement dominates that for performing arithmetic operations on all computing systems today. Hence it is of critical importance to understand the minimal total data movement achievable during the execution of an algorithm. The achieved total data movement for different schedules of an algorithm can vary widely depending on how efficiently the cache is used, e.g., untiled versus effectively tiled matrix-matrix multiplication. A significant current challenge is that no existing tool is able to meaningfully quantify the potential reduction to the data movement of a computation that can be achieved by more effective use of the cache through operation rescheduling. Asymptotic parametric expressions of data movement lower bounds have previously been manually derived for a limited number of algorithms, often without scaling constants. In this paper, we present the first compile-time approach for deriving non-asymptotic parametric expressions of data movement lower bounds for arbitrary affine computations. The approach has been implemented in a fully automatic tool (IOLB) that can generate these lower bounds for input affine programs. IOLB's use is demonstrated by exercising it on all the benchmarks of the PolyBench suite. The advantages of IOLB are many: (1) IOLB enables us to derive bounds for few dozens of algorithms for which these lower bounds have never been derived. This reflects an increase of productivity by automation. (2) Anyone is able to obtain these lower bounds through IOLB, no expertise is required. (3) For some of the most well-studied algorithms, the lower bounds obtained by \tool are higher than any previously reported manually derived lower bounds. △ Less

Submitted 15 November, 2019; originally announced November 2019.

arXiv:1901.01808 [pdf, other]

doi 10.1109/TSE.2019.2940179

SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair

Authors: Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, Martin Monperrus

Abstract: This paper presents a novel end-to-end approach to program repair based on sequence-to-sequence learning. We devise, implement, and evaluate a system, called SequenceR, for fixing bugs based on sequence-to-sequence learning on source code. This approach uses the copy mechanism to overcome the unlimited vocabulary problem that occurs with big code. Our system is data-driven; we train it on 35,578 s… ▽ More This paper presents a novel end-to-end approach to program repair based on sequence-to-sequence learning. We devise, implement, and evaluate a system, called SequenceR, for fixing bugs based on sequence-to-sequence learning on source code. This approach uses the copy mechanism to overcome the unlimited vocabulary problem that occurs with big code. Our system is data-driven; we train it on 35,578 samples, carefully curated from commits to open-source repositories. We evaluate it on 4,711 independent real bug fixes, as well on the Defects4J benchmark used in program repair research. SequenceR is able to perfectly predict the fixed line for 950/4711 testing samples, and find correct patches for 14 bugs in Defects4J. It captures a wide range of repair operators without any domain-specific top-down design. △ Less

Submitted 9 September, 2019; v1 submitted 24 December, 2018; originally announced January 2019.

Comments: 21 pages, 15 figures

Journal ref: IEEE Transactions on Software Engineering, 2019

arXiv:1811.07999 [pdf, other]

Synthetic Lung Nodule 3D Image Generation Using Autoencoders

Authors: Steve Kommrusch, Louis-Noël Pouchet

Abstract: One of the challenges of using machine learning techniques with medical data is the frequent dearth of source image data on which to train. A representative example is automated lung cancer diagnosis, where nodule images need to be classified as suspicious or benign. In this work we propose an automatic synthetic lung nodule image generator. Our 3D shape generator is designed to augment the variet… ▽ More One of the challenges of using machine learning techniques with medical data is the frequent dearth of source image data on which to train. A representative example is automated lung cancer diagnosis, where nodule images need to be classified as suspicious or benign. In this work we propose an automatic synthetic lung nodule image generator. Our 3D shape generator is designed to augment the variety of 3D images. Our proposed system takes root in autoencoder techniques, and we provide extensive experimental characterization that demonstrates its ability to produce quality synthetic images. △ Less

Submitted 9 September, 2019; v1 submitted 19 November, 2018; originally announced November 2018.

Comments: 19 pages, 12 figures, full paper for work initially presented at IJCAI 2018

Report number: CS-18-101

arXiv:1811.06043 [pdf, other]

A Performance Vocabulary for Affine Loop Transformations

Authors: Martin Kong, Louis-Noël Pouchet

Abstract: Modern polyhedral compilers excel at aggressively optimizing codes with static control parts, but the state-of-practice to find high-performance polyhedral transformations especially for different hardware targets still largely involves auto-tuning. In this work we propose a novel polyhedral scheduling technique, with the aim to reduce the need for auto-tuning while allowing to build customizable… ▽ More Modern polyhedral compilers excel at aggressively optimizing codes with static control parts, but the state-of-practice to find high-performance polyhedral transformations especially for different hardware targets still largely involves auto-tuning. In this work we propose a novel polyhedral scheduling technique, with the aim to reduce the need for auto-tuning while allowing to build customizable and specific transformation strategies. We design constraints and objectives that model several crucial aspects of performance such as stride optimization or the trade-off between parallelism and reuse, while taking into account important architectural features of the target machine. The developed set of objectives embody a Performance Vocabulary for loop transformations. The goal is to use this vocabulary, consisting of performance idioms, to construct transformation recipes adapted to a number of program classes. We evaluate our work using the PolyBench/C benchmark suite and experimentally validate it against large optimization spaces generated with the Pluto compiler on a 10-core Intel Core-i9 (Skylake-X). Our results show that we can achieve comparable or superior performance to Pluto on the majority of benchmarks, without implementing tiling in the source code nor using experimental autotuning. △ Less

Submitted 9 April, 2019; v1 submitted 14 November, 2018; originally announced November 2018.

MSC Class: 68N20 ACM Class: D.2.6; D.3.0

arXiv:1411.2286 [pdf, ps, other]

doi 10.1145/2676726.2677010

On Characterizing the Data Access Complexity of Programs

Authors: Venmugil Elango, Fabrice Rastello, Louis-Noel Pouchet, J. Ramanujam, P. Sadayappan

Abstract: Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Therefore, computational complexity will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data access complexity will be increasingly important. The problem of develo** lower bounds for data access complexity has be… ▽ More Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Therefore, computational complexity will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data access complexity will be increasingly important. The problem of develo** lower bounds for data access complexity has been modeled using the formalism of Hong & Kung's red/blue pebble game for computational directed acyclic graphs (CDAGs). However, previously developed approaches to lower bounds analysis for the red/blue pebble game are very limited in effectiveness when applied to CDAGs of real programs, with computations comprised of multiple sub-computations with differing DAG structure. We address this problem by develo** an approach for effectively composing lower bounds based on graph decomposition. We also develop a static analysis algorithm to derive the asymptotic data-access lower bounds of programs, as a function of the problem size and cache size. △ Less

Submitted 9 November, 2014; originally announced November 2014.

ACM Class: F.2; D.2.8

arXiv:1404.4767 [pdf, other]

On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution

Authors: Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, P. Sadayappan

Abstract: Technology trends are making the cost of data movement increasingly dominant, both in terms of energy and time, over the cost of performing arithmetic operations in computer systems. The fundamental ratio of aggregate data movement bandwidth to the total computational power (also referred to the machine balance parameter) in parallel computer systems is decreasing. It is there- fore of considerabl… ▽ More Technology trends are making the cost of data movement increasingly dominant, both in terms of energy and time, over the cost of performing arithmetic operations in computer systems. The fundamental ratio of aggregate data movement bandwidth to the total computational power (also referred to the machine balance parameter) in parallel computer systems is decreasing. It is there- fore of considerable importance to characterize the inherent data movement requirements of parallel algorithms, so that the minimal architectural balance parameters required to support it on future systems can be well understood. In this paper, we develop an extension of the well-known red-blue pebble game to develop lower bounds on the data movement complexity for the parallel execution of computational directed acyclic graphs (CDAGs) on parallel systems. We model multi-node multi-core parallel systems, with the total physical memory distributed across the nodes (that are connected through some interconnection network) and in a multi-level shared cache hierarchy for processors within a node. We also develop new techniques for lower bound characterization of non-homogeneous CDAGs. We demonstrate the use of the methodology by analyzing the CDAGs of several numerical algorithms, to develop lower bounds on data movement for their parallel execution. △ Less

Submitted 18 April, 2014; originally announced April 2014.

Report number: RR-8522

arXiv:1401.5024 [pdf, other]

Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential

Authors: Naznin Fauzia, Venmugil Elango, Mahesh Ravishankar, J. Ramanujam, Fabrice Rastello, Atanas Rountev, Louis-Noël Pouchet, P. Sadayappan

Abstract: Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in order… ▽ More Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in order to guide efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool to perform data locality characterization of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates on the minimum bandwidth requirements at different levels of the memory hierarchy to avoid being bandwidth bound. However, such an analysis only holds for the particular execution order that produced the trace. It cannot estimate potential improvement in data locality through dependence preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality. The approach enables us to go beyond reuse distance analysis of a single specific order of execution of the operations of a computation in characterization of its data locality properties. It can serve a valuable role in identifying promising code regions for manual transformation, as well as assessing the effectiveness of compiler transformations for data locality enhancement. We demonstrate the effectiveness of the approach using a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance. △ Less

Submitted 21 December, 2013; originally announced January 2014.

Comments: Transaction on Architecture and Code Optimization (2014)

arXiv:1111.6756 [pdf, ps, other]

The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization

Authors: Riyadh Baghdadi, Albert Cohen, Cedric Bastoul, Louis-Noel Pouchet, Lawrence Rauchwerger

Abstract: Research in automatic parallelization of loop-centric programs started with static analysis, then broadened its arsenal to include dynamic inspection-execution and speculative execution, the best results involving hybrid static-dynamic schemes. Beyond the detection of parallelism in a sequential program, scalable parallelization on many-core processors involves hard and interesting parallelism ada… ▽ More Research in automatic parallelization of loop-centric programs started with static analysis, then broadened its arsenal to include dynamic inspection-execution and speculative execution, the best results involving hybrid static-dynamic schemes. Beyond the detection of parallelism in a sequential program, scalable parallelization on many-core processors involves hard and interesting parallelism adaptation and map** challenges. These challenges include tailoring data locality to the memory hierarchy, structuring independent tasks hierarchically to exploit multiple levels of parallelism, tuning the synchronization grain, balancing the execution load, decoupling the execution into thread-level pipelines, and leveraging heterogeneous hardware with specialized accelerators. The polyhedral framework allows to model, construct and apply very complex loop nest transformations addressing most of the parallelism adaptation and map** challenges. But apart from hardware-specific, back-end oriented transformations (if-conversion, trace scheduling, value prediction), loop nest optimization has essentially ignored dynamic and speculative techniques. Research in polyhedral compilation recently reached a significant milestone towards the support of dynamic, data-dependent control flow. This opens a large avenue for blending dynamic analyses and speculative techniques with advanced loop nest optimizations. Selecting real-world examples from SPEC benchmarks and numerical kernels, we make a case for the design of synergistic static, dynamic and speculative loop transformation techniques. We also sketch the embedding of dynamic information, including speculative assumptions, in the heart of affine transformation search spaces. △ Less

Submitted 29 November, 2011; originally announced November 2011.

Showing 1–15 of 15 results for author: Pouchet, L