-
Foams with 3D Spatially Programmed Mechanics Enabled by Autonomous Active Learning on Viscous Thread Printing
Authors:
Brett Emery,
Kelsey L. Snapp,
Daniel Revier,
Vivek Sarkar,
Masa Nakura,
Keith A. Brown,
Jeffrey Ian Lipton
Abstract:
Foams are versatile by nature and ubiquitous in a wide range of applications, including padding, insulation, and acoustic dampening. Previous work established that foams 3D printed via Viscous Thread Printing (VTP) can in principle combine the flexibility of 3D printing with the mechanical properties of conventional foams. However, the generality of prior work is limited due to the lack of predictable process-property relationships. In this work, we utilize a self-driving lab that combines automated experimentation with machine learning to identify a processing subspace in which dimensionally consistent materials are produced using VTP with spatially programmable mechanical properties. In carrying out this process, we discover an underlying self-stabilizing characteristic of VTP layer thickness, an important feature for its extension to new materials and systems. Several complex exemplars are constructed to illustrate the newly enabled capabilities of foams produced via VTP, including 1D gradient rectangular slabs, 2D localized stiffness zones on an insole orthotic and living hinges, and programmed 3D deformation via a cable-driven humanoid hand. Predictive mapping models are developed and validated for both thermoplastic polyurethane (TPU) and polylactic acid (PLA) filaments, suggesting the ability to train a model for any material suitable for material extrusion (ME) 3D printing.
Submitted 8 July, 2024;
originally announced July 2024.
-
Enabling Multi-threading in Heterogeneous Quantum-Classical Programming Models
Authors:
Akihiro Hayashi,
Austin Adams,
Jeffrey Young,
Alexander McCaskey,
Eugene Dumitrescu,
Vivek Sarkar,
Thomas M. Conte
Abstract:
In this paper, we address some of the key limitations to realizing a generic heterogeneous parallel programming model for quantum-classical heterogeneous platforms. We discuss our experience in enabling user-level multi-threading in QCOR as well as challenges that need to be addressed for programming future quantum-classical systems. Specifically, we discuss our design and implementation of C++-based parallel constructs that enable 1) parallel execution of a quantum kernel with std::thread and 2) asynchronous execution with std::async. To do so, we provide a detailed overview of the current implementation of the QCOR programming model and runtime, and discuss how we 1) add thread safety to some of its user-facing API routines, and 2) increase parallelism in QCOR by removing data races that inhibit multi-threading so as to better utilize available computing resources. We also present preliminary performance results with the Quantum++ back end on a single-node Ryzen 9 3900X machine that has 12 physical cores (24 hardware threads) with 128GB of RAM. The results show that running two Bell kernels with 12 threads per kernel in parallel outperforms running the kernels one after the other each with 24 threads (1.63x improvement). In addition, we observe the same trend when running two Shor's algorithm kernels in parallel (1.22x faster than executing the kernels one after the other). Furthermore, the parallel version is better in terms of strong scalability. We believe that our design, implementation, and results will open up an opportunity not only for 1) enabling quicker prototyping of parallel/asynchrony-aware quantum-classical algorithms on quantum circuit simulators in the short term, but also for 2) realizing a generic heterogeneous parallel programming model for quantum-classical heterogeneous platforms in the long term.
Submitted 15 March, 2023; v1 submitted 27 January, 2023;
originally announced January 2023.
-
Automatic Parallelization of Python Programs for Distributed Heterogeneous Computing
Authors:
Jun Shirako,
Akihiro Hayashi,
Sri Raj Paul,
Alexey Tumanov,
Vivek Sarkar
Abstract:
This paper introduces a novel approach to automatic ahead-of-time (AOT) parallelization and optimization of sequential Python programs for execution on distributed heterogeneous platforms. Our approach enables AOT source-to-source transformation of Python programs, driven by the inclusion of type hints for function parameters and return values. These hints can be supplied by the programmer or obtained by dynamic profiler tools; multi-version code generation guarantees the correctness of our AOT transformation in all cases.
Our compilation framework performs automatic parallelization and sophisticated high-level code optimizations for the target distributed heterogeneous hardware platform. It includes extensions to the polyhedral framework that unify user-written loops and implicit loops present in matrix/tensor operators, as well as automated selection of CPU vs. GPU code variants. Further, our polyhedral optimizations enable both intra-node and inter-node parallelism. Finally, the optimized output code is deployed using the Ray runtime for scheduling distributed tasks across multiple heterogeneous nodes in a cluster.
Our empirical evaluation shows significant performance improvements relative to sequential Python in both single-node and multi-node experiments, with a performance improvement of over 20,000× when using 24 nodes and 144 GPUs in the OLCF Summit supercomputer for the Space-Time Adaptive Processing (STAP) radar application.
Submitted 11 March, 2022;
originally announced March 2022.
-
A Scalable Actor-based Programming System for PGAS Runtimes
Authors:
Sri Raj Paul,
Akihiro Hayashi,
Kun Chen,
Vivek Sarkar
Abstract:
The PGAS model is well suited for executing irregular applications on cluster-based systems, due to its efficient support for short, one-sided messages. However, there are currently two major limitations faced by PGAS applications. The first relates to scalability: despite the availability of APIs that support non-blocking operations in special cases, many PGAS operations on remote locations are synchronous by default, which can lead to long-latency stalls and poor scalability. The second relates to productivity: while it is simpler for the developer to express all communications at a fine granularity that is natural to the application, experiments have shown that such a natural expression results in performance that is 20x slower than more efficient but less productive code that requires manual message aggregation and termination detection.
In this paper, we introduce a new programming system for PGAS applications, in which point-to-point remote operations can be expressed as fine-grained asynchronous actor messages. In this approach, the programmer does not need to worry about programming complexities related to message aggregation and termination detection. Our approach can also be viewed as extending the classical Bulk Synchronous Parallelism model with fine-grained asynchronous communications within a phase or superstep. We believe that our approach offers a desirable point in the productivity-performance space for PGAS applications, with more scalable performance and higher productivity relative to past approaches. Specifically, for seven irregular mini-applications from the Bale benchmark suite executed using 2048 cores in the NERSC Cori system, our approach shows geometric mean performance improvements of >=20x relative to standard PGAS versions (UPC and OpenSHMEM) while maintaining comparable productivity to those versions.
Submitted 18 June, 2022; v1 submitted 12 July, 2021;
originally announced July 2021.
-
An Ownership Policy and Deadlock Detector for Promises
Authors:
Caleb Voss,
Vivek Sarkar
Abstract:
Task-parallel programs often enjoy deadlock freedom under certain restrictions, such as the use of structured join operations, as in Cilk and X10, or the use of asynchronous task futures together with deadlock-avoiding policies such as Known Joins or Transitive Joins. However, the promise, a popular synchronization primitive for parallel tasks, does not enjoy deadlock-freedom guarantees. Promises can exhibit deadlock-like bugs; however, the concept of a deadlock is not currently well-defined for promises.
To address these challenges, we propose an ownership semantics in which each promise is associated with the task that currently intends to fulfill it. Ownership immediately enables the identification of bugs in which a task fails to fulfill a promise for which it is responsible. Ownership further enables the discussion of deadlock cycles among tasks and promises and allows us to introduce a robust definition of deadlock-like bugs for promises.
Cycle detection in this context is non-trivial because it is concurrent with changes in promise ownership. We provide a lock-free algorithm for precise runtime deadlock detection. We show how to obtain the memory consistency criteria required for the correctness of our algorithm under TSO and the Java and C++ memory models. An evaluation compares the execution time and memory usage overheads of our detection algorithm on benchmark programs relative to an unverified baseline. Our detector exhibits a 12% (1.12×) geometric mean time overhead and a 6% (1.06×) geometric mean memory overhead, which are smaller overheads than in past approaches to deadlock cycle detection.
Submitted 4 January, 2021;
originally announced January 2021.
-
Task-Graph Scheduling Extensions for Efficient Synchronization and Communication
Authors:
Seonmyeong Bak,
Oscar Hernandez,
Mark Gates,
Piotr Luszczek,
Vivek Sarkar
Abstract:
Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in programming models such as OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization from inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations.
Submitted 6 November, 2020;
originally announced November 2020.
-
Advanced Graph-Based Deep Learning for Probabilistic Type Inference
Authors:
Fangke Ye,
Jisheng Zhao,
Vivek Sarkar
Abstract:
Dynamically typed languages such as JavaScript and Python have emerged as the most popular programming languages in use. Important benefits can accrue from including type annotations in dynamically typed programs. This approach to gradual typing is exemplified by the TypeScript programming system which allows programmers to specify partially typed programs, and then uses static analysis to infer the remaining types. However, in general, the effectiveness of static type inference is limited and depends on the complexity of the program's structure and the initial type annotations. As a result, there is a strong motivation for new approaches that can advance the state of the art in statically predicting types in dynamically typed programs, and that do so with acceptable performance for use in interactive programming environments. Previous work has demonstrated the promise of probabilistic type inference using deep learning. In this paper, we advance past work by introducing a range of graph neural network (GNN) models that operate on a novel type flow graph (TFG) representation. The TFG represents an input program's elements as graph nodes connected with syntax edges and data flow edges, and our GNN models are trained to predict the type labels in the TFG for a given input program. We study different design choices for our GNN models for the 100 most common types in our evaluation dataset, and show that our best two GNN configurations for accuracy achieve a top-1 accuracy of 87.76% and 86.89% respectively, outperforming the two most closely related deep learning type inference approaches from past work -- DeepTyper with a top-1 accuracy of 84.62% and LambdaNet with a top-1 accuracy of 79.45%. Further, the average inference throughputs of those two configurations are 353.8 and 1,303.9 files/second, compared to 186.7 files/second for DeepTyper and 1,050.3 files/second for LambdaNet.
Submitted 14 November, 2021; v1 submitted 13 September, 2020;
originally announced September 2020.
-
MISIM: A Neural Code Semantics Similarity System Using the Context-Aware Semantics Structure
Authors:
Fangke Ye,
Shengtian Zhou,
Anand Venkat,
Ryan Marcus,
Nesime Tatbul,
Jesmin Jahan Tithi,
Niranjan Hasabnis,
Paul Petersen,
Timothy Mattson,
Tim Kraska,
Pradeep Dubey,
Vivek Sarkar,
Justin Gottschlich
Abstract:
Code semantics similarity can be used for many tasks such as code recommendation, automated software defect correction, and clone detection. Yet, the accuracy of such systems has not yet reached a level of general-purpose reliability. To help address this, we present Machine Inferred Code Similarity (MISIM), a neural code semantics similarity system consisting of two core components: (i) MISIM uses a novel context-aware semantics structure, which was purpose-built to lift semantics from code syntax; (ii) MISIM uses an extensible neural code similarity scoring algorithm, which can be used for various neural network architectures with learned parameters. We compare MISIM to four state-of-the-art systems, including two additional hand-customized models, over 328K programs consisting of over 18 million lines of code. Our experiments show that MISIM has 8.08% better accuracy (using MAP@R) compared to the next best performing system.
Submitted 2 June, 2021; v1 submitted 5 June, 2020;
originally announced June 2020.
-
Vyasa: A High-Performance Vectorizing Compiler for Tensor Convolutions on the Xilinx AI Engine
Authors:
Prasanth Chatarasi,
Stephen Neuendorffer,
Samuel Bayliss,
Kees Vissers,
Vivek Sarkar
Abstract:
Xilinx's AI Engine is a recent industry example of energy-efficient vector processing that includes novel support for 2D SIMD datapaths and a shuffle interconnection network. The current approach to programming the AI Engine relies on a C/C++ API for vector intrinsics. While an advance over assembly-level programming, it requires the programmer to specify a number of low-level operations based on detailed knowledge of the hardware. To address these challenges, we introduce Vyasa, a new programming system that extends the Halide DSL compiler to automatically generate code for the AI Engine. We evaluated Vyasa on 36 CONV2D and 6 CONV3D workloads, and achieved geometric means of 7.6 and 23.3 MACs/cycle for 32-bit and 16-bit operands (which represent 95.9% and 72.8% of the peak performance respectively). For 4 of these workloads for which expert-written codes were available to us, Vyasa demonstrated a geometric mean performance improvement of 1.10x with 50x smaller code relative to the expert-written codes.
Submitted 1 June, 2020;
originally announced June 2020.
-
Context-Aware Parse Trees
Authors:
Fangke Ye,
Shengtian Zhou,
Anand Venkat,
Ryan Marcus,
Paul Petersen,
Jesmin Jahan Tithi,
Tim Mattson,
Tim Kraska,
Pradeep Dubey,
Vivek Sarkar,
Justin Gottschlich
Abstract:
The simplified parse tree (SPT) presented in Aroma, a state-of-the-art code recommendation system, is a tree-structured representation used to infer code semantics by capturing program structure rather than program syntax. This is a departure from the classical abstract syntax tree, which is principally driven by programming language syntax. While we believe a semantics-driven representation is desirable, the specifics of an SPT's construction can impact its performance. We analyze these nuances and present a new tree structure, heavily influenced by Aroma's SPT, called a context-aware parse tree (CAPT). CAPT enhances SPT by providing a richer level of semantic representation. Specifically, CAPT provides additional binding support for language-specific techniques for adding semantically-salient features, and language-agnostic techniques for removing syntactically-present but semantically-irrelevant features. Our research quantitatively demonstrates the value of our proposed semantically-salient features, enabling a specific CAPT configuration to be 39% more accurate than SPT across the 48,610 programs we analyzed.
Submitted 24 March, 2020;
originally announced March 2020.
-
Marvel: A Data-centric Compiler for DNN Operators on Spatial Accelerators
Authors:
Prasanth Chatarasi,
Hyoukjun Kwon,
Natesh Raina,
Saurabh Malik,
Vaisakh Haridas,
Angshuman Parashar,
Michael Pellauer,
Tushar Krishna,
Vivek Sarkar
Abstract:
The efficiency of a spatial DNN accelerator depends heavily on the ability of the compiler and its cost model to generate optimized mappings for various operators of DNN models onto the accelerator's compute and memory resources. However, existing cost models lack a formal boundary over the operators for precise and tractable analysis, which poses adaptability challenges for new DNN operators. To address this challenge, we leverage the recently introduced Maestro Data-Centric (MDC) notation. We develop a formal understanding of DNN operators whose mappings can be described in the MDC notation, because any mapping adhering to the notation is always analyzable by the MDC's cost model. Furthermore, we introduce a transformation for translating mappings into the MDC notation for exploring the mapping space.
Searching for the optimal mappings is challenging because of the large space of mappings, and this challenge is exacerbated by new operators and diverse accelerator configurations. To address this challenge, we propose a decoupled off-chip/on-chip approach that decomposes the mapping space into off-chip and on-chip subspaces, and first optimizes the off-chip subspace followed by the on-chip subspace. The motivation for this decomposition is to reduce the size of the search space dramatically and also to prioritize the optimization of off-chip data movement, which is 2-3 orders of magnitude more compared to the on-chip data movement. We implemented our approach in a tool called Marvel, and another major benefit of our approach is that it is applicable to any DNN operator conformable with the MDC notation.
Submitted 11 June, 2020; v1 submitted 18 February, 2020;
originally announced February 2020.
-
Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A Data-Centric Approach Using MAESTRO
Authors:
Hyoukjun Kwon,
Prasanth Chatarasi,
Michael Pellauer,
Angshuman Parashar,
Vivek Sarkar,
Tushar Krishna
Abstract:
The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, and they directly impact the performance and energy efficiency of DNN accelerator designs. An accelerator microarchitecture dictates the dataflow(s) that can be employed to execute a layer or network. Selecting an optimal dataflow for a layer shape can have a large impact on utilization and energy efficiency, but there is a lack of understanding on the choices and consequences of dataflows, and of tools and methodologies to help architects explore the co-optimization design space. In this work, we first introduce a set of data-centric directives to concisely specify the space of DNN dataflows in a compiler-friendly form. We then show how these directives can be analyzed to infer various forms of reuse and to exploit them using hardware capabilities. We codify this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow including execution time and energy efficiency for a DNN model and hardware configuration. We demonstrate the use of MAESTRO to drive a hardware design space exploration (DSE) experiment, which searches across 480M designs to identify 2.5M valid designs at an average rate of 0.17M designs per second, including Pareto-optimal throughput- and energy-optimized design points.
Submitted 11 May, 2020; v1 submitted 4 May, 2018;
originally announced May 2018.
-
A survey of sparse matrix-vector multiplication performance on large matrices
Authors:
Max Grossman,
Christopher Thiele,
Mauricio Araya-Polo,
Florian Frank,
Faruk O. Alpak,
Vivek Sarkar
Abstract:
We contribute a third-party survey of sparse matrix-vector (SpMV) product performance on industrial-strength, large matrices using: (1) The SpMV implementations in Intel MKL, the Trilinos project (Tpetra subpackage), the CUSPARSE library, and the CUSP library, each running on modern architectures. (2) NVIDIA GPUs and Intel multi-core CPUs (supported by each software package). (3) The CSR, BSR, COO, HYB, and ELL matrix formats (supported by each software package).
Submitted 1 August, 2016;
originally announced August 2016.
-
Formalization of Phase Ordering
Authors:
Tiago Cogumbreiro,
Jun Shirako,
Vivek Sarkar
Abstract:
Phasers are an interesting synchronization mechanism that generalizes many collective synchronization patterns seen in parallel programming languages, including barriers, clocks, and point-to-point synchronization using latches or semaphores. This work characterizes scheduling constraints on phaser operations, by relating the execution state of two tasks that operate on the same phaser. We propose a formalization of Habanero phasers, May-Happen-In-Parallel, and Happens-Before relations for phaser operations, and show that these relations conform with the semantics. Our formalization and proofs are fully mechanized using the Coq proof assistant, and are available online.
Submitted 19 June, 2016;
originally announced June 2016.
-
Finding Tizen security bugs through whole-system static analysis
Authors:
Daniel Song,
Jisheng Zhao,
Michael Burke,
Dragoş Sbîrlea,
Dan Wallach,
Vivek Sarkar
Abstract:
Tizen is a new Linux-based open source platform for consumer devices including smartphones, televisions, vehicles, and wearables. While Tizen provides kernel-level mandatory policy enforcement, it has a large collection of libraries, implemented in a mix of C and C++, which make their own security checks. In this research, we describe the design and engineering of a static analysis engine which drives a full information flow analysis for apps and a control flow analysis for the full library stack. We implemented these static analyses as extensions to LLVM, requiring us to improve LLVM's native analysis features to get greater precision and scalability, including knotty issues like the coexistence of C++ inheritance with C function pointer use. With our tools, we found several unexpected behaviors in the Tizen system, including paths through the system libraries that did not have inline security checks. We show how our tools can help the Tizen app store to verify important app properties as well as helping the Tizen development process avoid the accidental introduction of subtle vulnerabilities.
Submitted 22 April, 2015;
originally announced April 2015.
-
ADHA: Automatic Data layout framework for Heterogeneous Architectures
Authors:
Deepak Majeti,
Kuldeep S. Meel,
Rajkishore Barik,
Vivek Sarkar
Abstract:
Data layouts play a crucial role in determining the performance of a given application running on a given architecture. Existing parallel programming frameworks for both multicore and heterogeneous systems leave the onus of selecting a data layout to the programmer. Therefore, shifting the burden of data layout selection to optimizing compilers can greatly enhance programmer productivity and application performance. In this work, we introduce ADHA: a two-level hierarchical formulation of the data layout problem for modern heterogeneous architectures. We have created a reference implementation of ADHA in the Heterogeneous Habanero-C (H2C) parallel programming system. ADHA shows significant performance benefits of up to 6.92× compared to manually specified layouts for two benchmark programs running on a CPU+GPU heterogeneous platform.
Submitted 17 July, 2014;
originally announced July 2014.