
doi 10.1145/2967938.2967945

Fusion of Array Operations at Runtime

Authors: Mads R. B. Kristensen, Simon A. F. Lund, Troels Blum, James Avery

Abstract: We address the problem of fusing array operations based on criteria such as shape compatibility, data reusability, and communication. We formulate the problem as a graph partition problem that is general enough to handle loop fusion, combinator fusion, and other types of subroutines.

Submitted 21 January, 2016; v1 submitted 20 January, 2016; originally announced January 2016.

Comments: Preprint

Journal ref: Proceeding PACT '16 Proceedings of the 2016 International Conference on Parallel Architectures and Compilation Pages 71-85

arXiv:1210.7774 [pdf, other]

cphVB: A System for Automated Runtime Optimization and Parallelization of Vectorized Applications

Authors: Mads Ruben Burgdorff Kristensen, Simon Andreas Frimann Lund, Troels Blum, Brian Vinter

Abstract: Modern processor architectures, in addition to having still more cores, also require still more consideration of memory layout in order to run at full capacity. The usefulness of most languages is diminishing as their abstractions, structures, or objects are hard to map onto modern processor architectures efficiently. The work in this paper introduces a new abstract machine framework, cphVB, that enables vector-oriented high-level programming languages to map onto a broad range of architectures efficiently. The idea is to close the gap between high-level languages and hardware-optimized low-level implementations. By translating high-level vector operations into an intermediate vector bytecode, cphVB enables specialized vector engines to efficiently execute the vector operations. The primary success parameters are to maintain a complete abstraction from low-level details and to provide efficient code execution across different modern processors. We evaluate the presented design through a setup that targets multi-core CPU architectures. We evaluate the performance of the implementation using Python implementations of well-known algorithms: a Jacobi solver, a kNN search, a shallow water simulation, and a synthetic stencil simulation. All demonstrate good performance.

Submitted 25 March, 2013; v1 submitted 26 October, 2012; originally announced October 2012.

Journal ref: Proceedings of The 11th Python In Science Conference (SciPy 2012)

arXiv:1201.3804 [pdf, ps, other]

doi 10.1109/HPCC.2012.80

Managing Communication Latency-Hiding at Runtime for Parallel Programming Languages and Libraries

Authors: Mads Ruben Burgdorff Kristensen, Brian Vinter

Abstract: This work introduces a runtime model for managing communication with support for latency-hiding. The model enables non-computer science researchers to exploit communication latency-hiding techniques seamlessly. For compiled languages, it is often possible to create efficient schedules for communication, but this is not the case for interpreted languages. By maintaining data dependencies between scheduled operations, it is possible to aggressively initiate communication and lazily evaluate tasks to allow maximal time for the communication to finish before entering a wait state. We implement a heuristic of this model in DistNumPy, an auto-parallelizing version of numerical Python that allows sequential NumPy programs to run on distributed memory architectures. Furthermore, we present performance comparisons for eight benchmarks with and without automatic latency-hiding. The results show that our model reduces the time spent waiting for communication by as much as 27 times, from a maximum of 54% to only 2% of the total execution time, in a stencil application.

Submitted 18 January, 2012; originally announced January 2012.

Comments: PREPRINT

Journal ref: Proceeding HPCC '12 Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems

Showing 1–3 of 3 results for author: Kristensen, M R B