ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code

Authors: Kazuaki Matsumura, Simon Garcia De Gonzalo, Antonio J. Peña

Abstract: Automatic code optimization is a complex process that typically involves the application of multiple discrete algorithms that modify the program structure irreversibly. However, the design of these algorithms is often monolithic, and they require repetitive implementation to perform similar analyses due to the lack of cooperation. To address this issue, modern optimization techniques, such as equality saturation, allow for exhaustive term rewriting at various levels of inputs, thereby simplifying compiler design. In this paper, we propose equality saturation to optimize sequential codes utilized in directive-based programming for GPUs. Our approach simultaneously realizes less computation, less memory access, and high memory throughput. Our fully-automated framework constructs single-assignment forms from inputs to be entirely rewritten while keeping dependencies and extracts optimal cases. Through practical benchmarks, we demonstrate a significant performance improvement on several compilers. Furthermore, we highlight the advantages of computational reordering and emphasize the significance of memory-access order for modern GPUs.

Submitted 23 June, 2023; v1 submitted 22 June, 2023; originally announced June 2023.

arXiv:2301.11389 [pdf, other]

doi 10.1145/3578360.3580253

A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code

Authors: Kazuaki Matsumura, Simon Garcia De Gonzalo, Antonio J. Peña

Abstract: Various kinds of applications take advantage of GPUs through automation tools that attempt to automatically exploit the available performance of the GPU's parallel architecture. Directive-based programming models, such as OpenACC, are one such method that easily enables parallel computing by just adhering code annotations to code loops. Such abstract models, however, often prevent programmers from making additional low-level optimizations to take advantage of the advanced architectural features of GPUs because the actual generated computation is hidden from the application developer. This paper describes and implements a novel flexible optimization technique that operates by inserting a code emulator phase to the tail-end of the compilation pipeline. Our tool emulates the generated code using symbolic analysis by substituting dynamic information and thus allowing for further low-level code optimizations to be applied. We implement our tool to support both CUDA and OpenACC directives as the frontend of the compilation pipeline, thus enabling low-level GPU optimizations for OpenACC that were not previously possible. We demonstrate the capabilities of our tool by automating warp-level shuffle instructions that are difficult to use by even advanced GPU programmers. Lastly, evaluating our tool with a benchmark suite and complex application code, we provide a detailed study to assess the benefits of shuffle instructions across four generations of GPU architectures.

Submitted 26 January, 2023; originally announced January 2023.

Comments: To appear in: Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction (CC '23)

arXiv:2110.14340 [pdf, other]

doi 10.1109/HiPC53243.2021.00032

JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization

Authors: Kazuaki Matsumura, Simon Garcia De Gonzalo, Antonio J. Peña

Abstract: The rapid development in computing technology has paved the way for directive-based programming models towards a principal role in maintaining software portability of performance-critical applications. Efforts on such models involve a least engineering cost for enabling computational acceleration on multiple architectures while programmers are only required to add meta information upon sequential code. Optimizations for obtaining the best possible efficiency, however, are often challenging. The insertions of directives by the programmer can lead to side-effects that limit the available compiler optimization possible, which could result in performance degradation. This is exacerbated when targeting multi-GPU systems, as pragmas do not automatically adapt to such systems, and require expensive and time consuming code adjustment by programmers. This paper introduces JACC, an OpenACC runtime framework which enables the dynamic extension of OpenACC programs by serving as a transparent layer between the program and the compiler. We add a versatile code-translation method for multi-device utilization by which manually-optimized applications can be distributed automatically while keeping original code structure and parallelism. We show in some cases nearly linear scaling on the part of kernel execution with the NVIDIA V100 GPUs. While adaptively using multi-GPUs, the resulting performance improvements amortize the latency of GPU-to-GPU communications.

Submitted 27 April, 2022; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: Extended version of a paper to appear in: Proceedings of the 28th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), December 17-18, 2021

arXiv:2106.12485 [pdf, other]

doi 10.1007/978-3-030-85665-6_30

Particle-In-Cell Simulation using Asynchronous Tasking

Authors: Nicolas Guidotti, Pedro Ceyrat, João Barreto, José Monteiro, Rodrigo Rodrigues, Ricardo Fonseca, Xavier Martorell, Antonio J. Peña

Abstract: Recently, task-based programming models have emerged as a prominent alternative among shared-memory parallel programming paradigms. Inherently asynchronous, these models provide native support for dynamic load balancing and incorporate data flow concepts to selectively synchronize the tasks. However, tasking models are yet to be widely adopted by the HPC community and their effective advantages when applied to non-trivial, real-world HPC applications are still not well comprehended. In this paper, we study the parallelization of a production electromagnetic particle-in-cell (EM-PIC) code for kinetic plasma simulations exploring different strategies using asynchronous task-based models. Our fully asynchronous implementation not only significantly outperforms a conventional, synchronous approach but also achieves near perfect scaling for 48 cores.

Submitted 29 August, 2021; v1 submitted 23 June, 2021; originally announced June 2021.

Comments: Published on the 27th European Conference on Parallel and Distributed Computing (Euro-Par 2021)

Journal ref: Euro-Par 2021: Parallel Processing. Lecture Notes in Computer Science, vol 12820, pp. 482-498

arXiv:2103.16234 [pdf, ps, other]

doi 10.1007/s10586-021-03494-y

cuConv: A CUDA Implementation of Convolution for CNN Inference

Authors: Marc Jordà, Pedro Valero-Lara, Antonio J. Peña

Abstract: Convolutions are the core operation of deep learning applications based on Convolutional Neural Networks (CNNs). Current GPU architectures are highly efficient for training and deploying deep CNNs, and hence, these are largely used in production for this purpose. State-of-the-art implementations, however, present a lack of efficiency for some commonly used network configurations. In this paper we propose a GPU-based implementation of the convolution operation for CNN inference that favors coalesced accesses, without requiring prior data transformations. Our experiments demonstrate that our proposal yields notable performance improvements in a range of common CNN forward propagation convolution configurations, with speedups of up to 2.29x with respect to the best implementation of convolution in cuDNN, hence covering a relevant region in currently existing approaches.

Submitted 30 March, 2021; originally announced March 2021.

Comments: This work has been submitted to the Springer for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Journal ref: Cluster Comput (2022)

arXiv:2103.16139 [pdf, other]

doi 10.1109/TC.2021.3076123

Enabling Homomorphically Encrypted Inference for Large DNN Models

Authors: Guillermo Lloret-Talavera, Marc Jorda, Harald Servat, Fabian Boemer, Chetan Chauhan, Shigeki Tomishima, Nilesh N. Shah, Antonio J. Peña

Abstract: The proliferation of machine learning services in the last few years has raised data privacy concerns. Homomorphic encryption (HE) enables inference using encrypted data but it incurs 100x-10,000x memory and runtime overheads. Secure deep neural network (DNN) inference using HE is currently limited by computing and memory resources, with frameworks requiring hundreds of gigabytes of DRAM to evaluate small models. To overcome these limitations, in this paper we explore the feasibility of leveraging hybrid memory systems comprised of DRAM and persistent memory. In particular, we explore the recently-released Intel Optane PMem technology and the Intel HE-Transformer nGraph to run large neural networks such as MobileNetV2 (in its largest variant) and ResNet-50 for the first time in the literature. We present an in-depth analysis of the efficiency of the executions with different hardware and software configurations. Our results conclude that DNN inference using HE incurs on friendly access patterns for this memory configuration, yielding efficient executions.

Submitted 29 April, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

Comments: Manuscript accepted for publication in IEEE Transactions on Computers

arXiv:2005.06332 [pdf, other]

doi 10.1016/j.parco.2019.03.006

MPI+OpenMP Tasking Scalability for Multi-Morphology Simulations of the Human Brain

Authors: Pedro Valero-Lara, Raül Sirvent, Antonio J. Peña, Jesús Labarta

Abstract: The simulation of the behavior of the human brain is one of the most ambitious challenges today with a non-end of important applications. We can find many different initiatives in the USA, Europe and Japan which attempt to achieve such a challenging target. In this work, we focus on the most important European initiative (the Human Brain Project) and on one of the models developed in this project. This tool simulates the spikes triggered in a neural network by computing the voltage capacitance on the neurons' morphology, being one of the most precise simulators today. In the present work, we have evaluated the use of MPI+OpenMP tasking on top of this framework. We prove that this approach is able to achieve a good scaling even when computing a relatively low workload (number of neurons) per node. One of our targets consists of achieving not only a highly scalable implementation, but also to develop a tool with a high degree of abstraction without losing control and performance by using MPI+OpenMP tasking. The main motivation of this work is the evaluation of this cutting-edge simulation on multi-morphology neural networks. The simulation of a high number of neurons, which are completely different among them, is an important challenge. In fact, in the multi-morphology simulations, we find an important unbalancing between the nodes, mainly due to the differences in the neurons, which causes an important under-utilization of the available resources. In this work, the authors present and evaluate mechanisms to deal with this and reduce the time of this kind of simulations considerably.

Submitted 13 May, 2020; originally announced May 2020.

Journal ref: P. Valero-Lara, R. Sirvent, A. J. Peña, and J. Labarta. "MPI+OpenMP tasking scalability for multi-morphology simulations of the human brain", Parallel Computing, Elsevier, vol. 84, pp. 50-61, May 2019

arXiv:2005.05910 [pdf, other]

doi 10.1016/j.parco.2018.07.006

DMR API: Improving cluster productivity by turning applications into malleable

Authors: Sergio Iserte, Rafael Mayo, Enrique S. Quintana-Orti, Vicenc Beltran, Antonio J. Peña

Abstract: Adaptive workloads can change on-the-fly the configuration of their jobs, in terms of number of processes. In order to carry out these job reconfigurations, we have designed a methodology which enables a job to communicate with the resource manager and, through the runtime, to change its number of MPI ranks. The collaboration between both the workload manager (aware of the queue of jobs and the resource allocation) and the parallel runtime (able to transparently handle the processes and the program data) is crucial for our throughput-aware malleability methodology. Hence, when a job triggers a reconfiguration, the resource manager will check the cluster status and return an action: an expansion, if there are spare resources; a shrink, if queued jobs can be initiated; or none, if no change can improve the global productivity. In this paper, we describe the internals of our framework and how it is capable of reducing the global workload completion time along with providing a smarter usage of the underlying resources. For this purpose, we present a thorough study of the adaptive workloads processing by showing the detailed behavior of our framework in representative experiments and the low overhead that our reconfiguration involves.

Submitted 28 May, 2020; v1 submitted 12 May, 2020; originally announced May 2020.

Journal ref: S. Iserte, R. Mayo, E. S. Quintana-Orti, V. Beltran, and A. J. Peña, "DMR API: Improving cluster productivity by turning applications into malleable", Parallel Computing, Elsevier, vol. 78, pp. 54-66, Oct. 2018

arXiv:2005.05872 [pdf, other]

doi 10.1016/j.parco.2018.06.007

Understanding Memory Access Patterns Using the BSC Performance Tools

Authors: Harald Servat, Jesús Labarta, Hans-Christian Hoppe, Judit Giménez, Antonio J. Peña

Abstract: The growing gap between processor and memory speeds results in complex memory hierarchies as processors evolve to mitigate such divergence by taking advantage of the locality of reference. In this direction, the BSC performance analysis tools have been recently extended to provide insight relative to the application memory accesses depicting their temporal and spatial characteristics, correlating with the source-code and the achieved performance simultaneously. These extensions rely on the Precise Event-Based Sampling (PEBS) mechanism available in recent Intel processors to capture information regarding the application memory accesses. The sampled information is later combined with the Folding technique to represent a detailed temporal evolution of the memory accesses and in conjunction with the achieved performance and the source-code counterpart. The results obtained from the combination of these tools help not only application developers but also processor architects to understand better how the application behaves and how the system performs. In this paper, we describe a tighter integration of the sampling mechanism into the monitoring package. We also demonstrate the value of the complete workflow by exploring already optimized state-of-the-art benchmarks, providing detailed insight of their memory access behavior. We have taken advantage of this insight to apply small modifications that improve the applications' performance.

Submitted 28 May, 2020; v1 submitted 12 May, 2020; originally announced May 2020.

Journal ref: H. Servat, J. Labarta, H. C. Hoppe, J. Giménez, and A. J. Peña, "Understanding memory access patterns using the BSC performance tools", Parallel Computing, Elsevier, vol. 78, pp. 1-14, Oct. 2018

arXiv:1901.03271 [pdf, ps, other]

doi 10.1016/j.parco.2018.12.008

Integrating Blocking and Non-Blocking MPI Primitives with Task-Based Programming Models

Authors: Kevin Sala, Xavier Teruel, Josep M. Perez, Antonio J. Peña, Vicenç Beltran, Jesus Labarta

Abstract: In this paper we present the Task-Aware MPI library (TAMPI) that integrates both blocking and non-blocking MPI primitives with task-based programming models. The TAMPI library leverages two new runtime APIs to improve both programmability and performance of hybrid applications. The first API allows to pause and resume the execution of a task depending on external events. This API is used to improve the interoperability between blocking MPI communication primitives and tasks. When an MPI operation executed inside a task blocks, the task running is paused so that the runtime system can schedule a new task on the core that became idle. Once the blocked MPI operation is completed, the paused task is put again on the runtime system's ready queue, so eventually it will be scheduled again and its execution will be resumed. The second API defers the release of dependencies associated with a task completion until some external events are fulfilled. This API is composed only of two functions, one to bind external events to a running task and another function to notify about the completion of external events previously bound. TAMPI leverages this API to bind non-blocking MPI operations with tasks, deferring the release of their task dependencies until both task execution and all its bound MPI operations are completed. Our experiments reveal that the enhanced features of TAMPI not only simplify the development of hybrid MPI+OpenMP applications that use blocking or non-blocking MPI primitives but they also naturally overlap computation and communication phases, which improves application performance and scalability by removing artificial dependencies across communication tasks.

Submitted 29 May, 2020; v1 submitted 10 January, 2019; originally announced January 2019.

Comments: European Commission's projects: INTERTWinE (EC-H2020-671602), Marie Skłodowska-Curie (EC-H2020-749516). Postprint submitted to the Parallel Computing Journal (Elsevier). Figures from section 7.2 updated, typos corrected

Journal ref: Parallel Computing, 85, 153-166 (2019)

arXiv:1810.04150 [pdf, other]

Exploring the Vision Processing Unit as Co-processor for Inference

Authors: Sergio Rivas-Gomez, Antonio J. Peña, David Moloney, Erwin Laure, Stefano Markidis

Abstract: The success of the exascale supercomputer is largely debated to remain dependent on novel breakthroughs in technology that effectively reduce the power consumption and thermal dissipation requirements. In this work, we consider the integration of co-processors in high-performance computing (HPC) to enable low-power, seamless computation offloading of certain operations. In particular, we explore the so-called Vision Processing Unit (VPU), a highly-parallel vector processor with a power envelope of less than 1W. We evaluate this chip during inference using a pre-trained GoogLeNet convolutional network model and a large image dataset from the ImageNet ILSVRC challenge. Preliminary results indicate that a multi-VPU configuration provides similar performance compared to reference CPU and GPU implementations, while reducing the thermal-design power (TDP) up to 8x in comparison.

Submitted 9 October, 2018; originally announced October 2018.

Showing 1–11 of 11 results for author: Peña, A J