Search | arXiv e-print repository

arXiv:2404.19331 [pdf, other]

Fusing Depthwise and Pointwise Convolutions for Efficient Inference on GPUs

Authors: Fareed Qararyah, Muhammad Waqar Azhar, Mohammad Ali Maleki, Pedro Trancoso

Abstract: Depthwise and pointwise convolutions have fewer parameters and perform fewer operations than standard convolutions. As a result, they have become increasingly used in various compact DNNs, including convolutional neural networks (CNNs) and vision transformers (ViTs). However, they have a lower compute-to-memory-access ratio than standard convolutions, making their memory accesses often the perform… ▽ More Depthwise and pointwise convolutions have fewer parameters and perform fewer operations than standard convolutions. As a result, they have become increasingly used in various compact DNNs, including convolutional neural networks (CNNs) and vision transformers (ViTs). However, they have a lower compute-to-memory-access ratio than standard convolutions, making their memory accesses often the performance bottleneck. This paper explores fusing depthwise and pointwise convolutions to overcome the memory access bottleneck. The focus is on fusing these operators on GPUs. The prior art on GPU-based fusion suffers from one or more of the following: (1) fusing either a convolution with an element-wise or multiple non-convolutional operators, (2) not explicitly optimizing for memory accesses, (3) not supporting depthwise convolutions. This paper proposes Fused Convolutional Modules (FCMs), a set of novel fused depthwise and pointwise GPU kernels. FCMs significantly reduce pointwise and depthwise convolutions memory accesses, improving execution time and energy efficiency. To evaluate the trade-offs associated with fusion and determine which convolutions are beneficial to fuse and the optimal FCM parameters, we propose FusePlanner. FusePlanner consists of cost models to estimate the memory accesses of depthwise, pointwise, and FCM kernels given GPU characteristics. Our experiments on three GPUs using representative CNNs and ViTs demonstrate that FCMs save up to 83% of the memory accesses and achieve speedups of up to 3.7x compared to cuDNN. Complete model implementations of various CNNs using our modules outperform TVMs' achieving speedups of up to 1.8x and saving up to two-thirds of the energy. △ Less

Submitted 30 April, 2024; originally announced April 2024.

arXiv:2305.05388 [pdf, other]

doi 10.1145/3587135.3592175

VEDLIoT -- Next generation accelerated AIoT systems and applications

Authors: Kevin Mika, René Griessl, Nils Kucza, Florian Porrmann, Martin Kaiser, Lennart Tigges, Jens Hagemeyer, Pedro Trancoso, Muhammad Waqar Azhar, Fareed Qararyah, Stavroula Zouzoula, Jämes Ménétrey, Marcelo Pasin, Pascal Felber, Carina Marcus, Oliver Brunnegard, Olof Eriksson, Hans Salomonsson, Daniel Ödman, Andreas Ask, Antonio Casimiro, Alysson Bessani, Tiago Carvalho, Karol Gugala, Piotr Zierhoffer , et al. (7 additional authors not shown)

Abstract: The VEDLIoT project aims to develop energy-efficient Deep Learning methodologies for distributed Artificial Intelligence of Things (AIoT) applications. During our project, we propose a holistic approach that focuses on optimizing algorithms while addressing safety and security challenges inherent to AIoT systems. The foundation of this approach lies in a modular and scalable cognitive IoT hardware… ▽ More The VEDLIoT project aims to develop energy-efficient Deep Learning methodologies for distributed Artificial Intelligence of Things (AIoT) applications. During our project, we propose a holistic approach that focuses on optimizing algorithms while addressing safety and security challenges inherent to AIoT systems. The foundation of this approach lies in a modular and scalable cognitive IoT hardware platform, which leverages microserver technology to enable users to configure the hardware to meet the requirements of a diverse array of applications. Heterogeneous computing is used to boost performance and energy efficiency. In addition, the full spectrum of hardware accelerators is integrated, providing specialized ASICs as well as FPGAs for reconfigurable computing. The project's contributions span across trusted computing, remote attestation, and secure execution environments, with the ultimate goal of facilitating the design and deployment of robust and efficient AIoT systems. The overall architecture is validated on use-cases ranging from Smart Home to Automotive and Industrial IoT appliances. Ten additional use cases are integrated via an open call, broadening the range of application areas. △ Less

Submitted 9 May, 2023; originally announced May 2023.

Comments: This publication incorporates results from the VEDLIoT project, which received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 957197. CF'23: 20th ACM International Conference on Computing Frontiers, May 2023, Bologna, Italy

Journal ref: CF'23: 20th ACM International Conference on Computing Frontiers, May 2023, Bologna, Italy

arXiv:2008.08636 [pdf, other]

doi 10.1016/j.parco.2021.102792

A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs

Authors: Fareed Qararyah, Mohamed Wahib, Doğa Dikbayır, Mehmet Esat Belviranli, Didem Unat

Abstract: Many state-of-the-art Deep Neural Networks (DNNs) have substantial memory requirements. Limited device memory becomes a bottleneck when training those models. We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for DNNs that are represented as computational graphs. ParDNN decides a placement of DNN's underlying computational graph operations across multiple devices so… ▽ More Many state-of-the-art Deep Neural Networks (DNNs) have substantial memory requirements. Limited device memory becomes a bottleneck when training those models. We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for DNNs that are represented as computational graphs. ParDNN decides a placement of DNN's underlying computational graph operations across multiple devices so that the devices' memory constraints are met and the training time is minimized. ParDNN is completely independent of the deep learning aspects of a DNN. It requires no modification neither at the model nor at the systems level implementation of its operation kernels. ParDNN partitions DNNs having billions of parameters and hundreds of thousands of operations in seconds to few minutes. Our experiments with TensorFlow on 16 GPUs demonstrate efficient training of 5 very large models while achieving superlinear scaling for both the batch size and training throughput. ParDNN either outperforms or qualitatively improves upon the related work. △ Less

Submitted 5 May, 2021; v1 submitted 19 August, 2020; originally announced August 2020.

Showing 1–3 of 3 results for author: Qararyah, F