Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning

Huang, Guyue; Li, Haoran; Qin, Minghai; Sun, Fei; Ding, Yufei; Xie, Yuan

arXiv:2203.05016 (cs)

[Submitted on 9 Mar 2022 (v1), last revised 12 Mar 2022 (this version, v2)]

Abstract: Weight pruning in deep neural networks (DNNs) can reduce storage and computation cost, but it struggles to deliver practical speedups in model inference time. Tensor-cores can significantly boost the throughput of GPUs on dense computation, but exploiting tensor-cores for sparse DNNs is very challenging. Compared to existing CUDA-cores, tensor-cores require higher data reuse and matrix-shaped instruction granularity, both of which are difficult to obtain from sparse DNN kernels. Existing pruning approaches fail to balance the demands of accuracy and efficiency: random sparsity preserves model quality well but prohibits tensor-core acceleration, while highly structured block-wise sparsity can exploit tensor-cores but suffers from severe accuracy loss.
In this work, we propose a novel sparse pattern, Shuffled Block-wise sparsity (Shfl-BW), designed to efficiently utilize tensor-cores while minimizing the constraints on the weight structure. Our insight is that row- and column-wise permutation provides abundant flexibility for the weight structure while introducing negligible overhead with our GPU kernel designs. We optimize the GPU kernels for Shfl-BW in linear and convolution layers. Evaluations show that our techniques achieve state-of-the-art speed-accuracy trade-offs on GPUs. For example, with small accuracy loss, we can accelerate the computation-intensive layers of Transformer by 1.81×, 4.18×, and 1.90× on NVIDIA V100, T4, and A100 GPUs, respectively, at 75% sparsity.
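
To make the intuition concrete, here is a minimal pure-Python sketch of why a permutation can relax block-wise sparsity. It is an illustration only, not the paper's algorithm or its GPU kernels: the abstract does not state the block shape, the pruning criterion, or how the permutation is found. The function names (`shuffled_block_mask`, `retained_magnitude`), the magnitude-sum block score, the 2×2 block size, the toy matrix, and the hand-picked row permutation are all assumptions, and only rows are permuted here.

```python
def shuffled_block_mask(weights, row_perm, block_rows, block_cols, keep_ratio):
    """Return a 0/1 mask (in original row order) that is block-sparse
    once the rows are reordered by row_perm. Hypothetical helper."""
    n_rows, n_cols = len(weights), len(weights[0])
    assert sorted(row_perm) == list(range(n_rows)), "row_perm must be a permutation"
    assert n_rows % block_rows == 0 and n_cols % block_cols == 0

    # Reorder rows so that rows with similar non-zero columns share a block.
    permuted = [weights[r] for r in row_perm]

    # Score each block by the sum of absolute weights it contains (assumed criterion).
    scores = {}
    for br in range(0, n_rows, block_rows):
        for bc in range(0, n_cols, block_cols):
            scores[(br, bc)] = sum(
                abs(permuted[i][j])
                for i in range(br, br + block_rows)
                for j in range(bc, bc + block_cols)
            )

    # Keep the highest-scoring blocks and prune the rest entirely.
    n_keep = round(keep_ratio * len(scores))
    kept = sorted(scores, key=scores.get, reverse=True)[:n_keep]

    mask = [[0] * n_cols for _ in range(n_rows)]
    for br, bc in kept:
        for i in range(br, br + block_rows):
            for j in range(bc, bc + block_cols):
                mask[row_perm[i]][j] = 1  # map back to the original row index
    return mask


def retained_magnitude(weights, mask):
    """Sum of absolute weights that survive the mask."""
    return sum(abs(w) * m for wr, mr in zip(weights, mask) for w, m in zip(wr, mr))


# Toy matrix: rows 0 and 2 are heavy in columns 0-1, rows 1 and 3 in columns 2-3.
W = [
    [9, 8, 0, 1],
    [0, 1, 7, 9],
    [8, 9, 1, 0],
    [1, 0, 9, 8],
]

plain = shuffled_block_mask(W, [0, 1, 2, 3], 2, 2, keep_ratio=0.5)
shuffled = shuffled_block_mask(W, [0, 2, 1, 3], 2, 2, keep_ratio=0.5)

print(retained_magnitude(W, plain))     # 36 of 71 kept without shuffling
print(retained_magnitude(W, shuffled))  # 67 of 71 kept with rows 1 and 2 swapped
```

In this toy case both masks keep half of the 2×2 blocks, but grouping similar rows first lets the kept blocks cover most of the large weights. That is the kind of flexibility the abstract attributes to permutation, while the computation still operates on dense blocks in the permuted order.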

Comments:	To appear in the Design Automation Conference (DAC), July 2022
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2203.05016 [cs.DC]
	(or arXiv:2203.05016v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2203.05016

Submission history

From: Guyue Huang
[v1] Wed, 9 Mar 2022 19:49:06 UTC (1,992 KB)
[v2] Sat, 12 Mar 2022 03:39:11 UTC (1,992 KB)
