On Performance Analysis of Graphcore IPUs: Analyzing Squared and Skewed Matrix Multiplication
Authors:
S. -Kazem Shekofteh,
Christian Alles,
Nils Kochendörfer,
Holger Fröning
Abstract:
In recent decades, High Performance Computing (HPC) has undergone significant enhancements, particularly in the realm of hardware platforms, aimed at delivering increased processing power while keeping power consumption within reasonable limits. The Intelligence Processing Unit (IPU) represents an entirely novel category of massively parallel processors, meticulously designed to expedite parallel computations through a multitude of processing cores and on-chip memory components interconnected via high-speed fabrics. While IPUs are primarily tailored for machine learning applications and come equipped with several libraries for the seamless implementation of neural networks, they also retain the capability to execute traditional parallel programs like matrix multiplication. However, it is essential to acknowledge that there are certain considerations and limitations when utilizing IPUs for such tasks. This paper embarks on an extensive analytical examination of matrix multiplications (MM) executed on an IPU, focusing on aspects such as execution efficiency and memory usage. Additionally, a comparative analysis is conducted, pitting the IPU against a GPU. Our findings indicate that IPUs can outperform modern GPUs, especially in handling the consistently challenging skewed matrix multiplication operations. For a more comprehensive understanding, we scrutinize various aspect ratios of matrices for these operations on an IPU and a Turing-class GPU (RTX 2080TI), revealing that the IPU consistently delivers more robust performance when dealing with skewed matrices compared to a GPU.
Submitted 30 September, 2023;
originally announced October 2023.
Reducing Memory Requirements for the IPU using Butterfly Factorizations
Authors:
S. -Kazem Shekofteh,
Christian Alles,
Holger Fröning
Abstract:
High Performance Computing (HPC) has benefited from various improvements during the last decades, especially in terms of hardware platforms that provide more processing power while maintaining power consumption at a reasonable level. The Intelligence Processing Unit (IPU) is a new type of massively parallel processor, designed to speed up parallel computations with a huge number of processing cores and on-chip memory components connected with high-speed fabrics. IPUs mainly target machine learning applications; however, due to the architectural differences between GPUs and IPUs, especially the significantly smaller memory capacity of an IPU, methods for reducing model size by sparsification have to be considered. Butterfly factorizations are well-known replacements for fully-connected and convolutional layers. In this paper, we examine how butterfly structures can be implemented on an IPU and study their behavior and performance compared to a GPU. Experimental results indicate that these methods can provide a 98.5% compression ratio to decrease the immense need for memory, and that the IPU implementation can benefit from 1.3x and 1.6x performance improvements for butterfly and pixelated butterfly, respectively. We also reach a 1.62x training time speedup on a real-world dataset such as CIFAR10.
Submitted 16 September, 2023;
originally announced September 2023.