-
Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics
Authors:
George Michelogiannakis,
Yehia Arafa,
Brandon Cook,
Liang Yuan Dai,
Abdel Hameed Badawy,
Madeleine Glick,
Yuyang Wang,
Keren Bergman,
John Shalf
Abstract:
The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Resource disaggregation allows compute and memory resources to be allocated individually as required to each workload. However, it is unclear how to efficiently realize this capability and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonics can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the bit error rate (BER) and high escape bandwidth requirements of all chip types in modern HPC racks. Our photonic-based disaggregated rack provides an average application speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared to a similar system that instead uses modern electronic switches for disaggregation. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4x fewer memory modules and 2x fewer NICs than a non-disaggregated baseline.
Submitted 17 July, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
-
Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis
Authors:
Hamdy Abdelkhalik,
Yehia Arafa,
Nandakishore Santhi,
Abdel-Hameed Badawy
Abstract:
Graphics processing units (GPUs) are now considered the leading hardware to accelerate general-purpose workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating the microarchitecture features of various GPU architectures beyond what vendors reveal. This line of work is necessary to understand the hardware better and build more efficient workloads and applications. Many works have studied the recent Nvidia architectures, such as Volta and Turing, comparing them to their successor, Ampere. However, some microarchitecture features, such as the clock cycles for the different instructions, have not been extensively studied for the Ampere architecture. In this paper, we study the clock cycles per instruction with various data types found in the instruction-set architecture (ISA) of Nvidia GPUs. Using microbenchmarks, we measure the clock cycles for PTX ISA instructions and their SASS ISA instruction counterparts. We further calculate the clock cycles needed to access each memory unit. We also demystify the new version of the tensor core unit found in the Ampere architecture by using the WMMA API and measuring its clock cycles per instruction and throughput for the different data types and input shapes. The results found in this work should guide software developers and hardware architects. Furthermore, the clock cycles per instruction are widely used by performance modeling simulators and tools to model and predict the performance of the hardware.
Submitted 23 August, 2022;
originally announced August 2022.
-
BB-ML: Basic Block Performance Prediction using Machine Learning Techniques
Authors:
Hamdy Abdelkhalik,
Shamminuj Aktar,
Yehia Arafa,
Atanu Barai,
Gopinath Chennupati,
Nandakishore Santhi,
Nishant Panda,
Nirmal Prajapati,
Nazmul Haque Turja,
Stephan Eidenbenz,
Abdel-Hameed Badawy
Abstract:
Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the level of Basic Blocks (BBs), which are single-entry, single-exit code blocks that compilers use for analysis to break down large code into manageable pieces. We extrapolate the basic block execution counts of GPU applications and use them for predicting the performance for large input sizes from the counts of smaller input sizes. We train a Poisson Neural Network (PNN) model using random input values as well as the lowest input values of the application to learn the relationship between inputs and basic block counts. Experimental results show that the model can accurately predict the basic block execution counts of 16 GPU benchmarks. We achieve an accuracy of 93.5% in extrapolating the basic block counts for large input sets when trained on smaller input sets and an accuracy of 97.7% in predicting basic block counts on random instances. In a case study, we apply the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications. We use a variety of metrics for evaluation, including global memory requests and the active cycles of tensor cores, ALU, and FMA units. Results demonstrate the model's capability of predicting the performance of large datasets with an average error rate of 0.85% and 0.17% for global and shared memory requests, respectively. Additionally, to address the utilization of the main functional units in Ampere architecture GPUs, we calculate the active cycles for tensor cores, ALU, FMA, and FP64 units and achieve an average error of 2.3% and 10.66% for ALU and FMA units, respectively, while the maximum observed error across all tested applications and units reaches 18.5%.
Submitted 11 November, 2023; v1 submitted 15 February, 2022;
originally announced February 2022.
-
PPT-Multicore: Performance Prediction of OpenMP applications using Reuse Profiles and Analytical Modeling
Authors:
Atanu Barai,
Yehia Arafa,
Abdel-Hameed Badawy,
Gopinath Chennupati,
Nandakishore Santhi,
Stephan Eidenbenz
Abstract:
We present PPT-Multicore, an analytical model embedded in the Performance Prediction Toolkit (PPT) to predict the performance of parallel applications running on a multicore processor. PPT-Multicore builds upon our previous work towards a multicore cache model. We extract an LLVM basic-block-labeled memory trace using an architecture-independent LLVM-based instrumentation tool only once in an application's lifetime. The model uses the memory trace and other parameters from an instrumented sequentially executed binary. We use a probabilistic and computationally efficient reuse profile to predict the cache hit rates and runtimes of OpenMP programs' parallel sections. We model Intel's Broadwell, Haswell, and AMD's Zen2 architectures and validate our framework using different applications from PolyBench and PARSEC benchmark suites. The results show that PPT-Multicore can predict cache hit rates with an overall average error rate of 1.23% while predicting the runtime with an error rate of 9.08%.
Submitted 11 April, 2021;
originally announced April 2021.
-
PPT-SASMM: Scalable Analytical Shared Memory Model: Predicting the Performance of Multicore Caches from a Single-Threaded Execution Trace
Authors:
Atanu Barai,
Gopinath Chennupati,
Nandakishore Santhi,
Abdel-Hameed Badawy,
Yehia Arafa,
Stephan Eidenbenz
Abstract:
Performance modeling of parallel applications on multicore processors remains a challenge in computational co-design due to multicore processors' complex design. Multicores include complex private and shared memory hierarchies. We present a Scalable Analytical Shared Memory Model (SASMM). SASMM can predict the performance of parallel applications running on a multicore. SASMM uses a probabilistic and computationally efficient method to predict the reuse distance profiles of caches in multicores. SASMM relies on a stochastic, static basic block-level analysis of reuse profiles. The profiles are calculated from the memory traces of applications that run sequentially rather than using multi-threaded traces. The experiments show that our model can predict private L1 cache hit rates with an error rate of 2.12% and shared L2 cache hit rates with an error rate of about 1.50%.
Submitted 19 March, 2021;
originally announced March 2021.
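The link between a reuse distance profile and a cache hit rate, which the abstract above relies on, can be illustrated with a small sketch. This is not the SASMM method itself, which the abstract describes as probabilistic and basic-block based; it is the exact textbook stack-distance computation, assuming a single fully associative LRU cache and a trace of cache-line addresses. The function names `reuse_distances`, `reuse_profile`, and `lru_hit_rate` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def reuse_distances(trace):
    """Return the reuse distance of every access in the trace.

    The reuse distance is the number of distinct addresses touched since
    the previous access to the same address; first-time accesses get None.
    """
    stack = []  # LRU stack: most recently used address is at the end
    distances = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            # Everything above addr in the stack was touched since its last use.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)  # cold (compulsory) miss
        stack.append(addr)
    return distances


def reuse_profile(trace):
    """Histogram mapping reuse distance -> number of accesses."""
    return Counter(reuse_distances(trace))


def lru_hit_rate(profile, capacity):
    """Hit rate of a fully associative LRU cache holding `capacity` lines.

    An access hits exactly when its reuse distance is smaller than the capacity.
    """
    total = sum(profile.values())
    hits = sum(n for d, n in profile.items() if d is not None and d < capacity)
    return hits / total if total else 0.0


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "c"]  # hypothetical cache-line addresses
    profile = reuse_profile(trace)
    for capacity in (2, 3):
        print(capacity, lru_hit_rate(profile, capacity))
```

Once the profile is known, the hit rate for any cache size follows from the same histogram without replaying the trace, which is what makes reuse profiles attractive for analytical models. Set-associative caches and shared caches across threads need additional modeling that this sketch does not attempt.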
-
Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs
Authors:
Yehia Arafa,
Ammar ElWazir,
Abdelrahman ElKanishy,
Youssef Aly,
Ayatelrahman Elsayed,
Abdel-Hameed Badawy,
Gopinath Chennupati,
Stephan Eidenbenz,
Nandakishore Santhi
Abstract:
GPUs are prevalent in modern computing systems at all scales. They consume a significant fraction of the energy in these systems. However, vendors do not publish the actual cost of the power/energy overhead of their internal microarchitecture. In this paper, we accurately measure the energy consumption of various PTX instructions found in modern NVIDIA GPUs. We provide an exhaustive comparison of more than 40 instructions for four high-end NVIDIA GPUs from four different generations (Maxwell, Pascal, Volta, and Turing). Furthermore, we show the effect of the CUDA compiler optimizations on the energy consumption of each instruction. We use three different software techniques, which rely on NVIDIA's NVML API, to read the GPU on-chip power sensors, and we provide an in-depth comparison between these techniques. Additionally, we verified the software measurement techniques against a custom-designed hardware power measurement. The results show that Volta GPUs have the best energy efficiency among all the generations for the different categories of instructions. This work should aid in understanding NVIDIA GPUs' microarchitecture. It should also make energy measurements of any GPU kernel both efficient and accurate.
Submitted 2 June, 2020; v1 submitted 18 February, 2020;
originally announced February 2020.
-
Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs
Authors:
Yehia Arafa,
Abdel-Hameed Badawy,
Gopinath Chennupati,
Nandakishore Santhi,
Stephan Eidenbenz
Abstract:
The last decade has seen a shift in the computer systems industry where heterogeneous computing has become prevalent. Graphics Processing Units (GPUs) are now present in systems ranging from supercomputers to mobile phones and tablets. GPUs are used for graphics operations as well as general-purpose computing (GPGPUs) to boost the performance of compute-intensive applications. However, a substantial share of their characteristics remains undisclosed beyond what vendors provide. In this paper, we introduce a very low overhead and portable analysis for exposing the latency of each instruction executing in the GPU pipeline(s) and the access overhead of the various memory hierarchies found in GPUs at the micro-architecture level. Furthermore, we show the impact of the various optimizations the CUDA compiler can perform on the various latencies. We perform our evaluation on seven different high-end NVIDIA GPUs from five different generations/architectures: Kepler, Maxwell, Pascal, Volta, and Turing. The results in this paper can help architects accurately characterize the latencies of these GPUs, which will help in modeling the hardware accurately. Software developers can also use them to make informed optimizations to their applications.
Submitted 1 September, 2019; v1 submitted 21 May, 2019;
originally announced May 2019.