-
LLVM Static Analysis for Program Characterization and Memory Reuse Profile Estimation
Authors:
Atanu Barai,
Nandakishore Santhi,
Abdur Razzak,
Stephan Eidenbenz,
Abdel-Hameed A. Badawy
Abstract:
Profiling various application characteristics, including the number of different arithmetic operations performed, memory footprint, etc., dynamically is time- and space-consuming. On the other hand, static analysis methods, although fast, can be less accurate. This paper presents an LLVM-based probabilistic static analysis method that accurately predicts different program characteristics and estimates the reuse distance profile of a program by analyzing the LLVM IR file in constant time, regardless of program input size. We generate the basic-block-level control flow graph of the target application kernel and determine basic-block execution counts by solving the linear balance equation involving the adjacent basic blocks' transition probabilities. Finally, we represent the kernel memory accesses in a bracketed format and employ a recursive algorithm to calculate the reuse distance profile. The results show that our approach can predict application characteristics accurately compared to another LLVM-based dynamic code analysis tool, Byfl.
Submitted 20 November, 2023;
originally announced November 2023.
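The balance-equation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`solve_linear`, `block_counts`), the four-block control flow graph, and the transition probabilities are all hypothetical. The sketch assumes that each block's expected execution count equals the entry contribution plus the counts of its predecessors weighted by their transition probabilities.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (e.g. a loop that never exits)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def block_counts(prob, entry):
    """Expected execution count per basic block.

    prob[i][j] is the assumed probability of moving from block i to block j.
    Balance equation: count[j] = [j == entry] + sum_i count[i] * prob[i][j].
    """
    n = len(prob)
    a = [[(1.0 if i == j else 0.0) - prob[j][i] for j in range(n)]
         for i in range(n)]
    b = [1.0 if i == entry else 0.0 for i in range(n)]
    return solve_linear(a, b)


# Hypothetical CFG: 0 = entry, 1 = loop header, 2 = loop body, 3 = exit.
prob = [
    [0.0, 1.0, 0.0, 0.0],  # entry always falls into the loop header
    [0.0, 0.0, 0.9, 0.1],  # header enters the body 90% of the time
    [0.0, 1.0, 0.0, 0.0],  # body always branches back to the header
    [0.0, 0.0, 0.0, 0.0],  # exit has no successors
]
counts = block_counts(prob, entry=0)
print(counts)
```

For this example the counts come out at roughly 1, 10, 9, and 1: the header runs about ten times because each visit has a 10% chance of leaving the loop. The cost of the solve depends on the number of basic blocks, not on the program input size.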
-
BB-ML: Basic Block Performance Prediction using Machine Learning Techniques
Authors:
Hamdy Abdelkhalik,
Shamminuj Aktar,
Yehia Arafa,
Atanu Barai,
Gopinath Chennupati,
Nandakishore Santhi,
Nishant Panda,
Nirmal Prajapati,
Nazmul Haque Turja,
Stephan Eidenbenz,
Abdel-Hameed Badawy
Abstract:
Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the Basic Block (BB) level, which are single entry, single exit code blocks that are used for analysis by the compilers to break down a large code into manageable pieces. We extrapolate the basic block execution counts of GPU applications and use them for predicting the performance for large input sizes from the counts of smaller input sizes. We train a Poisson Neural Network (PNN) model using random input values as well as the lowest input values of the application to learn the relationship between inputs and basic block counts. Experimental results show that the model can accurately predict the basic block execution counts of 16 GPU benchmarks. We achieve an accuracy of 93.5% in extrapolating the basic block counts for large input sets when trained on smaller input sets and an accuracy of 97.7% in predicting basic block counts on random instances. In a case study, we apply the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications. We use a variety of metrics for evaluation, including global memory requests and the active cycles of tensor cores, ALU, and FMA units. Results demonstrate the model's capability of predicting the performance of large datasets with an average error rate of 0.85% and 0.17% for global and shared memory requests, respectively. Additionally, to address the utilization of the main functional units in Ampere architecture GPUs, we calculate the active cycles for tensor cores, ALU, FMA, and FP64 units and achieve an average error of 2.3% and 10.66% for ALU and FMA units while the maximum observed error across all tested applications and units reaches 18.5%.
Submitted 11 November, 2023; v1 submitted 15 February, 2022;
originally announced February 2022.
-
PPT-Multicore: Performance Prediction of OpenMP applications using Reuse Profiles and Analytical Modeling
Authors:
Atanu Barai,
Yehia Arafa,
Abdel-Hameed Badawy,
Gopinath Chennupati,
Nandakishore Santhi,
Stephan Eidenbenz
Abstract:
We present PPT-Multicore, an analytical model embedded in the Performance Prediction Toolkit (PPT) to predict parallel application performance running on a multicore processor. PPT-Multicore builds upon our previous work towards a multicore cache model. We extract LLVM basic block labeled memory trace using an architecture-independent LLVM-based instrumentation tool only once in an application's lifetime. The model uses the memory trace and other parameters from an instrumented sequentially executed binary. We use a probabilistic and computationally efficient reuse profile to predict the cache hit rates and runtimes of OpenMP programs' parallel sections. We model Intel's Broadwell, Haswell, and AMD's Zen2 architectures and validate our framework using different applications from PolyBench and PARSEC benchmark suites. The results show that PPT-Multicore can predict cache hit rates with an overall average error rate of 1.23% while predicting the runtime with an error rate of 9.08%.
Submitted 11 April, 2021;
originally announced April 2021.
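To show how a reuse profile can translate into a cache hit rate, here is a simplified sketch. It assumes a fully associative LRU cache, where an access hits exactly when its reuse distance is smaller than the number of cache blocks. This is a textbook simplification and not the probabilistic model used in PPT-Multicore; the function name `lru_hit_rate` and the profile values are hypothetical.

```python
import math


def lru_hit_rate(profile, cache_blocks):
    """Estimate the hit rate of a fully associative LRU cache.

    profile maps a reuse distance to the fraction of accesses with that
    distance; math.inf marks first-time (cold) accesses, which always miss.
    """
    total = sum(profile.values())
    hits = sum(p for dist, p in profile.items() if dist < cache_blocks)
    return hits / total


# Hypothetical reuse profile: half of all accesses reuse the same block
# immediately, and one eighth are cold accesses.
profile = {0: 0.5, 3: 0.25, 64: 0.125, math.inf: 0.125}

small_cache = lru_hit_rate(profile, cache_blocks=8)    # distance-64 reuses miss
large_cache = lru_hit_rate(profile, cache_blocks=128)  # only cold accesses miss
print(small_cache, large_cache)
```

Because the profile is independent of the cache size, the same profile can be evaluated against several cache configurations without collecting the trace again.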
-
PPT-SASMM: Scalable Analytical Shared Memory Model: Predicting the Performance of Multicore Caches from a Single-Threaded Execution Trace
Authors:
Atanu Barai,
Gopinath Chennupati,
Nandakishore Santhi,
Abdel-Hameed Badawy,
Yehia Arafa,
Stephan Eidenbenz
Abstract:
Performance modeling of parallel applications on multicore processors remains a challenge in computational co-design due to multicore processors' complex design. Multicores include complex private and shared memory hierarchies. We present a Scalable Analytical Shared Memory Model (SASMM). SASMM can predict the performance of parallel applications running on a multicore. SASMM uses a probabilistic and computationally-efficient method to predict the reuse distance profiles of caches in multicores. SASMM relies on a stochastic, static basic block-level analysis of reuse profiles. The profiles are calculated from the memory traces of applications that run sequentially rather than using multi-threaded traces. The experiments show that our model can predict private L1 cache hit rates with 2.12% and shared L2 cache hit rates with about 1.50% error rate.
Submitted 19 March, 2021;
originally announced March 2021.
-
Modeling Shared Cache Performance of OpenMP Programs using Reuse Distance
Authors:
Atanu Barai,
Gopinath Chennupati,
Nandakishore Santhi,
Abdel-Hameed A. Badawy,
Stephan Eidenbenz
Abstract:
Performance modeling of parallel applications on multicore computers remains a challenge in computational co-design due to the complex design of multicore processors, including private and shared memory hierarchies. We present a Scalable Analytical Shared Memory Model to predict the performance of parallel applications that run on a multicore computer and share the same level of cache in the hierarchy. This model uses a computationally efficient, probabilistic method to predict the reuse distance profiles, where reuse distance is a hardware architecture-independent measure of the patterns of virtual memory accesses. It relies on a stochastic, static basic block-level analysis of reuse profiles measured from the memory traces of applications run sequentially on small instances rather than using a multi-threaded trace. The results indicate that the hit-rate predictions on the shared cache are accurate.
Submitted 29 July, 2019;
originally announced July 2019.
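Reuse distance, as used in this abstract, is commonly defined as the number of distinct addresses accessed between two consecutive accesses to the same address. The sketch below computes it directly from a short trace with an LRU stack. It is a plain exact computation for illustration, not the probabilistic method the paper describes, and the function names and the example trace are hypothetical.

```python
import math
from collections import Counter


def reuse_distances(trace):
    """Return the reuse distance of every access in the trace.

    The distance is the number of distinct addresses touched since the
    previous access to the same address; first accesses get math.inf.
    """
    stack = []  # LRU stack, most recently used address at the end
    distances = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(math.inf)
        stack.append(addr)
    return distances


def reuse_profile(trace):
    """Histogram of reuse distances, normalized to probabilities."""
    dists = reuse_distances(trace)
    return {d: n / len(dists) for d, n in Counter(dists).items()}


trace = ["a", "b", "c", "a", "b", "b"]
distances = reuse_distances(trace)
profile = reuse_profile(trace)
print(distances)
print(profile)
```

This list-based version takes time proportional to the trace length times the number of distinct addresses, which is fine for a small example but too slow for real traces; tree-based or sampled variants are the usual remedy.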
-
Development of a Device for Remote Monitoring of Heart Rate and Body Temperature
Authors:
Mohammad Ashekur Rahman,
Atanu Barai,
Md. Asadul Islam,
M. M. A Hashem
Abstract:
We present a new integrated, portable device that provides a convenient solution for remotely monitoring heart rate at the fingertip and body temperature using Ethernet technology and the increasingly widespread internet. Nowadays, heart-related disease is on the rise. In most of these cases, patients may not realize their actual condition, and it is common that no doctor is at their side, especially in rural areas, even though most of these diseases are now curable if detected in time.
We have tried to build a system that gives information about a person's physical condition and helps him or her detect these deadly but curable diseases. The system acquires heart rate and body temperature simultaneously on the portable side in real time and transmits the results to the web, so that heart condition and body temperature can be monitored from remote places. Ultimately, this device provides a low-cost, easily accessible human health monitoring solution that bridges the gap between patients and doctors.
Submitted 31 March, 2013;
originally announced April 2013.