Search | arXiv e-print repository

BB-ML: Basic Block Performance Prediction using Machine Learning Techniques

Authors: Hamdy Abdelkhalik, Shamminuj Aktar, Yehia Arafa, Atanu Barai, Gopinath Chennupati, Nandakishore Santhi, Nishant Panda, Nirmal Prajapati, Nazmul Haque Turja, Stephan Eidenbenz, Abdel-Hameed Badawy

Abstract: Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the level of Basic Blocks (BBs), which are single-entry, single-exit code blocks that compilers use for analysis to break down a large code into manageable pieces. We extrapolate the basic block execution counts of GPU applications and use them for predicting the performance for large input sizes from the counts of smaller input sizes. We train a Poisson Neural Network (PNN) model using random input values as well as the lowest input values of the application to learn the relationship between inputs and basic block counts. Experimental results show that the model can accurately predict the basic block execution counts of 16 GPU benchmarks. We achieve an accuracy of 93.5% in extrapolating the basic block counts for large input sets when trained on smaller input sets and an accuracy of 97.7% in predicting basic block counts on random instances. In a case study, we apply the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications. We use a variety of metrics for evaluation, including global memory requests and the active cycles of tensor cores, ALU, and FMA units. Results demonstrate the model's capability of predicting the performance of large datasets with an average error rate of 0.85% and 0.17% for global and shared memory requests, respectively. Additionally, to address the utilization of the main functional units in Ampere architecture GPUs, we calculate the active cycles for tensor cores, ALU, FMA, and FP64 units and achieve an average error of 2.3% and 10.66% for ALU and FMA units, while the maximum observed error across all tested applications and units reaches 18.5%.

Submitted 11 November, 2023; v1 submitted 15 February, 2022; originally announced February 2022.

Comments: Accepted at the 29th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2023)

arXiv:1802.01957 [pdf, other]

Analytical Cost Metrics : Days of Future Past

Authors: Nirmal Prajapati, Sanjay Rajopadhye, Hristo Djidjev

Abstract: As we move towards the exascale era, new architectures must be capable of running massive computational problems efficiently. Scientists and researchers are continuously investing in tuning the performance of extreme-scale computational problems. These problems arise in almost all areas of computing, ranging from big data analytics, artificial intelligence, search, machine learning, virtual/augmented reality, computer vision, and image/signal processing to computational science and bioinformatics. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. Therefore, the major challenge that we face in computing systems research is: "how to solve massive-scale computational problems in the most time/power/energy efficient manner?" Architectures are constantly evolving, making current performance-optimization strategies less applicable and requiring new strategies to be invented. The solution is for the new architectures, new programming models, and applications to go forward together. Doing this is, however, extremely hard. There are too many design choices in too many dimensions. We propose the following strategy to solve the problem: (i) Models - Develop accurate analytical models (e.g. execution time, energy, silicon area) to predict the cost of executing a given program, and (ii) Complete System Design - Simultaneously optimize all the cost models for the programs (computational problems) to obtain the most time/area/power/energy efficient solution. Such an optimization problem evokes the notion of codesign.

Submitted 5 February, 2018; originally announced February 2018.

arXiv:1802.00166 [pdf, other]

PCOT: Cache Oblivious Tiling of Polyhedral Programs

Authors: Waruna Ranasinghe, Nirmal Prajapati, Tomofumi Yuki, Sanjay Rajopadhye

Abstract: This paper studies two variants of tiling: iteration space tiling (or loop blocking) and cache-oblivious methods that recursively split the iteration space with divide-and-conquer. The key question to answer is when we should be using one over the other. The answer to this question is complicated for modern architectures for a number of reasons. In this paper, we present a detailed empirical study to answer this question for a range of kernels that fit the polyhedral model. Our study is based on a generalized cache-oblivious code generator that supports this class, which is a superset of those supported by existing tools. The conclusion is that cache-oblivious code is most useful when the aim is to reduce off-chip memory accesses, e.g., to lower energy, although certain situations exist that diminish its effectiveness.

Submitted 1 February, 2018; originally announced February 2018.
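
The two tiling variants contrasted in the abstract above can be illustrated on a much simpler kernel than those studied in the paper. The following is a minimal sketch, assuming a plain matrix transpose as a stand-in; the function names and the `tile` and `base` parameters are hypothetical and do not come from the PCOT code generator. The first function uses fixed-size blocks, the second recursively halves the larger dimension and only uses `base` as a recursion cutoff.

```python
def transpose_tiled(a, tile=4):
    """Iteration-space tiling: visit the matrix in fixed tile x tile blocks."""
    n, m = len(a), len(a[0])
    out = [[None] * n for _ in range(m)]
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            # Inner loops stay inside one tile (clipped at the matrix edge).
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, m)):
                    out[j][i] = a[i][j]
    return out


def transpose_oblivious(a, base=4):
    """Divide-and-conquer: recursively split the larger dimension in half."""
    if base < 1:
        raise ValueError("base must be at least 1")
    n, m = len(a), len(a[0])
    out = [[None] * n for _ in range(m)]

    def split(i0, i1, j0, j1):
        di, dj = i1 - i0, j1 - j0
        if di <= base and dj <= base:
            # Small enough: copy the block directly.
            for i in range(i0, i1):
                for j in range(j0, j1):
                    out[j][i] = a[i][j]
        elif di >= dj:
            mid = i0 + di // 2
            split(i0, mid, j0, j1)
            split(mid, i1, j0, j1)
        else:
            mid = j0 + dj // 2
            split(i0, i1, j0, mid)
            split(i0, i1, mid, j1)

    split(0, n, 0, m)
    return out
```

Both functions compute the same result; they differ only in the order in which elements are visited, which is what affects cache behavior on real hardware.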

arXiv:1801.05909 [pdf, other]

Scheduling and Tiling Reductions on Realistic Machines

Authors: Nirmal Prajapati

Abstract: Computations, where the number of results is much smaller than the input data and are produced through some sort of accumulation, are called Reductions. Reductions appear in many scientific applications. Usually, reductions admit an associative and commutative binary operator over accumulation. Reductions are therefore highly parallel. Given unbounded fan-in, one can execute a reduction in constant/linear time provided that the data is available. However, due to the fact that real machines have bounded fan-in, accumulations cannot be performed in one time step and have to be broken into parts. Thus, a (partial) serialization of reductions becomes necessary. This makes scheduling reductions a difficult and interesting problem. There have been a number of research works in the context of scheduling reductions. We focus on the scheduling techniques presented in Gupta et al., identify a potential issue in their scheduling algorithm and provide a solution. In addition, we demonstrate how these scheduling techniques can be extended to "tile" reductions and briefly survey other studies that address the problem of scheduling reductions.

Submitted 17 January, 2018; originally announced January 2018.
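
The effect of bounded fan-in described in the abstract above can be sketched with a tree reduction. This is a minimal illustration, not the scheduling algorithm of the paper or of Gupta et al.; `tree_reduce` and its `fan_in` parameter are hypothetical names. Each level combines at most `fan_in` operands per node, so the number of levels grows as the fan-in shrinks.

```python
from functools import reduce
import operator


def tree_reduce(values, op, fan_in=2):
    """Reduce `values` with an associative operator `op`, combining at most
    `fan_in` operands per node. Returns (result, number_of_levels)."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(values)
    if not level:
        raise ValueError("cannot reduce an empty sequence")
    levels = 0
    while len(level) > 1:
        # All groups within one level are independent and could run in parallel.
        level = [reduce(op, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        levels += 1
    return level[0], levels


result, levels = tree_reduce(range(1, 17), operator.add, fan_in=2)
```

With 16 inputs, a fan-in of 16 finishes in a single level, while a fan-in of 2 needs four levels; that gap is the partial serialization the abstract refers to.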

arXiv:1712.04892 [pdf, other]

Accelerator Codesign as Non-Linear Optimization

Authors: Nirmal Prajapati, Sanjay Rajopadhye, Hristo Djidjev, Nandkishore Santhi, Tobias Grosser, Rumen Andonov

Abstract: We propose an optimization approach for determining both hardware and software parameters for the efficient implementation of a (family of) applications called dense stencil computations on programmable GPGPUs. We first introduce a simple, analytical model for the silicon area usage of accelerator architectures and a workload characterization of stencil computations. We combine this characterization with a parametric execution time model and formulate a mathematical optimization problem. That problem seeks to maximize a common objective function of 'all the hardware and software parameters'. The solution to this problem, therefore "solves" the codesign problem: simultaneously choosing software-hardware parameters to optimize total performance. We validate this approach by proposing architectural variants of the NVIDIA Maxwell GTX-980 (respectively, Titan X) specifically tuned to a predetermined workload of four common 2D stencils (Heat, Jacobi, Laplacian, and Gradient) and two 3D ones (Heat and Laplacian). Our model predicts that performance would potentially improve by 28% (respectively, 33%) with simple tweaks to the hardware parameters such as adapting coarse and fine-grained parallelism by changing the number of streaming multiprocessors and the number of compute cores each contains. We propose a set of Pareto-optimal design points to exploit the trade-off between performance and silicon area and show that by additionally eliminating GPU caches, we can get a further 2-fold improvement.

Submitted 13 December, 2017; originally announced December 2017.

Comments: 10 pages, 4 figures, 2 tables

arXiv:1610.07236 [pdf, other]

Hybrid Static/Dynamic Schedules for Tiled Polyhedral Programs

Authors: Tian Jin, Nirmal Prajapati, Waruna Ranasinghe, Guillaume Iooss, Yun Zou, Sanjay Rajopadhye, David Wonnacott

Abstract: Polyhedral compilers perform optimizations such as tiling and parallelization; when doing both, they usually generate code that executes "barrier-synchronized wavefronts" of tiles. We present a system to express and generate code for hybrid schedules, where some constraints are automatically satisfied through the structure of the code, and the remainder are dynamically enforced at run-time with data flow mechanisms. We prove bounds on the added overheads that are better, by at least one polynomial degree, than those of previous techniques. We propose a generic mechanism to implement the needed synchronization, and show it can be easily realized for a variety of targets: OpenMP, Pthreads, GPU (CUDA or OpenCL) code, languages like X10, Habanero, Cilk, as well as data flow platforms like DAGuE, and OpenStream and MPI. We also provide a simple concrete implementation that works without the need of any sophisticated run-time mechanism. Our experiments show our simple implementation to be competitive or better than the wavefront-synchronized code generated by other systems. We also show how the proposed mechanism can achieve 24% to 70% reduction in energy.

Submitted 23 October, 2016; originally announced October 2016.

arXiv:1006.3848 [pdf]

doi 10.5121/jgraphoc.2010.2202

Case Study On Social Engineering Techniques for Persuasion

Authors: Mosin Hasan, Nilesh Prajapati, Safvan Vohara

Abstract: There is plenty of security software on the market, each product claiming to be the best, yet we still face viruses and other malicious activity daily. If we know the basic working principle of such malware, we can very easily prevent most of it even without security software. Hackers and crackers are experts in psychology who manipulate people into giving them access or the information necessary to get access. This paper discusses the inner workings of such attacks. A case study of spyware is provided. In this case study, we achieved 100% success using social engineering techniques for deception on the Linux operating system, which is considered the most secure operating system. A few basic principles of defense, for the individual as well as for the organization, are discussed here; following them will prevent most such attacks.

Submitted 19 June, 2010; originally announced June 2010.

Comments: 7 Pages

Journal ref: International journal on applications of graph theory in wireless ad hoc networks and sensor networks 2.2 (2010) 17-23

arXiv:1003.3553 [pdf]

doi 10.5121/jgraphhoc.2010.2101

Simulated Annealing for Location Area Planning in Cellular networks

Authors: N. B. Prajapati, R. R. Agravat, M. I. Hasan

Abstract: Location area (LA) planning in cellular networks is useful for minimizing location management cost in GSM networks. In fact, the size of an LA can be optimized to balance the LA update rate against the expected paging rate within the LA. To obtain an optimal result for LA planning in a cellular network, a simulated annealing algorithm is used. Simulated annealing gives optimal results in acceptable run-time.

Submitted 18 March, 2010; originally announced March 2010.

Comments: 7 Pages, JGraph-Hoc Journal

Journal ref: International journal on applications of graph theory in wireless ad hoc networks and sensor networks 2.1 (2010) 1-7
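
The balance between update and paging cost mentioned in the abstract above can be sketched as a small simulated annealing loop. This is a minimal illustration under stated assumptions, not the authors' formulation: the cost model in `la_cost`, the weights `c_update` and `c_page`, the cooling schedule, and all function names are hypothetical. Cells are assigned to location areas; handoffs that cross an LA border incur an update cost, and each incoming call pages every cell of its LA.

```python
import math
import random


def la_cost(assign, handoff, calls, c_update=1.0, c_page=0.1):
    """Assumed cost: border-crossing handoffs plus paging proportional to LA size."""
    n = len(assign)
    update = sum(handoff[i][j] for i in range(n) for j in range(n)
                 if assign[i] != assign[j])
    size = {}
    for la in assign:
        size[la] = size.get(la, 0) + 1
    paging = sum(calls[i] * size[assign[i]] for i in range(n))
    return c_update * update + c_page * paging


def anneal(handoff, calls, n_areas, steps=2000, t0=5.0, alpha=0.995, seed=0):
    """Return the best assignment found and its cost."""
    rng = random.Random(seed)
    n = len(calls)
    current = [rng.randrange(n_areas) for _ in range(n)]
    cur_cost = la_cost(current, handoff, calls)
    best, best_cost = current[:], cur_cost
    t = t0
    for _ in range(steps):
        cand = current[:]
        cand[rng.randrange(n)] = rng.randrange(n_areas)  # move one cell to a random LA
        cost = la_cost(cand, handoff, calls)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if cost <= cur_cost or rng.random() < math.exp((cur_cost - cost) / t):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand[:], cost
        t *= alpha  # geometric cooling
    return best, best_cost


# Six cells in a ring, with symmetric handoff traffic between neighbours.
ring = [[1.0 if abs(i - j) in (1, 5) else 0.0 for j in range(6)] for i in range(6)]
best, best_cost = anneal(ring, calls=[1.0] * 6, n_areas=3)
```

Putting every cell in one LA removes all update cost but maximizes paging, while one cell per LA does the opposite; the annealer searches between these extremes.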

Showing 1–8 of 8 results for author: Prajapati, N