Measurement and Analysis of GPU-accelerated Applications with HPCToolkit
Authors:
Keren Zhou,
Laksono Adhianto,
Jonathon Anderson,
Aaron Cherian,
Dejan Grubisic,
Mark Krentel,
Yumeng Liu,
Xiaozhu Meng,
John Mellor-Crummey
Abstract:
To address the challenge of performance analysis on the US DOE's forthcoming exascale supercomputers, Rice University has been extending its HPCToolkit performance tools to support measurement and analysis of GPU-accelerated applications. To help developers understand the performance of accelerated applications as a whole, HPCToolkit's measurement and analysis tools attribute metrics to calling contexts that span both CPUs and GPUs. To measure GPU-accelerated applications efficiently, HPCToolkit employs a novel wait-free data structure to coordinate monitoring and attribution of GPU performance. To help developers understand the performance of complex GPU code generated from high-level programming models, HPCToolkit constructs sophisticated approximations of call path profiles for GPU computations. To support fine-grained analysis and tuning, HPCToolkit uses PC sampling and instrumentation to measure and attribute GPU performance metrics to source lines, loops, and inlined code. To supplement fine-grained measurements, HPCToolkit can measure GPU kernel executions using hardware performance counters. To provide a view of how an execution evolves over time, HPCToolkit can collect, analyze, and visualize call path traces within and across nodes. Finally, on NVIDIA GPUs, HPCToolkit can derive and attribute a collection of useful performance metrics based on measurements using GPU PC samples. We illustrate HPCToolkit's new capabilities for analyzing GPU-accelerated applications with several codes developed as part of the Exascale Computing Project.
Submitted 14 September, 2021;
originally announced September 2021.
Parallel Binary Code Analysis
Authors:
Xiaozhu Meng,
Jonathon M. Anderson,
John Mellor-Crummey,
Mark W. Krentel,
Barton P. Miller,
Srđan Milaković
Abstract:
Binary code analysis is widely used to assess a program's correctness, performance, and provenance. Binary analysis applications often construct control flow graphs, analyze data flow, and use debugging information to understand how machine code relates to source lines, inlined functions, and data types. To date, binary analysis has been single-threaded, which is too slow for applications such as performance analysis and software forensics, where it is becoming common to analyze binaries that are gigabytes in size and in large batches that contain thousands of binaries.
This paper describes our design and implementation for accelerating the task of constructing control flow graphs (CFGs) from binaries with multithreading. Existing research focuses on addressing challenging code constructs encountered during CFG construction, including functions sharing code, jump table analysis, non-returning functions, and tail calls. However, existing analyses do not consider the complex interactions that arise when shared code is analyzed concurrently, making it difficult to extend existing serial algorithms to be parallel. A systematic methodology to guide the design of parallel algorithms is essential. We abstract the task of constructing CFGs as repeated applications of several core CFG operations for creating functions, basic blocks, and edges. We then derive properties among CFG operations, including operation dependency, commutativity, and monotonicity. These operation properties guide our design of a new parallel analysis for constructing CFGs. We achieve up to a 25$\times$ speedup for constructing CFGs on 64 hardware threads. Binary analysis applications are significantly accelerated with the new parallel analysis: we achieve an 8$\times$ speedup for a performance analysis tool and a 7$\times$ speedup for a software forensic tool with 16 hardware threads.
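The properties named in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm or data structure, neither of which the abstract specifies. The `CFG` class, its `add_block` and `add_edge` methods, and the `apply_ops` helper are hypothetical names invented for this illustration. The sketch only shows why operations that are commutative (order does not affect the result) and monotonic (the graph only grows) can be distributed over threads without changing the final graph.

```python
import threading
from itertools import permutations


class CFG:
    """Toy control flow graph whose operations only ever add elements."""

    def __init__(self):
        self._lock = threading.Lock()
        self.blocks = set()   # basic block start addresses
        self.edges = set()    # (source, target) address pairs

    def add_block(self, start):
        # Set insertion is idempotent, so repeating the operation is harmless.
        with self._lock:
            self.blocks.add(start)

    def add_edge(self, src, dst):
        # Adding an edge also records both endpoints as blocks.
        with self._lock:
            self.blocks.update((src, dst))
            self.edges.add((src, dst))

    def snapshot(self):
        with self._lock:
            return frozenset(self.blocks), frozenset(self.edges)


def apply_ops(ops, workers=1):
    """Apply (method_name, args) operations, split round-robin over threads."""
    cfg = CFG()

    def run(chunk):
        for name, args in chunk:
            getattr(cfg, name)(*args)

    threads = [threading.Thread(target=run, args=(ops[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cfg.snapshot()


# Hypothetical operation stream, including one duplicate edge.
OPS = [
    ("add_block", (0x10,)),
    ("add_edge", (0x10, 0x20)),
    ("add_edge", (0x20, 0x30)),
    ("add_edge", (0x10, 0x20)),
]

serial = apply_ops(OPS)
# Commutativity: every ordering of the operations yields the same graph.
assert all(apply_ops(list(p)) == serial for p in permutations(OPS))
# The same holds when the operations are spread across four threads.
assert apply_ops(OPS, workers=4) == serial
```

A real CFG builder must also handle operations that do not trivially commute; the constructs the abstract lists (shared code, jump tables, non-returning functions, tail calls) are examples where ordering interactions arise, and this sketch does not model them.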
Submitted 16 May, 2020; v1 submitted 28 January, 2020;
originally announced January 2020.