Parallel Algorithms for Masked Sparse Matrix-Matrix Products
Authors:
Srđan Milaković,
Oguz Selvitopi,
Israt Nisa,
Zoran Budimlić,
Aydin Buluc
Abstract:
Computing the product of two sparse matrices (SpGEMM) is a fundamental operation in various combinatorial and graph algorithms as well as various bioinformatics and data analytics applications for computing inner-product similarities. For an important class of algorithms, only a subset of the output entries are needed, and the resulting operation is known as Masked SpGEMM since a subset of the out…
▽ More
Computing the product of two sparse matrices (SpGEMM) is a fundamental operation in various combinatorial and graph algorithms as well as various bioinformatics and data analytics applications for computing inner-product similarities. For an important class of algorithms, only a subset of the output entries are needed, and the resulting operation is known as Masked SpGEMM since a subset of the output entries is considered to be "masked out". Existing algorithms for Masked SpGEMM usually do not consider mask as part of multiplication and either first compute a regular SpGEMM followed by masking, or perform a sparse inner product only for output elements that are not masked out. In this work, we investigate various novel algorithms and data structures for this rather challenging and important computation, and provide guidelines on how to design a fast Masked-SpGEMM for shared-memory architectures. Our evaluations show that factors such as matrix and mask density, mask structure and cache behavior play a vital role in attaining high performance for Masked SpGEMM. We evaluate our algorithms on a large number of matrices using several real-world benchmarks and show that our algorithms in most cases significantly outperform the state of the art for Masked SpGEMM implementations.
△ Less
Submitted 18 November, 2021;
originally announced November 2021.
Parallel Binary Code Analysis
Authors:
Xiaozhu Meng,
Jonathon M. Anderson,
John Mellor-Crummey,
Mark W. Krentel,
Barton P. Miller,
Srđan Milaković
Abstract:
Binary code analysis is widely used to assess a program's correctness, performance, and provenance. Binary analysis applications often construct control flow graphs, analyze data flow, and use debugging information to understand how machine code relates to source lines, inlined functions, and data types. To date, binary analysis has been single-threaded, which is too slow for applications such as…
▽ More
Binary code analysis is widely used to assess a program's correctness, performance, and provenance. Binary analysis applications often construct control flow graphs, analyze data flow, and use debugging information to understand how machine code relates to source lines, inlined functions, and data types. To date, binary analysis has been single-threaded, which is too slow for applications such as performance analysis and software forensics, where it is becoming common to analyze binaries that are gigabytes in size and in large batches that contain thousands of binaries.
This paper describes our design and implementation for accelerating the task of constructing control flow graphs (CFGs) from binaries with multithreading. Existing research focuses on addressing challenging code constructs encountered during constructing CFGs, including functions sharing code, jump table analysis, non-returning functions, and tail calls. However, existing analyses do not consider the complex interactions between concurrent analysis of shared code, making it difficult to extend existing serial algorithms to be parallel. A systematic methodology to guide the design of parallel algorithms is essential. We abstract the task of constructing CFGs as repeated applications of several core CFG operations regarding to creating functions, basic blocks, and edges. We then derive properties among CFG operations, including operation dependency, commutativity, monotonicity. These operation properties guide our design of a new parallel analysis for constructing CFGs. We achieved as much as 25$\times$ speedup for constructing CFGs on 64 hardware threads. Binary analysis applications are significantly accelerated with the new parallel analysis: we achieve 8$\times$ for a performance analysis tool and 7$\times$ for a software forensic tool with 16 hardware threads.
△ Less
Submitted 16 May, 2020; v1 submitted 28 January, 2020;
originally announced January 2020.