Chimbuko: A Workflow-Level Scalable Performance Trace Analysis Tool
Authors: Sungsoo Ha, Wonyong Jeong, Gyorgy Matyasfalvi, Cong Xie, Kevin Huck, Jong Youl Choi, Abid Malik, Li Tang, Hubertus Van Dam, Line Pouchard, Wei Xu, Shinjae Yoo, Nicholas D'Imperio, Kerstin Kleese Van Dam
Abstract:
Because of the limits input/output systems currently impose on high-performance computing systems, a new generation of workflows that include online data reduction and analysis is emerging. Diagnosing their performance requires sophisticated performance analysis capabilities due to the complexity of execution patterns and underlying hardware, and no tool could handle the voluminous performance trace data needed to detect potential problems. This work introduces Chimbuko, a performance analysis framework that provides real-time, distributed, in situ anomaly detection. Data volumes are reduced for human-level processing without losing necessary details. Chimbuko supports online performance monitoring via a visualization module that presents the overall workflow anomaly distribution, call stacks, and timelines. Chimbuko also supports the capture and reduction of performance provenance. To the best of our knowledge, Chimbuko is the first online, distributed, and scalable workflow-level performance trace analysis framework, and we demonstrate the tool's usefulness on Oak Ridge National Laboratory's Summit system.
Submitted 31 August, 2020;
originally announced August 2020.
Efficient Distributed-Memory Parallel Matrix-Vector Multiplication with Wide or Tall Unstructured Sparse Matrices
Authors: Jonathan Eckstein, Gyorgy Matyasfalvi
Abstract:
This paper presents an efficient technique for matrix-vector and vector-transpose-matrix multiplication in distributed-memory parallel computing environments, where the matrices are unstructured, sparse, and have a substantially larger number of columns than rows or vice versa. Our method allows for parallel I/O, does not require extensive preprocessing, and has the same communication complexity as matrix-vector multiplies with column or row partitioning. Our implementation of the method uses MPI. We partition the matrix by individual nonzero elements, rather than by row or column, and use an "overlapped" vector representation that is matched to the matrix. The transpose multiplies use matrix-specific MPI communicators and reductions that we show can be set up in an efficient manner. The proposed technique achieves a good work per processor balance even if some of the columns are dense, while keeping communication costs relatively low.
Submitted 3 December, 2018;
originally announced December 2018.
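The partitioning idea in the abstract above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' MPI implementation: the round-robin assignment of nonzeros, the dictionary used as an "overlapped" vector, and all function names are hypothetical choices made for illustration, and the in-loop summations merely stand in for the MPI reductions the abstract mentions.

```python
from collections import defaultdict

def partition_nonzeros(entries, num_procs):
    """Deal nonzeros (row, col, value) round-robin to simulated processes.
    Round-robin is an assumption; the paper's actual scheme may differ."""
    parts = [[] for _ in range(num_procs)]
    for k, entry in enumerate(entries):
        parts[k % num_procs].append(entry)
    return parts

def overlapped_vector(x, part):
    """Local copy of only those x entries whose columns occur in this part.
    Entries may be duplicated across parts, hence 'overlapped'."""
    return {col: x[col] for _, col, _ in part}

def matvec(parts, x, num_rows):
    """y = A x: each simulated process forms a partial y, then partials are summed."""
    y = [0.0] * num_rows
    for part in parts:
        local_x = overlapped_vector(x, part)
        partial = defaultdict(float)
        for row, col, val in part:
            partial[row] += val * local_x[col]
        for row, s in partial.items():  # stands in for a sum-reduction over rows
            y[row] += s
    return y

def transpose_matvec(parts, w, num_cols):
    """z = A^T w: a column touched by several parts is summed across those parts."""
    z = [0.0] * num_cols
    for part in parts:
        partial = defaultdict(float)
        for row, col, val in part:
            partial[col] += val * w[row]
        for col, s in partial.items():  # stands in for a per-column reduction
            z[col] += s
    return z

# A small 2 x 5 "wide" matrix whose last column is dense.
entries = [
    (0, 0, 1.0), (0, 1, 2.0), (0, 4, 3.0),
    (1, 1, 4.0), (1, 2, 5.0), (1, 3, 6.0), (1, 4, 7.0),
]
x = [1.0, 1.0, 1.0, 1.0, 1.0]
w = [1.0, 2.0]

parts = partition_nonzeros(entries, 3)
loads = [len(p) for p in parts]   # nonzeros held by each simulated process
y = matvec(parts, x, 2)
z = transpose_matvec(parts, w, 5)
print(loads, y, z)
```

Because work is assigned per nonzero, each simulated process holds nearly the same number of entries regardless of how the nonzeros are distributed over rows and columns, which is the load-balance property the abstract claims for dense columns.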