Search | arXiv e-print repository

Breaking the Molecular Dynamics Timescale Barrier Using a Wafer-Scale System

Authors: Kylee Santos, Stan Moore, Tomas Oppelstrup, Amirali Sharifian, Ilya Sharapov, Aidan Thompson, Delyan Z Kalchev, Danny Perez, Robert Schreiber, Scott Pakin, Edgar A Leon, James H Laros III, Michael James, Sivasankaran Rajamanickam

Abstract: Molecular dynamics (MD) simulations have transformed our understanding of the nanoscale, driving breakthroughs in materials science, computational chemistry, and several other fields, including biophysics and drug design. Even on exascale supercomputers, however, runtimes are excessive for systems and timescales of scientific interest. Here, we demonstrate strong scaling of MD simulations on the C… ▽ More Molecular dynamics (MD) simulations have transformed our understanding of the nanoscale, driving breakthroughs in materials science, computational chemistry, and several other fields, including biophysics and drug design. Even on exascale supercomputers, however, runtimes are excessive for systems and timescales of scientific interest. Here, we demonstrate strong scaling of MD simulations on the Cerebras Wafer-Scale Engine. By dedicating a processor core for each simulated atom, we demonstrate a 179-fold improvement in timesteps per second versus the Frontier GPU-based Exascale platform, along with a large improvement in timesteps per unit energy. Reducing every year of runtime to two days unlocks currently inaccessible timescales of slow microstructure transformation processes that are critical for understanding material behavior and function. Our dataflow algorithm runs Embedded Atom Method (EAM) simulations at rates over 270,000 timesteps per second for problems with up to 800k atoms. This demonstrated performance is unprecedented for general-purpose processing cores. △ Less

Submitted 13 May, 2024; originally announced May 2024.

Comments: 10 pages, 10 figures, 5 tables

arXiv:2304.13194 [pdf, other]

Jet: Multilevel Graph Partitioning on Graphics Processing Units

Authors: Michael S. Gilbert, Kamesh Madduri, Erik G. Boman, Sivasankaran Rajamanickam

Abstract: The multilevel heuristic is the dominant strategy for high-quality sequential and parallel graph partitioning. Partition refinement is a key step of multilevel graph partitioning. In this work, we present Jet, a new parallel algorithm for partition refinement specifically designed for Graphics Processing Units (GPUs). We combine Jet with GPU-aware coarsening to develop a $k$-way graph partitioner,… ▽ More The multilevel heuristic is the dominant strategy for high-quality sequential and parallel graph partitioning. Partition refinement is a key step of multilevel graph partitioning. In this work, we present Jet, a new parallel algorithm for partition refinement specifically designed for Graphics Processing Units (GPUs). We combine Jet with GPU-aware coarsening to develop a $k$-way graph partitioner, the Jet partitioner. The new partitioner achieves superior quality compared to state-of-the-art shared memory partitioners on a large collection of test graphs. △ Less

Submitted 5 January, 2024; v1 submitted 25 April, 2023; originally announced April 2023.

Comments: To appear in SIAM SISC journal

arXiv:2304.04876 [pdf, other]

An Experimental Study of Two-Level Schwarz Domain Decomposition Preconditioners on GPUs

Authors: Ichitaro Yamazaki, Alexander Heinlein, Sivasankaran Rajamanickam

Abstract: The generalized Dryja--Smith--Widlund (GDSW) preconditioner is a two-level overlap** Schwarz domain decomposition (DD) preconditioner that couples a classical one-level overlap** Schwarz preconditioner with an energy-minimizing coarse space. When used to accelerate the convergence rate of Krylov subspace iterative methods, the GDSW preconditioner provides robustness and scalability for the sol… ▽ More The generalized Dryja--Smith--Widlund (GDSW) preconditioner is a two-level overlap** Schwarz domain decomposition (DD) preconditioner that couples a classical one-level overlap** Schwarz preconditioner with an energy-minimizing coarse space. When used to accelerate the convergence rate of Krylov subspace iterative methods, the GDSW preconditioner provides robustness and scalability for the solution of sparse linear systems arising from the discretization of a wide range of partial different equations. In this paper, we present FROSch (Fast and Robust Schwarz), a domain decomposition solver package which implements GDSW-type preconditioners for both CPU and GPU clusters. To improve the solver performance on GPUs, we use a novel decomposition to run multiple MPI processes on each GPU, reducing both solver's computational and storage costs and potentially improving the convergence rate. This allowed us to obtain competitive or faster performance using GPUs compared to using CPUs alone. We demonstrate the performance of FROSch on the Summit supercomputer with NVIDIA V100 GPUs, where we used NVIDIA Multi-Process Service (MPS) to implement our decomposition strategy. The solver has a wide variety of algorithmic and implementation choices, which poses both opportunities and challenges for its GPU implementation. We conduct a thorough experimental study with different solver options including the exact or inexact solution of the local overlap** subdomain problems on a GPU. We also discuss the effect of using the iterative variant of the incomplete LU factorization and sparse-triangular solve as the approximate local solver, and using lower precision for computing the whole FROSch preconditioner. Overall, the solve time was reduced by factors of about $2\times$ using GPUs, while the GPU acceleration of the numerical setup time depend on the solver options and the local matrix sizes. △ Less

Submitted 10 April, 2023; originally announced April 2023.

Comments: Accepted for publication in IPDPS'23

arXiv:2303.11499 [pdf, other]

Exploiting Inter-Operation Data Reuse in Scientific Applications using GOGETA

Authors: Raveesh Garg, Michael Pellauer, Sivasankaran Rajamanickam, Tushar Krishna

Abstract: HPC applications are critical in various scientific domains ranging from molecular dynamics to chemistry to fluid dynamics. Conjugate Gradient (CG) is a popular application kernel used in iterative linear HPC solvers and has applications in numerous scientific domains. However, the HPCG benchmark shows that the peformance achieved by Top500 HPC systems on CG is a small fraction of the performance… ▽ More HPC applications are critical in various scientific domains ranging from molecular dynamics to chemistry to fluid dynamics. Conjugate Gradient (CG) is a popular application kernel used in iterative linear HPC solvers and has applications in numerous scientific domains. However, the HPCG benchmark shows that the peformance achieved by Top500 HPC systems on CG is a small fraction of the performance achieved on Linpack. While CG applications have significant portions of computations that are dense and sparse matrix multiplications, skewed SpMMs/GEMMs in the HPC solvers have poor arithmetic intensities which makes their execution highly memory bound unlike GEMMs in DNNs which have high arithmetic intensity. The problem of low intensity individual skewed GEMMs also exists in various emerging workloads from other domains like Graph Neural Networks, Transformers etc. In this work we identify various reuse opportunities between the tensors in these solver applications to extract reuse in the entire Directed Acyclic Graph of the tensor operations rather than individual tensor operations. These opportunities essentially depend on the dimensions of the tensors and the structure of the tensor dependency graph. We propose a systematic methodology to determine various kinds of reuse opportunities in the graph of operations and determine the loop order and tiling in the interdependent operations. As a result, we propose a novel map** strategy GOGETA that improves reuse of HPC applications on spatial accelerators. We also propose a data organization strategy in the buffer. Our map** achieves geomean 6.7x reduction in memory accesses. △ Less

Submitted 20 March, 2023; originally announced March 2023.

arXiv:2209.04541 [pdf, other]

PGAbB: A Block-Based Graph Processing Framework for Heterogeneous Platforms

Authors: Abdurrahman Yasar, Sivasankaran Rajamanickam, Jonathan W. Berry, Umit V. Catalyurek

Abstract: Designing flexible graph kernels that can run well on various platforms is a crucial research problem due to the frequent usage of graphs for modeling data and recent architectural advances and variety. In this work, we propose a novel graph processing framework, PGAbB (Parallel Graph Algorithms by Blocks), for modern shared-memory heterogeneous platforms. Our framework implements a block-based pr… ▽ More Designing flexible graph kernels that can run well on various platforms is a crucial research problem due to the frequent usage of graphs for modeling data and recent architectural advances and variety. In this work, we propose a novel graph processing framework, PGAbB (Parallel Graph Algorithms by Blocks), for modern shared-memory heterogeneous platforms. Our framework implements a block-based programming model. This allows a user to express a graph algorithm using kernels that operate on subgraphs. PGAbB support graph computations that fit in host DRAM but not in GPU device memory, and provides simple but effective scheduling techniques to schedule computations to all available resources in a heterogeneous architecture. We have demonstrated that one can easily implement a diverse set of graph algorithms in our framework by develo** five algorithms. Our experimental results show that PGAbB implementations achieve better or competitive performance compared to hand-optimized implementations. Based on our experiments on five graph algorithms and forty-four graphs, in the median, PGAbB achieves 1.6, 1.6, 5.7, 3.4, 4.5, and 2.4 times better performance than GAPBS, Galois, Ligra, LAGraph Galois-GPU, and Gunrock graph processing systems, respectively. △ Less

Submitted 9 September, 2022; originally announced September 2022.

arXiv:2204.02934 [pdf, other]

Parallel, Portable Algorithms for Distance-2 Maximal Independent Set and Graph Coarsening

Authors: Brian Kelley, Sivasankaran Rajamanickam

Abstract: Given a graph, finding the distance-2 maximal independent set (MIS-2) of the vertices is a problem that is useful in several contexts such as algebraic multigrid coarsening or multilevel graph partitioning. Such multilevel methods rely on finding the independent vertices so they can be used as seeds for aggregation in a multilevel scheme. We present a parallel MIS-2 algorithm to improve performanc… ▽ More Given a graph, finding the distance-2 maximal independent set (MIS-2) of the vertices is a problem that is useful in several contexts such as algebraic multigrid coarsening or multilevel graph partitioning. Such multilevel methods rely on finding the independent vertices so they can be used as seeds for aggregation in a multilevel scheme. We present a parallel MIS-2 algorithm to improve performance on modern accelerator hardware. This algorithm is implemented using the Kokkos programming model to enable performance portability. We demonstrate the portability of the algorithm and the performance on a variety of architectures (x86/ARM CPUs and NVIDIA/AMD GPUs). The resulting algorithm is also deterministic, producing an identical result for a given input across all of these platforms. The new MIS-2 implementation outperforms implementations in state of the art libraries like CUSP and ViennaCL by 3-8x while producing similar quality results. We further demonstrate the benefits of this approach by develo** parallel graph coarsening scheme for two different use cases. First, we develop an algebraic multigrid (AMG) aggregation scheme using parallel MIS-2 and demonstrate the benefits as opposed to previous approaches used in the MueLu multigrid package in Trilinos. We also describe an approach for implementing a parallel multicolor "cluster" Gauss-Seidel preconditioner using this MIS-2 coarsening, and demonstrate better performance with an efficient, parallel, multicolor Gauss-Seidel algorithm. △ Less

Submitted 6 April, 2022; originally announced April 2022.

Comments: Accepted for publication in IPDPS 2022

arXiv:2201.08916 [pdf, other]

Enabling Flexibility for Sparse Tensor Acceleration via Heterogeneity

Authors: Eric Qin, Raveesh Garg, Abhimanyu Bambhaniya, Michael Pellauer, Angshuman Parashar, Sivasankaran Rajamanickam, Cong Hao, Tushar Krishna

Abstract: Recently, numerous sparse hardware accelerators for Deep Neural Networks (DNNs), Graph Neural Networks (GNNs), and scientific computing applications have been proposed. A common characteristic among all of these accelerators is that they target tensor algebra (typically matrix multiplications); yet dozens of new accelerators are proposed for every new application. The motivation is that the size a… ▽ More Recently, numerous sparse hardware accelerators for Deep Neural Networks (DNNs), Graph Neural Networks (GNNs), and scientific computing applications have been proposed. A common characteristic among all of these accelerators is that they target tensor algebra (typically matrix multiplications); yet dozens of new accelerators are proposed for every new application. The motivation is that the size and sparsity of the workloads heavily influence which architecture is best for memory and computation efficiency. To satisfy the growing demand of efficient computations across a spectrum of workloads on large data centers, we propose deploying a flexible 'heterogeneous' accelerator, which contains many 'sub-accelerators' (smaller specialized accelerators) working together. To this end, we propose: (1) HARD TACO, a quick and productive C++ to RTL design flow to generate many types of sub-accelerators for sparse and dense computations for fair design-space exploration, (2) AESPA, a heterogeneous sparse accelerator design template constructed with the sub-accelerators generated from HARD TACO, and (3) a suite of scheduling strategies to map tensor kernels onto heterogeneous sparse accelerators with high efficiency and utilization. AESPA with optimized scheduling achieves 1.96X higher performance, and 7.9X better energy-delay product (EDP) than a Homogeneous EIE-like accelerator with our diverse workload suite. △ Less

Submitted 21 January, 2022; originally announced January 2022.

arXiv:2109.07419 [pdf, other]

Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators

Authors: Geonhwa Jeong, Gokcen Kestor, Prasanth Chatarasi, Angshuman Parashar, Po-An Tsai, Sivasankaran Rajamanickam, Roberto Gioiosa, Tushar Krishna

Abstract: To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these "domain-specific" accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. There are s… ▽ More To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these "domain-specific" accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. There are several challenges when designing new algorithms and map** approaches to execute the algorithms for a target problem on new hardware. Previous works have addressed these challenges individually. To address this challenge as a whole, in this work, we present a HW-SW co-design ecosystem for spatial accelerators called Union within the popular MLIR compiler infrastructure. Our framework allows exploring different algorithms and their map**s on several accelerator cost models. Union also includes a plug-and-play library of accelerator cost models and mappers which can easily be extended. The algorithms and accelerator cost models are connected via a novel map** abstraction that captures the map space of spatial accelerators which can be systematically pruned based on constraints from the hardware, workload, and mapper. We demonstrate the value of Union for the community with several case studies which examine offloading different tensor operations(CONV/GEMM/Tensor Contraction) on diverse accelerator architectures using different map** schemes. △ Less

Submitted 6 November, 2021; v1 submitted 15 September, 2021; originally announced September 2021.

Comments: This paper is accepted to PACT 2021

arXiv:2109.01232 [pdf, other]

A Study of Mixed Precision Strategies for GMRES on GPUs

Authors: Jennifer A. Loe, Christian A. Glusa, Ichitaro Yamazaki, Erik G. Boman, Sivasankaran Rajamanickam

Abstract: Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for mixed precision stra… ▽ More Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for mixed precision strategies at the linear algebra algorithms level if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of mixed precision strategies for accelerating this kernel on an NVIDIA V$100$ GPU with a Power 9 CPU. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when mixed precision GMRES will be effective and to choose parameters for a mixed precision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of mixed precision approaches and demonstrate even further improvements are possible by optimizing low-level kernels. △ Less

Submitted 2 September, 2021; originally announced September 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2105.07544

arXiv:2107.00075 [pdf, other]

Parallel Graph Coloring Algorithms for Distributed GPU Environments

Authors: Ian Bogle, Erik G Boman, Karen D Devine, Sivasankaran Rajamanickam, George M Slota

Abstract: Graph coloring is often used in parallelizing scientific computations that run in distributed and multi-GPU environments; it identifies sets of independent data that can be updated in parallel. Many algorithms exist for graph coloring on a single GPU or in distributed memory, but to the best of our knowledge, hybrid MPI+GPU algorithms have been unexplored until this work. We present several MPI+GP… ▽ More Graph coloring is often used in parallelizing scientific computations that run in distributed and multi-GPU environments; it identifies sets of independent data that can be updated in parallel. Many algorithms exist for graph coloring on a single GPU or in distributed memory, but to the best of our knowledge, hybrid MPI+GPU algorithms have been unexplored until this work. We present several MPI+GPU coloring approaches based on the distributed coloring algorithms of Gebremedhin et al. and the shared-memory algorithms of Deveci et al. . The on-node parallel coloring uses implementations in KokkosKernels, which provide parallelization for both multicore CPUs and GPUs. We further extend our approaches to compute distance-2 and partial distance-2 colorings, giving the first known distributed, multi-GPU algorithm for these problems. In addition, we propose a novel heuristic to reduce communication for recoloring in distributed graph coloring. Our experiments show that our approaches operate efficiently on inputs too large to fit on a single GPU and scale up to graphs with 76.7 billion edges running on 128 GPUs. △ Less

Submitted 30 June, 2021; originally announced July 2021.

Comments: Submitted to Parallel Computing

arXiv:2106.10499 [pdf, other]

Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication

Authors: Gordon E. Moon, Hyoukjun Kwon, Geonhwa Jeong, Prasanth Chatarasi, Sivasankaran Rajamanickam, Tushar Krishna

Abstract: There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The efficiency of these accelerators comes from employing optimized dataflow (i.e., spatial/temporal partitioning of data across the PEs and fine-grained scheduling) strat… ▽ More There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The efficiency of these accelerators comes from employing optimized dataflow (i.e., spatial/temporal partitioning of data across the PEs and fine-grained scheduling) strategies to optimize data reuse. The focus of this work is to evaluate these accelerator architectures using a tiled general matrix-matrix multiplication (GEMM) kernel. To do so, we develop a framework that finds optimized map**s (dataflow and tile sizes) for a tiled GEMM for a given spatial accelerator and workload combination, leveraging an analytical cost model for runtime and energy. Our evaluations over five spatial accelerators demonstrate that the tiled GEMM map**s systematically generated by our framework achieve high performance on various GEMM workloads and accelerators. △ Less

Submitted 19 June, 2021; originally announced June 2021.

arXiv:2105.07544 [pdf, other]

Experimental Evaluation of Multiprecision Strategies for GMRES on GPUs

Authors: Jennifer A. Loe, Christian A. Glusa, Ichitaro Yamazaki, Erik G. Boman, Sivasankaran Rajamanickam

Abstract: Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for multiprecision strat… ▽ More Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for multiprecision strategies at the linear algebra algorithms level if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of multiprecision strategies for accelerating this kernel on GPUs. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when multiprecision GMRES will be effective and to choose parameters for a multiprecision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of multiprecision approaches and demonstrate even further improvements are possible by optimizing low-level kernels. △ Less

Submitted 16 May, 2021; originally announced May 2021.

Comments: Accepted for publication in the IEEE IPDPS Accelerators and Hybrid Emerging Systems (AsHES) 11th Workshop, 2021

arXiv:2105.00578 [pdf, other]

Sphynx: a parallel multi-GPU graph partitioner for distributed-memory systems

Authors: Seher Acer, Erik G Boman, Christian A Glusa, Sivasankaran Rajamanickam

Abstract: Graph partitioning has been an important tool to partition the work among several processors to minimize the communication cost and balance the workload. While accelerator-based supercomputers are emerging to be the standard, the use of graph partitioning becomes even more important as applications are rapidly moving to these architectures. However, there is no distributed-memory parallel, multi-G… ▽ More Graph partitioning has been an important tool to partition the work among several processors to minimize the communication cost and balance the workload. While accelerator-based supercomputers are emerging to be the standard, the use of graph partitioning becomes even more important as applications are rapidly moving to these architectures. However, there is no distributed-memory parallel, multi-GPU graph partitioner available for applications. We developed a spectral graph partitioner, Sphynx, using the portable, accelerator-friendly stack of the Trilinos framework. In Sphynx, we allow using different preconditioners and exploit their unique advantages. We use Sphynx to systematically evaluate the various algorithmic choices in spectral partitioning with a focus on the GPU performance. We perform those evaluations on two distinct classes of graphs: regular (such as meshes, matrices from finite element methods) and irregular (such as social networks and web graphs), and show that different settings and preconditioners are needed for these graph classes. The experimental results on the Summit supercomputer show that Sphynx is the fastest alternative on irregular graphs in an application-friendly setting and obtains a partitioning quality close to ParMETIS on regular graphs. When compared to nvGRAPH on a single GPU, Sphynx is faster and obtains better balance and better quality partitions. Sphynx provides a good and robust partitioning method across a wide range of graphs for applications looking for a GPU-based partitioner. △ Less

Submitted 2 May, 2021; originally announced May 2021.

Comments: To appear in Parallel Computing

Report number: SAND2021-0352-O MSC Class: 68W10

arXiv:2104.01196 [pdf, other]

Two-Stage Gauss--Seidel Preconditioners and Smoothers for Krylov Solvers on a GPU cluster

Authors: Luc Berger-Vergiat, Brian Kelley, Sivasankaran Rajamanickam, Jonathan Hu, Katarzyna Swirydowicz, Paul Mullowney, Stephen Thomas, Ichitaro Yamazaki

Abstract: Gauss-Seidel (GS) relaxation is often employed as a preconditioner for a Krylov solver or as a smoother for Algebraic Multigrid (AMG). However, the requisite sparse triangular solve is difficult to parallelize on many-core architectures such as graphics processing units (GPUs). In the present study, the performance of the traditional GS relaxation based on a triangular solve is compared with two-s… ▽ More Gauss-Seidel (GS) relaxation is often employed as a preconditioner for a Krylov solver or as a smoother for Algebraic Multigrid (AMG). However, the requisite sparse triangular solve is difficult to parallelize on many-core architectures such as graphics processing units (GPUs). In the present study, the performance of the traditional GS relaxation based on a triangular solve is compared with two-stage variants, replacing the direct triangular solve with a fixed number of inner Jacobi-Richardson (JR) iterations. When a small number of inner iterations is sufficient to maintain the Krylov convergence rate, the two-stage GS (GS2) often outperforms the traditional algorithm on many-core architectures. We also compare GS2 with JR. When they perform the same number of flops for SpMV (e.g. three JR sweeps compared to two GS sweeps with one inner JR sweep), the GS2 iterations, and the Krylov solver preconditioned with GS2, may converge faster than the JR iterations. Moreover, for some problems (e.g. elasticity), it was found that JR may diverge with a dam** factor of one, whereas two-stage GS may improve the convergence with more inner iterations. Finally, to study the performance of the two-stage smoother and preconditioner for a practical problem, %(e.g. using tuned dam** factors), these were applied to incompressible fluid flow simulations on GPUs. △ Less

Submitted 24 April, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

arXiv:2103.11991 [pdf, other]

Kokkos Kernels: Performance Portable Sparse/Dense Linear Algebra and Graph Kernels

Authors: Sivasankaran Rajamanickam, Seher Acer, Luc Berger-Vergiat, Vinh Dang, Nathan Ellingwood, Evan Harvey, Brian Kelley, Christian R. Trott, Jeremiah Wilke, Ichitaro Yamazaki

Abstract: As hardware architectures are evolving in the push towards exascale, develo** Computational Science and Engineering (CSE) applications depend on performance portable approaches for sustainable software development. This paper describes one aspect of performance portability with respect to develo** a portable library of kernels that serve the needs of several CSE applications and software frame… ▽ More As hardware architectures are evolving in the push towards exascale, develo** Computational Science and Engineering (CSE) applications depend on performance portable approaches for sustainable software development. This paper describes one aspect of performance portability with respect to develo** a portable library of kernels that serve the needs of several CSE applications and software frameworks. We describe Kokkos Kernels, a library of kernels for sparse linear algebra, dense linear algebra and graph kernels. We describe the design principles of such a library and demonstrate portable performance of the library using some selected kernels. Specifically, we demonstrate the performance of four sparse kernels, three dense batched kernels, two graph kernels and one team level algorithm. △ Less

Submitted 22 March, 2021; originally announced March 2021.

Report number: SAND2021-3421 O

arXiv:2103.10484 [pdf, other]

Concentric Spherical GNN for 3D Representation Learning

Authors: James Fox, Bo Zhao, Sivasankaran Rajamanickam, Rampi Ramprasad, Le Song

Abstract: Learning 3D representations that generalize well to arbitrarily oriented inputs is a challenge of practical importance in applications varying from computer vision to physics and chemistry. We propose a novel multi-resolution convolutional architecture for learning over concentric spherical feature maps, of which the single sphere representation is a special case. Our hierarchical architecture is… ▽ More Learning 3D representations that generalize well to arbitrarily oriented inputs is a challenge of practical importance in applications varying from computer vision to physics and chemistry. We propose a novel multi-resolution convolutional architecture for learning over concentric spherical feature maps, of which the single sphere representation is a special case. Our hierarchical architecture is based on alternatively learning to incorporate both intra-sphere and inter-sphere information. We show the applicability of our method for two different types of 3D inputs, mesh objects, which can be regularly sampled, and point clouds, which are irregularly distributed. We also propose an efficient map** of point clouds to concentric spherical images, thereby bridging spherical convolutions on grids with general point clouds. We demonstrate the effectiveness of our approach in improving state-of-the-art performance on 3D classification tasks with rotated data. △ Less

Submitted 18 March, 2021; originally announced March 2021.

Comments: This paper has been submitted for conference review

arXiv:2103.10452 [pdf]

Extending Sparse Tensor Accelerators to Support Multiple Compression Formats

Authors: Eric Qin, Geonhwa Jeong, William Won, Sheng-Chun Kao, Hyoukjun Kwon, Sudarshan Srinivasan, Dipankar Das, Gordon E. Moon, Sivasankaran Rajamanickam, Tushar Krishna

Abstract: Sparsity, which occurs in both scientific applications and Deep Learning (DL) models, has been a key target of optimization within recent ASIC accelerators due to the potential memory and compute savings. These applications use data stored in a variety of compression formats. We demonstrate that both the compactness of different compression formats and the compute efficiency of the algorithms enab… ▽ More Sparsity, which occurs in both scientific applications and Deep Learning (DL) models, has been a key target of optimization within recent ASIC accelerators due to the potential memory and compute savings. These applications use data stored in a variety of compression formats. We demonstrate that both the compactness of different compression formats and the compute efficiency of the algorithms enabled by them vary across tensor dimensions and amount of sparsity. Since DL and scientific workloads span across all sparsity regions, there can be numerous format combinations for optimizing memory and compute efficiency. Unfortunately, many proposed accelerators operate on one or two fixed format combinations. This work proposes hardware extensions to accelerators for supporting numerous format combinations seamlessly and demonstrates ~4X speedup over performing format conversions in software. △ Less

Submitted 18 March, 2021; originally announced March 2021.

Comments: Accepted for publication at the 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)

arXiv:2103.07977 [pdf, other]

Understanding the Design-Space of Sparse/Dense Multiphase GNN dataflows on Spatial Accelerators

Authors: Raveesh Garg, Eric Qin, Francisco Muñoz-Martínez, Robert Guirado, Akshay Jain, Sergi Abadal, José L. Abellán, Manuel E. Acacio, Eduard Alarcón, Sivasankaran Rajamanickam, Tushar Krishna

Abstract: Graph Neural Networks (GNNs) have garnered a lot of recent interest because of their success in learning representations from graph-structured data across several critical applications in cloud and HPC. Owing to their unique compute and memory characteristics that come from an interplay between dense and sparse phases of computations, the emergence of reconfigurable dataflow (aka spatial) accelera… ▽ More Graph Neural Networks (GNNs) have garnered a lot of recent interest because of their success in learning representations from graph-structured data across several critical applications in cloud and HPC. Owing to their unique compute and memory characteristics that come from an interplay between dense and sparse phases of computations, the emergence of reconfigurable dataflow (aka spatial) accelerators offers promise for acceleration by map** optimized dataflows (i.e., computation order and parallelism) for both phases. The goal of this work is to characterize and understand the design-space of dataflow choices for running GNNs on spatial accelerators in order for mappers or design-space exploration tools to optimize the dataflow based on the workload. Specifically, we propose a taxonomy to describe all possible choices for map** the dense and sparse phases of GNN inference, spatially and temporally over a spatial accelerator, capturing both the intra-phase dataflow and the inter-phase (pipelined) dataflow. Using this taxonomy, we do deep-dives into the cost and benefits of several dataflows and perform case studies on implications of hardware parameters for dataflows and value of flexibility to support pipelined execution. △ Less

Submitted 6 March, 2022; v1 submitted 14 March, 2021; originally announced March 2021.

Comments: Accepted for publication at the 36th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2022)

arXiv:2012.12871 [pdf, other]

A Multimodal Framework for the Detection of Hateful Memes

Authors: Phillip Lippe, Nithin Holla, Shantanu Chandra, Santhosh Rajamanickam, Georgios Antoniou, Ekaterina Shutova, Helen Yannakoudakis

Abstract: An increasingly common expression of online hate speech is multimodal in nature and comes in the form of memes. Designing systems to automatically detect hateful content is of paramount importance if we are to mitigate its undesirable effects on the society at large. The detection of multimodal hate speech is an intrinsically difficult and open problem: memes convey a message using both images and… ▽ More An increasingly common expression of online hate speech is multimodal in nature and comes in the form of memes. Designing systems to automatically detect hateful content is of paramount importance if we are to mitigate its undesirable effects on the society at large. The detection of multimodal hate speech is an intrinsically difficult and open problem: memes convey a message using both images and text and, hence, require multimodal reasoning and joint visual and language understanding. In this work, we seek to advance this line of research and develop a multimodal framework for the detection of hateful memes. We improve the performance of existing multimodal approaches beyond simple fine-tuning and, among others, show the effectiveness of upsampling of contrastive examples to encourage multimodality and ensemble learning based on cross-validation to improve robustness. We furthermore analyze model misclassifications and discuss a number of hypothesis-driven augmentations and their effects on performance, presenting important implications for future research in the field. Our best approach comprises an ensemble of UNITER-based models and achieves an AUROC score of 80.53, placing us 4th on phase 2 of the 2020 Hateful Memes Challenge organized by Facebook. △ Less

Submitted 24 December, 2020; v1 submitted 23 December, 2020; originally announced December 2020.

Journal ref: PMLR 133:344-360, 2021

arXiv:2010.04905 [pdf, other]

doi 10.1103/PhysRevB.104.035120

Accelerating Finite-temperature Kohn-Sham Density Functional Theory with Deep Neural Networks

Authors: J. Austin Ellis, Lenz Fiedler, Gabriel A. Popoola, Normand A. Modine, J. Adam Stephens, Aidan P. Thompson, Attila Cangi, Sivasankaran Rajamanickam

Abstract: We present a numerical modeling workflow based on machine learning (ML) which reproduces the the total energies produced by Kohn-Sham density functional theory (DFT) at finite electronic temperature to within chemical accuracy at negligible computational cost. Based on deep neural networks, our workflow yields the local density of states (LDOS) for a given atomic configuration. From the LDOS, spat… ▽ More We present a numerical modeling workflow based on machine learning (ML) which reproduces the the total energies produced by Kohn-Sham density functional theory (DFT) at finite electronic temperature to within chemical accuracy at negligible computational cost. Based on deep neural networks, our workflow yields the local density of states (LDOS) for a given atomic configuration. From the LDOS, spatially-resolved, energy-resolved, and integrated quantities can be calculated, including the DFT total free energy, which serves as the Born-Oppenheimer potential energy surface for the atoms. We demonstrate the efficacy of this approach for both solid and liquid metals and compare results between independent and unified machine-learning models for solid and liquid aluminum. Our machine-learning density functional theory framework opens up the path towards multiscale materials modeling for matter under ambient and extreme conditions at a computational scale and cost that is unattainable with current algorithms. △ Less

Submitted 9 July, 2021; v1 submitted 10 October, 2020; originally announced October 2020.

Journal ref: Phys. Rev. B 104, 035120 (2021)

arXiv:2009.12457 [pdf, other]

A Block-Based Triangle Counting Algorithm on Heterogeneous Environments

Authors: Abdurrahman Yaşar, Sivasankaran Rajamanickam, Jonathan Berry, Ümit V. Çatalyürek

Abstract: Triangle counting is a fundamental building block in graph algorithms. In this paper, we propose a block-based triangle counting algorithm to reduce data movement during both sequential and parallel execution. Our block-based formulation makes the algorithm naturally suitable for heterogeneous architectures. The problem of partitioning the adjacency matrix of a graph is well-studied. Our task deco… ▽ More Triangle counting is a fundamental building block in graph algorithms. In this paper, we propose a block-based triangle counting algorithm to reduce data movement during both sequential and parallel execution. Our block-based formulation makes the algorithm naturally suitable for heterogeneous architectures. The problem of partitioning the adjacency matrix of a graph is well-studied. Our task decomposition goes one step further: it partitions the set of triangles in the graph. By streaming these small tasks to compute resources, we can solve problems that do not fit on a device. We demonstrate the effectiveness of our approach by providing an implementation on a compute node with multiple sockets, cores and GPUs. The current state-of-the-art in triangle enumeration processes the Friendster graph in 2.1 seconds, not including data copy time between CPU and GPU. Using that metric, our approach is 20 percent faster. When copy times are included, our algorithm takes 3.2 seconds. This is 5.6 times faster than the fastest published CPU-only time. △ Less

Submitted 25 September, 2020; originally announced September 2020.

arXiv:2007.06674 [pdf, other]

A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic

Authors: Ahmad Abdelfattah, Hartwig Anzt, Erik G. Boman, Erin Carson, Terry Cojean, Jack Dongarra, Mark Gates, Thomas Grützmacher, Nicholas J. Higham, Sherry Li, Neil Lindquist, Yang Liu, Jennifer Loe, Piotr Luszczek, Pratik Nayak, Sri Pranesh, Siva Rajamanickam, Tobias Ribizel, Barry Smith, Kasia Swirydowicz, Stephen Thomas, Stanimire Tomov, Yaohung M. Tsai, Ichitaro Yamazaki, Urike Meier Yang

Abstract: Within the past years, hardware vendors have started designing low precision special function units in response to the demand of the Machine Learning community and their demand for high compute power in low precision formats. Also the server-line products are increasingly featuring low-precision special function units, such as the NVIDIA tensor cores in ORNL's Summit supercomputer providing more t… ▽ More Within the past years, hardware vendors have started designing low precision special function units in response to the demand of the Machine Learning community and their demand for high compute power in low precision formats. Also the server-line products are increasingly featuring low-precision special function units, such as the NVIDIA tensor cores in ORNL's Summit supercomputer providing more than an order of magnitude higher performance than what is available in IEEE double precision. At the same time, the gap between the compute power on the one hand and the memory bandwidth on the other hand keeps increasing, making data access and communication prohibitively expensive compared to arithmetic operations. To start the multiprecision focus effort, we survey the numerical linear algebra community and summarize all existing multiprecision knowledge, expertise, and software capabilities in this landscape analysis report. We also include current efforts and preliminary results that may not yet be considered "mature technology," but have the potential to grow into production quality within the multiprecision focus effort. As we expect the reader to be familiar with the basics of numerical linear algebra, we refrain from providing a detailed background on the algorithms themselves but focus on how mixed- and multiprecision technology can help improving the performance of these methods and present highlights of application significantly outperforming the traditional fixed precision methods. △ Less

Submitted 13 July, 2020; originally announced July 2020.

Comments: Technical report as a part of the Exascale computing project (ECP)

ACM Class: G.1.3; G.4

arXiv:2005.14028 [pdf, other]

Joint Modelling of Emotion and Abusive Language Detection

Authors: Santhosh Rajamanickam, Pushkar Mishra, Helen Yannakoudakis, Ekaterina Shutova

Abstract: The rise of online communication platforms has been accompanied by some undesirable effects, such as the proliferation of aggressive and abusive behaviour online. Aiming to tackle this problem, the natural language processing (NLP) community has experimented with a range of techniques for abuse detection. While achieving substantial success, these methods have so far only focused on modelling the… ▽ More The rise of online communication platforms has been accompanied by some undesirable effects, such as the proliferation of aggressive and abusive behaviour online. Aiming to tackle this problem, the natural language processing (NLP) community has experimented with a range of techniques for abuse detection. While achieving substantial success, these methods have so far only focused on modelling the linguistic properties of the comments and the online communities of users, disregarding the emotional state of the users and how this might affect their language. The latter is, however, inextricably linked to abusive behaviour. In this paper, we present the first joint model of emotion and abusive language detection, experimenting in a multi-task learning framework that allows one task to inform the other. Our results demonstrate that incorporating affective features leads to significant improvements in abuse detection performance across datasets. △ Less

Submitted 28 May, 2020; originally announced May 2020.

Comments: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

arXiv:1912.10206 [pdf, other]

How Robust Are Graph Neural Networks to Structural Noise?

Authors: James Fox, Sivasankaran Rajamanickam

Abstract: Graph neural networks (GNNs) are an emerging model for learning graph embeddings and making predictions on graph structured data. However, robustness of graph neural networks is not yet well-understood. In this work, we focus on node structural identity predictions, where a representative GNN model is able to achieve near-perfect accuracy. We also show that the same GNN model is not robust to addi… ▽ More Graph neural networks (GNNs) are an emerging model for learning graph embeddings and making predictions on graph structured data. However, robustness of graph neural networks is not yet well-understood. In this work, we focus on node structural identity predictions, where a representative GNN model is able to achieve near-perfect accuracy. We also show that the same GNN model is not robust to addition of structural noise, through a controlled dataset and set of experiments. Finally, we show that under the right conditions, graph-augmented training is capable of significantly improving robustness to structural noise. △ Less

Submitted 21 December, 2019; originally announced December 2019.

Comments: Accepted workshop paper at Deep Learning on Graphs: Methodologies and Applications (DLGMA'20)

arXiv:1808.08172 [pdf, other]

Asynchronous One-Level and Two-Level Domain Decomposition Solvers

Authors: Christian Glusa, Paritosh Ramanan, Erik G. Boman, Edmond Chow, Sivasankaran Rajamanickam

Abstract: Parallel implementations of linear iterative solvers generally alternate between phases of data exchange and phases of local computation. Increasingly large problem sizes on more heterogeneous systems make load balancing and network layout very challenging tasks. In particular, global communication patterns such as inner products become increasingly limiting at scale. We explore the use of asynchr… ▽ More Parallel implementations of linear iterative solvers generally alternate between phases of data exchange and phases of local computation. Increasingly large problem sizes on more heterogeneous systems make load balancing and network layout very challenging tasks. In particular, global communication patterns such as inner products become increasingly limiting at scale. We explore the use of asynchronous communication based on one-sided MPI primitives in a multitude of domain decomposition solvers. In particular, a scalable asynchronous two-level method is presented. We discuss practical issues encountered in the development of a scalable solver and show experimental results obtained on state-of-the-art supercomputer systems that illustrate the benefits of asynchronous solvers in load balanced as well as load imbalanced scenarios. Using the novel method, we can observe speed-ups of up to 4x over its classical synchronous equivalent. △ Less

Submitted 10 August, 2020; v1 submitted 24 August, 2018; originally announced August 2018.

MSC Class: 68W10; 65Y05; 68W15; 65N55

arXiv:1804.09798 [pdf, other]

Geometric Partitioning and Ordering Strategies for Task Map** on Parallel Computers

Authors: Mehmet Deveci, Karen D. Devine, Kevin Pedretti, Mark A. Taylor, Sivasankaran Rajamanickam, Umit V. Catalyurek

Abstract: We present a new method for map** applications' MPI tasks to cores of a parallel computer such that applications' communication time is reduced. We address the case of sparse node allocation, where the nodes assigned to a job are not necessarily located in a contiguous block nor within close proximity to each other in the network, although our methods generalize to contiguous allocations as well… ▽ More We present a new method for map** applications' MPI tasks to cores of a parallel computer such that applications' communication time is reduced. We address the case of sparse node allocation, where the nodes assigned to a job are not necessarily located in a contiguous block nor within close proximity to each other in the network, although our methods generalize to contiguous allocations as well. The goal is to assign tasks to cores so that interdependent tasks are performed by "nearby" cores, thus lowering the distance messages must travel, the amount of congestion in the network, and the overall cost of communication. Our new method applies a geometric partitioning algorithm to both the tasks and the processors, and assigns task parts to the corresponding processor parts. We also present a number of algorithmic optimizations that exploit specific features of the network or application. We show that, for the structured finite difference mini-application MiniGhost, our map** methods reduced communication time up to 75% relative to MiniGhost's default map** on 128K cores of a Cray XK7 with sparse allocation. For the atmospheric modeling code E3SM/HOMME, our methods reduced communication time up to 31% on 32K cores of an IBM BlueGene/Q with contiguous allocation. △ Less

Submitted 25 April, 2018; originally announced April 2018.

Report number: SAND2018-4335R

arXiv:1804.00695 [pdf, other]

Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures : Algorithms and Experiments

Authors: Mehmet Deveci, Simon D. Hammond, Michael M. Wolf, Sivasankaran Rajamanickam

Abstract: Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. So called multi-level memories offer differing characteristics for each memory component including variation in bandwidth, latency and capacity. This paper investigates the performance of sparse matrix multiplication kernels on two leading high-performance computing architectures… ▽ More Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. So called multi-level memories offer differing characteristics for each memory component including variation in bandwidth, latency and capacity. This paper investigates the performance of sparse matrix multiplication kernels on two leading high-performance computing architectures -- Intel's Knights Landing processor and NVIDIA's Pascal GPU. We describe a data placement method and a chunking-based algorithm for our kernels that exploits the existence of the multiple memory spaces in each hardware platform. We evaluate the performance of these methods w.r.t. standard algorithms using the auto-caching mechanisms. Our results show that standard algorithms that exploit cache reuse performed as well as multi-memory-aware algorithms for architectures such as KNLs where the memory subsystems have similar latencies. However, for architectures such as GPUs where memory subsystems differ significantly in both bandwidth and latency, multi-memory-aware methods are crucial for good performance. In addition, our new approaches permit the user to run problems that require larger capacities than the fastest memory of each compute node without depending on the software-managed cache mechanisms. △ Less

Submitted 2 April, 2018; originally announced April 2018.

Report number: SAND2018-3428 R

arXiv:1801.03065 [pdf, other]

Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures

Authors: Mehmet Deveci, Christian Trott, Sivasankaran Rajamanickam

Abstract: Sparse Matrix-Matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architect… ▽ More Sparse Matrix-Matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depend on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved. △ Less

Submitted 9 January, 2018; originally announced January 2018.

Report number: SAND2018-0186 R

arXiv:1712.07297 [pdf, other]

A distributed-memory hierarchical solver for general sparse linear systems

Authors: Chao Chen, Hadi Pouransari, Sivasankaran Rajamanickam, Erik G. Boman, Eric Darve

Abstract: We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct s… ▽ More We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by every processor. We present various numerical results to demonstrate the versatility and scalability of the parallel algorithm. △ Less

Submitted 19 December, 2017; originally announced December 2017.

MSC Class: 65F50

arXiv:1701.00503 [pdf, other]

Distributed Graph Layout for Scalable Small-world Network Analysis

Authors: George M Slota, Sivasankaran Rajamanickam, Kamesh Madduri

Abstract: The in-memory graph layout or organization has a considerable impact on the time and energy efficiency of distributed memory graph computations. It affects memory locality, inter-task load balance, communication time, and overall memory utilization. Graph layout could refer to partitioning or replication of vertex and edge arrays, selective replication of data structures that hold meta-data, and r… ▽ More The in-memory graph layout or organization has a considerable impact on the time and energy efficiency of distributed memory graph computations. It affects memory locality, inter-task load balance, communication time, and overall memory utilization. Graph layout could refer to partitioning or replication of vertex and edge arrays, selective replication of data structures that hold meta-data, and reordering vertex and edge identifiers. In this work, we present DGL, a fast, parallel, and memory-efficient distributed graph layout strategy that is specifically designed for small-world networks (low-diameter graphs with skewed vertex degree distributions). Label propagation-based partitioning and a scalable BFS-based ordering are the main steps in the layout strategy. We show that the DGL layout can significantly improve end-to-end performance of five challenging graph analytics workloads: PageRank, a parallel subgraph enumeration program, tuned implementations of breadth-first search and single-source shortest paths, and RDF3X-MPI, a distributed SPARQL query processing engine. Using these benchmarks, we additionally offer a comprehensive analysis on how graph layout affects the performance of graph analytics with variable computation and communication characteristics. △ Less

Submitted 2 January, 2017; originally announced January 2017.

arXiv:1610.07220 [pdf, other]

Partitioning Trillion-edge Graphs in Minutes

Authors: George M Slota, Sivasankaran Rajamanickam, Karen Devine, Kamesh Madduri

Abstract: We introduce XtraPuLP, a new distributed-memory graph partitioner designed to process trillion-edge graphs. XtraPuLP is based on the scalable label propagation community detection technique, which has been demonstrated as a viable means to produce high quality partitions with minimal computation time. On a collection of large sparse graphs, we show that XtraPuLP partitioning quality is comparable… ▽ More We introduce XtraPuLP, a new distributed-memory graph partitioner designed to process trillion-edge graphs. XtraPuLP is based on the scalable label propagation community detection technique, which has been demonstrated as a viable means to produce high quality partitions with minimal computation time. On a collection of large sparse graphs, we show that XtraPuLP partitioning quality is comparable to state-of-the-art partitioning methods. We also demonstrate that XtraPuLP can produce partitions of real-world graphs with billion+ vertices in minutes. Further, we show that using XtraPuLP partitions for distributed-memory graph analytics leads to significant end-to-end execution time reduction. △ Less

Submitted 23 October, 2016; originally announced October 2016.

arXiv:1601.05871 [pdf, other]

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

Authors: Kyungjoo Kim, Sivasankaran Rajamanickam, George Stelle, H. Carter Edwards, Stephen L. Olivier

Abstract: We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-by-blocks approach induces a task graph for the factorization. These tasks are inter-related to each other through their data dependences in t… ▽ More We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-by-blocks approach induces a task graph for the factorization. These tasks are inter-related to each other through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms i.e., Kokkos. A performance evaluation is presented on both Intel Sandybridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and 19.2x speedup over serial Cholesky performance which does not carry tasking overhead using 56 threads on the Intel Xeon Phi processor for sparse matrices arising from various application problems. △ Less

Submitted 21 January, 2016; originally announced January 2016.

Comments: 25 pages

Report number: SAND2016-0637 R MSC Class: 68W10

arXiv:1601.05725 [pdf, other]

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

Authors: Joshua Dennis Booth, Sivasankaran Rajamanickam, Heidi K. Thornquist

Abstract: Scalable sparse LU factorization is critical for efficient numerical simulation of circuits and electrical power grids. In this work, we present a new scalable sparse direct solver called Basker. Basker introduces a new algorithm to parallelize the Gilbert-Peierls algorithm for sparse LU factorization. As architectures evolve, there exists a need for algorithms that are hierarchical in nature to m… ▽ More Scalable sparse LU factorization is critical for efficient numerical simulation of circuits and electrical power grids. In this work, we present a new scalable sparse direct solver called Basker. Basker introduces a new algorithm to parallelize the Gilbert-Peierls algorithm for sparse LU factorization. As architectures evolve, there exists a need for algorithms that are hierarchical in nature to match the hierarchy in thread teams, individual threads, and vector level parallelism. Basker is designed to map well to this hierarchy in architectures. There is also a need for data layouts to match multiple levels of hierarchy in memory. Basker uses a two-dimensional hierarchical structure of sparse matrices that maps to the hierarchy in the memory architectures and to the hierarchy in parallelism. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulations. Basker achieves a geometric mean speedup of 5.91x on CPU (16 cores) and 7.4x on Xeon Phi (32 cores) relative to KLU. Basker outperforms Intel MKL Pardiso (PMKL) by as much as 53x on CPU (16 cores) and 13.3x on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides 5.4x speedup on a challenging matrix sequence taken from an actual Xyce simulation. △ Less

Submitted 21 January, 2016; originally announced January 2016.

arXiv:1511.03703 [pdf, other]

Embedded Ensemble Propagation for Improving Performance, Portability and Scalability of Uncertainty Quantification on Emerging Computational Architectures

Authors: E. Phipps, M. D'Elia, H. C. Edwards, M. Hoemmen, J. Hu, S. Rajamanickam

Abstract: Quantifying simulation uncertainties is a critical component of rigorous predictive simulation. A key component of this is forward propagation of uncertainties in simulation input data to output quantities of interest. Typical approaches involve repeated sampling of the simulation over the uncertain input data, and can require numerous samples when accurately propagating uncertainties from large n… ▽ More Quantifying simulation uncertainties is a critical component of rigorous predictive simulation. A key component of this is forward propagation of uncertainties in simulation input data to output quantities of interest. Typical approaches involve repeated sampling of the simulation over the uncertain input data, and can require numerous samples when accurately propagating uncertainties from large numbers of sources. Often simulation processes from sample to sample are similar and much of the data generated from each sample evaluation could be reused. We explore a new method for implementing sampling methods that simultaneously propagates groups of samples together in an embedded fashion, which we call embedded ensemble propagation. We show how this approach takes advantage of properties of modern computer architectures to improve performance by enabling reuse between samples, reducing memory bandwidth requirements, improving memory access patterns, improving opportunities for fine-grained parallelization, and reducing communication costs. We describe a software technique for implementing embedded ensemble propagation based on the use of C++ templates and describe its integration with various scientific computing libraries within Trilinos. We demonstrate improved performance, portability and scalability for the approach applied to the simulation of partial differential equations on a variety of CPU, GPU, and accelerator architectures, including up to 131,072 cores on a Cray XK7 (Titan). △ Less

Submitted 11 November, 2015; originally announced November 2015.

Report number: SAND2015-9921 J

Showing 1–34 of 34 results for author: Rajamanickam, S