Search | arXiv e-print repository

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Authors: Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

Abstract: Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance i… ▽ More Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Built on top of our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling by incorporating (1) data-distribution-aware performance models for embedding table lookup, and (2) data movement prediction of communication collectives, into our upgraded performance modeling pipeline equipped with inter-and intra-rank synchronization for ML workloads trained on multi-GPU platforms. Beyond accurately predicting the per-iteration training time of DLRM models with random configurations with a geomean error of 5.21% on two multi-GPU platforms, our prediction pipeline generalizes well to other types of ML workloads, such as Transformer-based NLP models with a geomean error of 3.00%. Moreover, even without actually running ML workloads like DLRMs on the hardware, it is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration (with a success rate of 85%). △ Less

Submitted 27 April, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

Comments: 12 pages, 11 figures, 4 tables

arXiv:2404.11591 [pdf, other]

The EDGE Language: Extended General Einsums for Graph Algorithms

Authors: Toluwanimi O. Odemuyiwa, Joel S. Emer, John D. Owens

Abstract: In this work, we propose a unified abstraction for graph algorithms: the Extended General Einsums language, or EDGE. The EDGE language expresses graph algorithms in the language of tensor algebra, providing a rigorous, succinct, and expressive mathematical framework. EDGE leverages two ideas: (1) the well-known foundations provided by the graph-matrix duality, where a graph is simply a 2D tensor,… ▽ More In this work, we propose a unified abstraction for graph algorithms: the Extended General Einsums language, or EDGE. The EDGE language expresses graph algorithms in the language of tensor algebra, providing a rigorous, succinct, and expressive mathematical framework. EDGE leverages two ideas: (1) the well-known foundations provided by the graph-matrix duality, where a graph is simply a 2D tensor, and (2) the power and expressivity of Einsum notation in the tensor algebra world. In this work, we describe our design goals for EDGE and walk through the extensions we add to Einsums to support more complex operations common in graph algorithms. Additionally, we provide a few examples of how to express graph algorithms in our proposed notation. We hope that a single, mathematical notation for graph algorithms will (1) allow researchers to more easily compare different algorithms and different implementations of a graph algorithm; (2) enable developers to factor complexity by separating the concerns of what to compute (described with the extended Einsum notation) from the lower level details of how to compute; and (3) enable the discovery of different algorithmic variants of a problem through algebraic manipulations and transformations on a given EDGE expression. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Comments: 79 pages, 14 figures

arXiv:2310.00496 [pdf, other]

The Sparsity Roofline: Understanding the Hardware Limits of Sparse Neural Networks

Authors: Cameron Shinn, Collin McCarthy, Saurav Muralidharan, Muhammad Osama, John D. Owens

Abstract: We introduce the Sparsity Roofline, a visual performance model for evaluating sparsity in neural networks. The Sparsity Roofline jointly models network accuracy, sparsity, and theoretical inference speedup. Our approach does not require implementing and benchmarking optimized kernels, and the theoretical speedup becomes equal to the actual speedup when the corresponding dense and sparse kernels ar… ▽ More We introduce the Sparsity Roofline, a visual performance model for evaluating sparsity in neural networks. The Sparsity Roofline jointly models network accuracy, sparsity, and theoretical inference speedup. Our approach does not require implementing and benchmarking optimized kernels, and the theoretical speedup becomes equal to the actual speedup when the corresponding dense and sparse kernels are well-optimized. We achieve this through a novel analytical model for predicting sparse network performance, and validate the predicted speedup using several real-world computer vision architectures pruned across a range of sparsity patterns and degrees. We demonstrate the utility and ease-of-use of our model through two case studies: (1) we show how machine learning researchers can predict the performance of unimplemented or unoptimized block-structured sparsity patterns, and (2) we show how hardware designers can predict the performance implications of new sparsity patterns and sparse data formats in hardware. In both scenarios, the Sparsity Roofline helps performance experts identify sparsity regimes with the highest performance potential. △ Less

Submitted 6 November, 2023; v1 submitted 30 September, 2023; originally announced October 2023.

arXiv:2306.10410 [pdf, other]

BOBA: A Parallel Lightweight Graph Reordering Algorithm with Heavyweight Implications

Authors: Matthew Drescher, Muhammad A. Awad, Serban D. Porumbescu, John D. Owens

Abstract: We describe a simple parallel-friendly lightweight graph reordering algorithm for COO graphs (edge lists). Our ``Batched Order By Attachment'' (BOBA) algorithm is linear in the number of edges in terms of reads and linear in the number of vertices for writes through to main memory. It is highly parallelizable on GPUs\@. We show that, compared to a randomized baseline, the ordering produced gives… ▽ More We describe a simple parallel-friendly lightweight graph reordering algorithm for COO graphs (edge lists). Our ``Batched Order By Attachment'' (BOBA) algorithm is linear in the number of edges in terms of reads and linear in the number of vertices for writes through to main memory. It is highly parallelizable on GPUs\@. We show that, compared to a randomized baseline, the ordering produced gives improved locality of reference in sparse matrix-vector multiplication (SpMV) as well as other graph algorithms. Moreover, it can substantially speed up the conversion from a COO representation to the compressed format CSR, a very common workflow. Thus, it can give \emph{end-to-end} speedups even in SpMV\@. Unlike other lightweight approaches, this reordering does not rely on explicitly knowing the degrees of the vertices, and indeed its runtime is comparable to that of computing degrees. Instead, it uses the structure and edge distribution inherent in the input edge list, making it a candidate for default use in a pragmatic graph creation pipeline. This algorithm is suitable for road-type networks as well as scale-free. It improves cache locality on both CPUs and GPUs, achieving hit rates similar to the heavyweight techniques (e.g., for SpMV, 7--52\% and 11--67\% in the L1 and L2 caches, respectively). Compared to randomly labeled graphs, BOBA-reordered graphs achieve end-to-end speedups of up to 3.45. The reordering time is approximately one order of magnitude faster than existing lightweight techniques and up to 2.5 orders of magnitude faster than heavyweight techniques. △ Less

Submitted 21 June, 2023; v1 submitted 17 June, 2023; originally announced June 2023.

arXiv:2301.04792 [pdf, other]

doi 10.1145/3572848.3577434

A Programming Model for GPU Load Balancing

Authors: Muhammad Osama, Serban D. Porumbescu, John D. Owens

Abstract: We propose a GPU fine-grained load-balancing abstraction that decouples load balancing from work processing and aims to support both static and dynamic schedules with a programmable interface to implement new load-balancing schedules. Prior to our work, the only way to unleash the GPU's potential on irregular problems has been to workload-balance through application-specific, tightly coupled load-… ▽ More We propose a GPU fine-grained load-balancing abstraction that decouples load balancing from work processing and aims to support both static and dynamic schedules with a programmable interface to implement new load-balancing schedules. Prior to our work, the only way to unleash the GPU's potential on irregular problems has been to workload-balance through application-specific, tightly coupled load-balancing techniques. With our open-source framework for load-balancing, we hope to improve programmers' productivity when develo** irregular-parallel algorithms on the GPU, and also improve the overall performance characteristics for such applications by allowing a quick path to experimentation with a variety of existing load-balancing techniques. Consequently, we also hope that by separating the concerns of load-balancing from work processing within our abstraction, managing and extending existing code to future architectures becomes easier. △ Less

Submitted 11 January, 2023; originally announced January 2023.

Comments: This work previously appeared in the author's PhD dissertation, available at arXiv:2212.08964 Also published in the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP '23)

arXiv:2301.03598 [pdf, other]

Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU

Authors: Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens

Abstract: We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an even share of the aggregate inner loop iterations among physical processing elements. This provides a near-perfect utilization of computing resources, regardless o… ▽ More We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an even share of the aggregate inner loop iterations among physical processing elements. This provides a near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements. On GPU processors, our Stream-K parallelization of GEMM produces a peak speedup of up to 14$\times$ and 6.7$\times$, and an average performance response that is both higher and more consistent across 32,824 GEMM problem geometries than state-of-the-art math libraries such as CUTLASS and cuBLAS. Furthermore, we achieve this performance from a single tile size configuration per floating-point precision, whereas today's math libraries employ complex kernel-selection heuristics to select from a large ensemble of kernel variants. △ Less

Submitted 9 January, 2023; originally announced January 2023.

Comments: This work previously appeared in the author's PhD dissertation, available at arXiv:2212.08964

arXiv:2212.08200 [pdf, ps, other]

doi 10.1109/IPDPSW55747.2022.00061

Essentials of Parallel Graph Analytics

Authors: Muhammad Osama, Serban D. Porumbescu, John D. Owens

Abstract: We identify the graph data structure, frontiers, operators, an iterative loop structure, and convergence conditions as essential components of graph analytics systems based on the native-graph approach. Using these essential components, we propose an abstraction that captures all the significant programming models within graph analytics, such as bulk-synchronous, asynchronous, shared-memory, messa… ▽ More We identify the graph data structure, frontiers, operators, an iterative loop structure, and convergence conditions as essential components of graph analytics systems based on the native-graph approach. Using these essential components, we propose an abstraction that captures all the significant programming models within graph analytics, such as bulk-synchronous, asynchronous, shared-memory, message-passing, and push vs. pull traversals. Finally, we demonstrate the power of our abstraction with an elegant modern C++ implementation of single-source shortest path and its required components. △ Less

Submitted 15 December, 2022; originally announced December 2022.

Comments: Proceedings of the Workshop on Graphs, Architectures, Programming, and Learning

arXiv:2201.07821 [pdf, other]

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

Authors: Zhongyi Lin, Louis Feng, Ehsan K. Ardestani, Jaewon Lee, John Lundell, Changkyu Kim, Arun Kejariwal, John D. Owens

Abstract: We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) but also the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-b… ▽ More We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) but also the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for operators that dominate the device active time, and (2) categorizing operator overheads into five types to determine quantitatively their contribution to the device active time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and 4.61% and 7.96% geomean errors for GPU active time and overall E2E per-batch training time prediction with overheads from individual workloads, respectively. A slight increase of 2.19% incurred in E2E prediction error with shared overheads across workloads suggests the feasibility of using shared overheads in large-scale prediction. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and is dominated by multiple factors but also yields comparable accuracy on other compute-bound ML models targeted by most previous methods. Using this performance model and graph-level data and task dependency analysis, we show our system can provide more general model-system co-design than previous methods. △ Less

Submitted 16 November, 2022; v1 submitted 19 January, 2022; originally announced January 2022.

Comments: 11 pages, 11 figures. Appears in the 29th IEEE International Conference on High-Performance Computing, Data, and Analytics (HiPC 2022)

arXiv:2112.00132 [pdf, other]

Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations

Authors: Yuxin Chen, Benjamin Brock, Serban Porumbescu, Aydın Buluç, Katherine Yelick, John D. Owens

Abstract: We present Atos, a task-parallel GPU dynamic scheduling framework that is especially suited to dynamic irregular applications. Compared to the dominant Bulk Synchronous Parallel (BSP) frameworks, Atos exposes additional concurrency by supporting task-parallel formulations of applications with relaxed dependencies, achieving higher GPU utilization, which is particularly significant for problems wit… ▽ More We present Atos, a task-parallel GPU dynamic scheduling framework that is especially suited to dynamic irregular applications. Compared to the dominant Bulk Synchronous Parallel (BSP) frameworks, Atos exposes additional concurrency by supporting task-parallel formulations of applications with relaxed dependencies, achieving higher GPU utilization, which is particularly significant for problems with concurrency bottlenecks. Atos also offers implicit task-parallel load balancing in addition to data-parallel load balancing, providing users the flexibility to balance between them to achieve optimal performance. Finally, Atos allows users to adapt to different use cases by controlling the kernel strategy and task-parallel granularity. We demonstrate that each of these controls is important in practice. We evaluate and analyze the performance of Atos vs. BSP on three applications: breadth-first search, PageRank, and graph coloring. Atos implementations achieve geomean speedups of 3.44x, 2.1x, and 2.77x and peak speedups of 12.8x, 3.2x, and 9.08x across three case studies, compared to a state-of-the-art BSP GPU implementation. Beyond simply quantifying the speedup, we extensively analyze the reasons behind each speedup. This deeper understanding allows us to derive general guidelines for how to select the optimal Atos configuration for different applications. Finally, our analysis provides insights for future dynamic scheduling framework designs. △ Less

Submitted 30 November, 2021; originally announced December 2021.

Comments: 12 pages, 4 figures

arXiv:2109.14682 [pdf, other]

doi 10.1145/3543866

Supporting Unified Shader Specialization by Co-opting C++ Features

Authors: Kerry A. Seitz Jr., Theresa Foley, Serban D. Porumbescu, John D. Owens

Abstract: Modern unified programming models (such as CUDA and SYCL) that combine host (CPU) code and GPU code into the same programming language, same file, and same lexical scope lack adequate support for GPU code specialization, which is a key optimization in real-time graphics. Furthermore, current methods used to implement specialization do not translate to a unified environment. In this paper, we creat… ▽ More Modern unified programming models (such as CUDA and SYCL) that combine host (CPU) code and GPU code into the same programming language, same file, and same lexical scope lack adequate support for GPU code specialization, which is a key optimization in real-time graphics. Furthermore, current methods used to implement specialization do not translate to a unified environment. In this paper, we create a unified shader programming environment in C++ that provides first-class support for specialization by co-opting C++'s attribute and virtual function features and reimplementing them with alternate semantics to express the services required. By co-opting existing features, we enable programmers to use familiar C++ programming techniques to write host and GPU code together, while still achieving efficient generated C++ and HLSL code via our source-to-source translator. △ Less

Submitted 16 July, 2022; v1 submitted 29 September, 2021; originally announced September 2021.

Comments: 17 pages, 1 figure, 2 tables, 3 code listings. To be published in Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5, 3, Article 25 (July 2022)

ACM Class: I.3.6; D.3.2; D.3.4

Journal ref: Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5, 3, Article 25 (July 2022), 17 pages

arXiv:2108.07232 [pdf, other]

Better GPU Hash Tables

Authors: Muhammad A. Awad, Saman Ashkiani, Serban D. Porumbescu, Martín Farach-Colton, John D. Owens

Abstract: We revisit the problem of building static hash tables on the GPU and design and build three bucketed hash tables that use different probing schemes. Our implementations are lock-free and offer efficient memory access patterns; thus, only the probing scheme is the factor affecting the performance of the hash table's different operations. Our results show that a bucketed cuckoo hash table that uses… ▽ More We revisit the problem of building static hash tables on the GPU and design and build three bucketed hash tables that use different probing schemes. Our implementations are lock-free and offer efficient memory access patterns; thus, only the probing scheme is the factor affecting the performance of the hash table's different operations. Our results show that a bucketed cuckoo hash table that uses three hash functions (BCHT) outperforms alternative methods that use power-of-two choices, iceberg hashing, and a cuckoo hash table that uses a bucket size one. At high load factors as high as 0.99, BCHT enjoys an average probe count of 1.43 during insertion. Using three hash functions only, positive and negative queries require at most 1.39 and 2.8 average probes per key, respectively. △ Less

Submitted 17 December, 2022; v1 submitted 16 August, 2021; originally announced August 2021.

Comments: Our implementation is available at https://github.com/owensgroup/BGHT

arXiv:2010.03759 [pdf, other]

Energy-based Out-of-distribution Detection

Authors: Weitang Liu, Xiaoyun Wang, John D. Owens, Yixuan Li

Abstract: Determining whether inputs are out-of-distribution (OOD) is an essential building block for safely deploying machine learning models in the open world. However, previous methods relying on the softmax confidence score suffer from overconfident posterior distributions for OOD data. We propose a unified framework for OOD detection that uses an energy score. We show that energy scores better distingu… ▽ More Determining whether inputs are out-of-distribution (OOD) is an essential building block for safely deploying machine learning models in the open world. However, previous methods relying on the softmax confidence score suffer from overconfident posterior distributions for OOD data. We propose a unified framework for OOD detection that uses an energy score. We show that energy scores better distinguish in- and out-of-distribution samples than the traditional approach using the softmax scores. Unlike softmax confidence scores, energy scores are theoretically aligned with the probability density of the inputs and are less susceptible to the overconfidence issue. Within this framework, energy can be flexibly used as a scoring function for any pre-trained neural classifier as well as a trainable cost function to shape the energy surface explicitly for OOD detection. On a CIFAR-10 pre-trained WideResNet, using the energy score reduces the average FPR (at TPR 95%) by 18.03% compared to the softmax confidence score. With energy-based training, our method outperforms the state-of-the-art on common benchmarks. △ Less

Submitted 26 April, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

arXiv:2003.01527 [pdf, other]

Fast Gunrock Subgraph Matching (GSM) on GPUs

Authors: Leyuan Wang, John D. Owens

Abstract: In this paper, we propose a GPU-efficient subgraph isomorphism algorithm using the Gunrock graph analytic framework, GSM (Gunrock Subgraph Matching), to compute graph matching on GPUs. In contrast to previous approaches on the CPU which are based on depth-first traversal, GSM is BFS-based: possible matches are explored simultaneously in a breadth-first strategy. The advantage of using BFS-based tr… ▽ More In this paper, we propose a GPU-efficient subgraph isomorphism algorithm using the Gunrock graph analytic framework, GSM (Gunrock Subgraph Matching), to compute graph matching on GPUs. In contrast to previous approaches on the CPU which are based on depth-first traversal, GSM is BFS-based: possible matches are explored simultaneously in a breadth-first strategy. The advantage of using BFS-based traversal is that we can leverage the massively parallel processing capabilities of the GPU. The disadvantage is the generation of more intermediate results. We propose several optimization techniques to cope with the problem. Our implementation follows a filtering-and-verification strategy. While most previous work on GPUs requires one-/two-step joining, we use one-step verification to decide the candidates in current frontier of nodes. Our implementation has a speedup up to 4x over previous GPU state-of-the-art implementation. △ Less

Submitted 11 March, 2020; v1 submitted 29 February, 2020; originally announced March 2020.

Comments: arXiv admin note: text overlap with arXiv:1909.02127

arXiv:1911.09228 [pdf, other]

Unsupervised Object Segmentation with Explicit Localization Module

Authors: Weitang Liu, Lifeng Wei, James Sharpnack, John D. Owens

Abstract: In this paper, we propose a novel architecture that iteratively discovers and segments out the objects of a scene based on the image reconstruction quality. Different from other approaches, our model uses an explicit localization module that localizes objects of the scene based on the pixel-level reconstruction qualities at each iteration, where simpler objects tend to be reconstructed better at e… ▽ More In this paper, we propose a novel architecture that iteratively discovers and segments out the objects of a scene based on the image reconstruction quality. Different from other approaches, our model uses an explicit localization module that localizes objects of the scene based on the pixel-level reconstruction qualities at each iteration, where simpler objects tend to be reconstructed better at earlier iterations and thus are segmented out first. We show that our localization module improves the quality of the segmentation, especially on a challenging background. △ Less

Submitted 20 November, 2019; originally announced November 2019.

arXiv:1910.02158 [pdf, other]

RDMA vs. RPC for Implementing Distributed Data Structures

Authors: Benjamin Brock, Yuxin Chen, Jiakun Yan, John D. Owens, Aydın Buluç, Katherine Yelick

Abstract: Distributed data structures are key to implementing scalable applications for scientific simulations and data analysis. In this paper we look at two implementation styles for distributed data structures: remote direct memory access (RDMA) and remote procedure call (RPC). We focus on operations that require individual accesses to remote portions of a distributed data structure, e.g., accessing a ha… ▽ More Distributed data structures are key to implementing scalable applications for scientific simulations and data analysis. In this paper we look at two implementation styles for distributed data structures: remote direct memory access (RDMA) and remote procedure call (RPC). We focus on operations that require individual accesses to remote portions of a distributed data structure, e.g., accessing a hash table bucket or distributed queue, rather than global operations in which all processors collectively exchange information. We look at the trade-offs between the two styles through microbenchmarks and a performance model that approximates the cost of each. The RDMA operations have direct hardware support in the network and therefore lower latency and overhead, while the RPC operations are more expressive but higher cost and can suffer from lack of attentiveness from the remote side. We also run experiments to compare the real-world performance of RDMA- and RPC-based data structure operations with the predicted performance to evaluate the accuracy of our model, and show that while the model does not always precisely predict running time, it allows us to choose the best implementation in the examples shown. We believe this analysis will assist developers in designing data structures that will perform well on current network architectures, as well as network architects in providing better support for this class of distributed data structures. △ Less

Submitted 14 October, 2019; v1 submitted 4 October, 2019; originally announced October 2019.

arXiv:1909.02127 [pdf, other]

Fast BFS-Based Triangle Counting on GPUs

Authors: Leyuan Wang, John D. Owens

Abstract: In this paper, we propose a novel method to compute triangle counting on GPUs. Unlike previous formulations of graph matching, our approach is BFS-based by traversing the graph in an all-source-BFS manner and thus can be mapped onto GPUs in a massively parallel fashion. Our implementation uses the Gunrock programming model and we evaluate our implementation in runtime and memory consumption compar… ▽ More In this paper, we propose a novel method to compute triangle counting on GPUs. Unlike previous formulations of graph matching, our approach is BFS-based by traversing the graph in an all-source-BFS manner and thus can be mapped onto GPUs in a massively parallel fashion. Our implementation uses the Gunrock programming model and we evaluate our implementation in runtime and memory consumption compared with previous state-of-the-art work. We sustain a peak traversed-edges-per-second (TEPS) rate of nearly 10 GTEPS. Our algorithm is the most scalable and parallel among all existing GPU implementations and also outperforms all existing CPU distributed implementations. This work specifically focuses on leveraging our implementation on the triangle counting problem for the Subgraph Isomorphism Graph Challenge 2019, demonstrating a geometric mean speedup over the 2018 champion of 3.84x. △ Less

Submitted 4 September, 2019; originally announced September 2019.

arXiv:1908.01407 [pdf, other]

GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

Authors: Carl Yang, Aydin Buluc, John D. Owens

Abstract: High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph… ▽ More High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model. △ Less

Submitted 14 June, 2021; v1 submitted 4 August, 2019; originally announced August 2019.

Comments: 50 pages, 14 figures, 14 tables, to appear in ACM Transactions on Mathematical Software

arXiv:1902.08767 [pdf, other]

doi 10.1145/3337680

VoroCrust: Voronoi Meshing Without Clip**

Authors: Ahmed Abdelkader, Chandrajit L. Bajaj, Mohamed S. Ebeida, Ahmed H. Mahmoud, Scott A. Mitchell, John D. Owens, Ahmad A. Rushdi

Abstract: Polyhedral meshes are increasingly becoming an attractive option with particular advantages over traditional meshes for certain applications. What has been missing is a robust polyhedral meshing algorithm that can handle broad classes of domains exhibiting arbitrarily curved boundaries and sharp features. In addition, the power of primal-dual mesh pairs, exemplified by Voronoi-Delaunay meshes, has… ▽ More Polyhedral meshes are increasingly becoming an attractive option with particular advantages over traditional meshes for certain applications. What has been missing is a robust polyhedral meshing algorithm that can handle broad classes of domains exhibiting arbitrarily curved boundaries and sharp features. In addition, the power of primal-dual mesh pairs, exemplified by Voronoi-Delaunay meshes, has been recognized as an important ingredient in numerous formulations. The VoroCrust algorithm is the first provably-correct algorithm for conforming polyhedral Voronoi meshing for non-convex and non-manifold domains with guarantees on the quality of both surface and volume elements. A robust refinement process estimates a suitable sizing field that enables the careful placement of Voronoi seeds across the surface circumventing the need for clip** and avoiding its many drawbacks. The algorithm has the flexibility of filling the interior by either structured or random samples, while preserving all sharp features in the output mesh. We demonstrate the capabilities of the algorithm on a variety of models and compare against state-of-the-art polyhedral meshing methods based on clipped Voronoi cells establishing the clear advantage of VoroCrust output. △ Less

Submitted 22 November, 2023; v1 submitted 23 February, 2019; originally announced February 2019.

Comments: 18 pages (including appendix), 18 figures. Version without compressed images available on https://www.sandia.gov/app/uploads/sites/217/2023/09/VoroCrust.pdf. Supplemental materials available on https://www.sandia.gov/app/uploads/sites/217/2023/09/VoroCrust_supplemental_materials.pdf

ACM Class: I.3.5

Journal ref: ACM Transaction on Graphics, Vol. 39, No. 3, Article No. 23 (May 2020)

arXiv:1805.07706 [pdf, other]

Object Localization with a Weakly Supervised CapsNet

Authors: Weitang Liu, Emad Barsoum, John D. Owens

Abstract: Inspired by CapsNet's routing-by-agreement mechanism with its ability to learn object properties, we explore if those properties in turn can determine new properties of the objects, such as the locations. We then propose a CapsNet architecture with object coordinate atoms and a modified routing-by-agreement algorithm with unevenly distributed initial routing probabilities. The model is based on Ca… ▽ More Inspired by CapsNet's routing-by-agreement mechanism with its ability to learn object properties, we explore if those properties in turn can determine new properties of the objects, such as the locations. We then propose a CapsNet architecture with object coordinate atoms and a modified routing-by-agreement algorithm with unevenly distributed initial routing probabilities. The model is based on CapsNet but uses a routing algorithm to find the objects' approximate positions in the image coordinate system. We also discussed how to derive the property of translation through coordinate atoms and we show the importance of sparse representation. We train our model on the single moving MNIST dataset with class labels. Our model can learn and derive the coordinates of the digits better than its convolution counterpart that lacks a routing-by-agreement algorithm, and can also perform well when testing on the multi-digit moving MNIST and KTH datasets. The results show our method reaches the state-of-art performance on object localization without any extra localization techniques and modules as in prior work. △ Less

Submitted 2 December, 2019; v1 submitted 20 May, 2018; originally announced May 2018.

arXiv:1804.06926 [pdf, other]

doi 10.1145/2915516.2915521

A Comparative Study on Exact Triangle Counting Algorithms on the GPU

Authors: Leyuan Wang, Yangzihao Wang, Carl Yang, John D. Owens

Abstract: We implement exact triangle counting in graphs on the GPU using three different methodologies: subgraph matching to a triangle pattern; programmable graph analytics, with a set-intersection approach; and a matrix formulation based on sparse matrix-matrix multiplies. All three deliver best-of-class performance over CPU implementations and over comparable GPU implementations, with the graph-analytic… ▽ More We implement exact triangle counting in graphs on the GPU using three different methodologies: subgraph matching to a triangle pattern; programmable graph analytics, with a set-intersection approach; and a matrix formulation based on sparse matrix-matrix multiplies. All three deliver best-of-class performance over CPU implementations and over comparable GPU implementations, with the graph-analytic approach achieving the best performance due to its ability to exploit efficient filtering steps to remove unnecessary work and its high-performance set-intersection core. △ Less

Submitted 18 April, 2018; originally announced April 2018.

Comments: 7 pages, 6 figures and 2 tables

arXiv:1804.03327 [pdf, other]

Implementing Push-Pull Efficiently in GraphBLAS

Authors: Carl Yang, Aydin Buluc, John D. Owens

Abstract: We factor Beamer's push-pull, also known as direction-optimized breadth-first-search (DOBFS) into 3 separable optimizations, and analyze them for generalizability, asymptotic speedup, and contribution to overall speedup. We demonstrate that masking is critical for high performance and can be generalized to all graph algorithms where the sparsity pattern of the output is known a priori. We show tha… ▽ More We factor Beamer's push-pull, also known as direction-optimized breadth-first-search (DOBFS) into 3 separable optimizations, and analyze them for generalizability, asymptotic speedup, and contribution to overall speedup. We demonstrate that masking is critical for high performance and can be generalized to all graph algorithms where the sparsity pattern of the output is known a priori. We show that these graph algorithm optimizations, which together constitute DOBFS, can be neatly and separably described using linear algebra and can be expressed in the GraphBLAS linear-algebra-based framework. We provide experimental evidence that with these optimizations, a DOBFS expressed in a linear-algebra-based graph framework attains competitive performance with state-of-the-art graph frameworks on the GPU and on a multi-threaded CPU, achieving 101 GTEPS on a Scale 22 RMAT graph. △ Less

Submitted 20 June, 2018; v1 submitted 9 April, 2018; originally announced April 2018.

Comments: 11 pages, 7 figures, International Conference on Parallel Processing (ICPP) 2018

arXiv:1803.08601 [pdf, other]

Design Principles for Sparse Matrix Multiplication on the GPU

Authors: Carl Yang, Aydin Buluc, John D. Owens

Abstract: We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancin… ▽ More We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory access pattern that allows efficient access into both input and output matrices that is crucial to getting excellent performance on SpMM. By combining these two ingredients---(i) merge-based load-balancing and (ii) row-major coalesced memory access---we demonstrate a 4.1x peak speedup and a 31.7% geomean speedup over state-of-the-art SpMM implementations on real-world datasets. △ Less

Submitted 12 June, 2018; v1 submitted 22 March, 2018; originally announced March 2018.

Comments: 16 pages, 7 figures, International European Conference on Parallel and Distributed Computing (Euro-Par) 2018

arXiv:1803.06078 [pdf, other]

Sampling Conditions for Conforming Voronoi Meshing by the VoroCrust Algorithm

Authors: Ahmed Abdelkader, Chandrajit L. Bajaj, Mohamed S. Ebeida, Ahmed H. Mahmoud, Scott A. Mitchell, John D. Owens, Ahmad A. Rushdi

Abstract: We study the problem of decomposing a volume bounded by a smooth surface into a collection of Voronoi cells. Unlike the dual problem of conforming Delaunay meshing, a principled solution to this problem for generic smooth surfaces remained elusive. VoroCrust leverages ideas from $α$-shapes and the power crust algorithm to produce unweighted Voronoi cells conforming to the surface, yielding the fir… ▽ More We study the problem of decomposing a volume bounded by a smooth surface into a collection of Voronoi cells. Unlike the dual problem of conforming Delaunay meshing, a principled solution to this problem for generic smooth surfaces remained elusive. VoroCrust leverages ideas from $α$-shapes and the power crust algorithm to produce unweighted Voronoi cells conforming to the surface, yielding the first provably-correct algorithm for this problem. Given an $ε$-sample on the bounding surface, with a weak $σ$-sparsity condition, we work with the balls of radius $δ$ times the local feature size centered at each sample. The corners of this union of balls are the Voronoi sites, on both sides of the surface. The facets common to cells on opposite sides reconstruct the surface. For appropriate values of $ε$, $σ$ and $δ$, we prove that the surface reconstruction is isotopic to the bounding surface. With the surface protected, the enclosed volume can be further decomposed into an isotopic volume mesh of fat Voronoi cells by generating a bounded number of sites in its interior. Compared to state-of-the-art methods based on clip**, VoroCrust cells are full Voronoi cells, with convexity and fatness guarantees. Compared to the power crust algorithm, VoroCrust cells are not filtered, are unweighted, and offer greater flexibility in meshing the enclosed volume by either structured grids or random samples. △ Less

Submitted 14 April, 2018; v1 submitted 16 March, 2018; originally announced March 2018.

Comments: polished up version, results essentially unchanged

ACM Class: I.3.5

arXiv:1803.03922 [pdf, other]

Scalable Breadth-First Search on a GPU Cluster

Authors: Yuechao Pan, Roger Pearce, John D. Owens

Abstract: On a GPU cluster, the ratio of high computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high and low out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction for high-d… ▽ More On a GPU cluster, the ratio of high computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high and low out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction for high-degree vertices, and point-to-point transmission for low-degree vertices. Leveraging the characteristics of degree separation, we reduce the graph size to one third of the conventional edge list representation. With several other optimizations, we observe linear weak scaling as we increase the number of GPUs, and achieve 259.8 GTEPS on a scale-33 Graph500 RMAT graph with 124 GPUs on the latest CORAL early access system. △ Less

Submitted 5 April, 2018; v1 submitted 11 March, 2018; originally announced March 2018.

Comments: 12 pages, 13 figures. To appear at IPDPS 2018

arXiv:1710.11246 [pdf, other]

A Dynamic Hash Table for the GPU

Authors: Saman Ashkiani, Martin Farach-Colton, John D. Owens

Abstract: We design and implement a fully concurrent dynamic hash table for GPUs with comparable performance to the state of the art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional way of per-thread (or per-warp) work assignment and processing. By using this strategy, we build a dynamic non-blocki… ▽ More We design and implement a fully concurrent dynamic hash table for GPUs with comparable performance to the state of the art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional way of per-thread (or per-warp) work assignment and processing. By using this strategy, we build a dynamic non-blocking concurrent linked list, the slab list, that supports asynchronous, concurrent updates (insertions and deletions) as well as search queries. We use the slab list to implement a dynamic hash table with chaining (the slab hash). On an NVIDIA Tesla K40c GPU, the slab hash performs updates with up to 512 M updates/s and processes search queries with up to 937 M queries/s. We also design a warp-synchronous dynamic memory allocator, SlabAlloc, that suits the high performance needs of the slab hash. SlabAlloc dynamically allocates memory at a rate of 600 M allocations/s, which is up to 37x faster than alternative methods in similar scenarios. △ Less

Submitted 1 March, 2018; v1 submitted 30 October, 2017; originally announced October 2017.

Comments: 11 pages, accepted to appear on the Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018)

arXiv:1707.05354 [pdf, other]

GPU LSM: A Dynamic Dictionary Data Structure for the GPU

Authors: Saman Ashkiani, Shengren Li, Martin Farach-Colton, Nina Amenta, John D. Owens

Abstract: We develop a dynamic dictionary data structure for the GPU, supporting fast insertions and deletions, based on the Log Structured Merge tree (LSM). Our implementation on an NVIDIA K40c GPU has an average update (insertion or deletion) rate of 225 M elements/s, 13.5x faster than merging items into a sorted array. The GPU LSM supports the retrieval operations of lookup, count, and range query operat… ▽ More We develop a dynamic dictionary data structure for the GPU, supporting fast insertions and deletions, based on the Log Structured Merge tree (LSM). Our implementation on an NVIDIA K40c GPU has an average update (insertion or deletion) rate of 225 M elements/s, 13.5x faster than merging items into a sorted array. The GPU LSM supports the retrieval operations of lookup, count, and range query operations with an average rate of 75 M, 32 M and 23 M queries/s respectively. The trade-off for the dynamic updates is that the sorted array is almost twice as fast on retrievals. We believe that our GPU LSM is the first dynamic general-purpose dictionary data structure for the GPU. △ Less

Submitted 1 March, 2018; v1 submitted 17 July, 2017; originally announced July 2017.

Comments: 11 pages, accepted to appear on the Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS'18)

arXiv:1701.01189 [pdf, other]

doi 10.1145/3108139

GPU Multisplit: an extended study of a parallel algorithm

Authors: Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens

Abstract: Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input… ▽ More Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it. In case smaller indexed buckets possess smaller valued keys, another way for multisplit is to directly sort input data. Both methods are inefficient and require more work than necessary: the former requires more expensive data movements while the latter spends unnecessary effort in sorting elements within each bucket. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small (up to 256) number of buckets. We use warp-synchronous programming models and emphasize warp-wide communications to avoid branch divergence and reduce memory usage. We also hierarchically reorder input elements to achieve better coalescing of global memory accesses. On a GeForce GTX 1080 GPU, we can reach a peak throughput of 18.93 Gkeys/s (or 11.68 Gpairs/s) for a key-only (or key-value) multisplit. Finally, we demonstrate how multisplit can be used as a building block for radix sort. In our multisplit-based sort implementation, we achieve comparable performance to the fastest GPU sort routines, sorting 32-bit keys (and key-value pairs) with a throughput of 3.0 G keys/s (and 2.1 Gpair/s). △ Less

Submitted 18 May, 2017; v1 submitted 4 January, 2017; originally announced January 2017.

Comments: 44 pages, to appear on ACM Transactions on Parallel Computing (TOPC): "Special Issue: invited papers from PPoPP 2016". This is an extended version of PPoPP'16 paper "GPU Multisplit"

Journal ref: ACM Transactions on Parallel Computing (TOPC), Volume 4, Issue 1, Article No. 2, August 2017

arXiv:1701.01170 [pdf, other]

Gunrock: GPU Graph Analytics

Authors: Yangzihao Wang, Yuechao Pan, Andrew Davidson, Yuduo Wu, Carl Yang, Leyuan Wang, Muhammad Osama, Chenshan Yuan, Weitang Liu, Andy T. Riffel, John D. Owens

Abstract: For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to develo** a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a… ▽ More For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to develo** a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms, to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library. △ Less

Submitted 4 January, 2017; originally announced January 2017.

Comments: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU"

arXiv:1606.05790 [pdf, other]

doi 10.1109/HPEC.2016.7761646

Mathematical Foundations of the GraphBLAS

Authors: Jeremy Kepner, Peter Aaltonen, David Bader, Aydın Buluc, Franz Franchetti, John Gilbert, Dylan Hutchison, Manoj Kumar, Andrew Lumsdaine, Henning Meyerhenke, Scott McMillan, Jose Moreira, John D. Owens, Carl Yang, Marcin Zalewski, Timothy Mattson

Abstract: The GraphBLAS standard (GraphBlas.org) is being developed to bring the potential of matrix based graph algorithms to the broadest possible audience. Mathematically the Graph- BLAS defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the mathematics of th… ▽ More The GraphBLAS standard (GraphBlas.org) is being developed to bring the potential of matrix based graph algorithms to the broadest possible audience. Mathematically the Graph- BLAS defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the mathematics of the GraphBLAS. Graphs represent connections between vertices with edges. Matrices can represent a wide range of graphs using adjacency matrices or incidence matrices. Adjacency matrices are often easier to analyze while incidence matrices are often better for representing data. Fortunately, the two are easily connected by matrix mul- tiplication. A key feature of matrix mathematics is that a very small number of matrix operations can be used to manipulate a very wide range of graphs. This composability of small number of operations is the foundation of the GraphBLAS. A standard such as the GraphBLAS can only be effective if it has low performance overhead. Performance measurements of prototype GraphBLAS implementations indicate that the overhead is low. △ Less

Submitted 13 July, 2016; v1 submitted 18 June, 2016; originally announced June 2016.

Comments: 9 pages; 11 figures; accepted to IEEE High Performance Extreme Computing (HPEC) conference 2016. arXiv admin note: text overlap with arXiv:1504.01039

arXiv:1504.04804 [pdf, other]

Multi-GPU Graph Analytics

Authors: Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, John D. Owens

Abstract: We present a single-node, multi-GPU programmable graph processing library that allows programmers to easily extend single-GPU graph algorithms to achieve scalable performance on large graphs with billions of edges. Directly using the single-GPU implementations, our design only requires programmers to specify a few algorithm-dependent concerns, hiding most multi-GPU related implementation details.… ▽ More We present a single-node, multi-GPU programmable graph processing library that allows programmers to easily extend single-GPU graph algorithms to achieve scalable performance on large graphs with billions of edges. Directly using the single-GPU implementations, our design only requires programmers to specify a few algorithm-dependent concerns, hiding most multi-GPU related implementation details. We analyze the theoretical and practical limits to scalability in the context of varying graph primitives and datasets. We describe several optimizations, such as direction optimizing traversal, and a just-enough memory allocation scheme, for better performance and smaller memory consumption. Compared to previous work, we achieve best-of-class performance across operations and datasets, including excellent strong and weak scalability on most primitives as we increase the number of GPUs in the system. △ Less

Submitted 1 March, 2017; v1 submitted 19 April, 2015; originally announced April 2015.

Comments: 12 pages. Final version submitted to IPDPS 2017

arXiv:1501.05387 [pdf, other]

doi 10.1145/2851141.2851145

Gunrock: A High-Performance Graph Processing Library on the GPU

Authors: Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, John D. Owens

Abstract: For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs have been two significant challenges for develo** a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a verte… ▽ More For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs have been two significant challenges for develo** a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We evaluate Gunrock on five key graph primitives and show that Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives, and better performance than any other GPU high-level graph library. △ Less

Submitted 22 February, 2016; v1 submitted 21 January, 2015; originally announced January 2015.

Comments: 14 pages, accepted by PPoPP'16 (removed the text repetition in the previous version v5)

ACM Class: D.1.3

arXiv:1404.6293 [pdf, other]

doi 10.1145/2766973

Piko: A Design Framework for Programmable Graphics Pipelines

Authors: Anjul Patney, Stanley Tzeng, Kerry A. Seitz Jr., John D. Owens

Abstract: We present Piko, a framework for designing, optimizing, and retargeting implementations of graphics pipelines on multiple architectures. Piko programmers express a graphics pipeline by organizing the computation within each stage into spatial bins and specifying a scheduling preference for these bins. Our compiler, Pikoc, compiles this input into an optimized implementation targeted to a massively… ▽ More We present Piko, a framework for designing, optimizing, and retargeting implementations of graphics pipelines on multiple architectures. Piko programmers express a graphics pipeline by organizing the computation within each stage into spatial bins and specifying a scheduling preference for these bins. Our compiler, Pikoc, compiles this input into an optimized implementation targeted to a massively-parallel GPU or a multicore CPU. Piko manages work granularity in a programmable and flexible manner, allowing programmers to build load-balanced parallel pipeline implementations, to exploit spatial and producer-consumer locality in a pipeline implementation, and to explore tradeoffs between these considerations. We demonstrate that Piko can implement a wide range of pipelines, including rasterization, Reyes, ray tracing, rasterization/ray tracing hybrid, and deferred rendering. Piko allows us to implement efficient graphics pipelines with relative ease and to quickly explore design alternatives by modifying the spatial binning configurations and scheduling preferences for individual stages, all while delivering real-time performance that is within a factor six of state-of-the-art rendering systems. △ Less

Submitted 29 January, 2015; v1 submitted 24 April, 2014; originally announced April 2014.

Comments: 13 pages, updated for 2015

ACM Class: I.3.1; I.3.2

Journal ref: ACM Transactions on Graphics 34, 4 (July 2015), 147:1-147:13

arXiv:1309.1230 [pdf, other]

A GPU Implementation for Two-Dimensional Shallow Water Modeling

Authors: Kerry A. Seitz Jr., Alex Kennedy, Owen Ransom, Bassam A. Younis, John D. Owens

Abstract: In this paper, we present a GPU implementation of a two-dimensional shallow water model. Water simulations are useful for modeling floods, river/reservoir behavior, and dam break scenarios. Our GPU implementation shows vast performance improvements over the original Fortran implementation. By taking advantage of the GPU, researchers and engineers will be able to study water systems more efficientl… ▽ More In this paper, we present a GPU implementation of a two-dimensional shallow water model. Water simulations are useful for modeling floods, river/reservoir behavior, and dam break scenarios. Our GPU implementation shows vast performance improvements over the original Fortran implementation. By taking advantage of the GPU, researchers and engineers will be able to study water systems more efficiently and in greater detail. △ Less

Submitted 5 September, 2013; originally announced September 2013.

Comments: 9 pages, 1 figure

arXiv:1302.3917 [pdf, other]

doi 10.1145/2522528

k-d Darts: Sampling by k-Dimensional Flat Searches

Authors: Mohamed S. Ebeida, Anjul Patney, Scott A. Mitchell, Keith R. Dalbey, Andrew A. Davidson, John D. Owens

Abstract: We formalize the notion of sampling a function using k-d darts. A k-d dart is a set of independent, mutually orthogonal, k-dimensional subspaces called k-d flats. Each dart has d choose k flats, aligned with the coordinate axes for efficiency. We show that k-d darts are useful for exploring a function's properties, such as estimating its integral, or finding an exemplar above a threshold. We descr… ▽ More We formalize the notion of sampling a function using k-d darts. A k-d dart is a set of independent, mutually orthogonal, k-dimensional subspaces called k-d flats. Each dart has d choose k flats, aligned with the coordinate axes for efficiency. We show that k-d darts are useful for exploring a function's properties, such as estimating its integral, or finding an exemplar above a threshold. We describe a recipe for converting an algorithm from point sampling to k-d dart sampling, assuming the function can be evaluated along a k-d flat. We demonstrate that k-d darts are more efficient than point-wise samples in high dimensions, depending on the characteristics of the sampling domain: e.g. the subregion of interest has small volume and evaluating the function along a flat is not too expensive. We present three concrete applications using line darts (1-d darts): relaxed maximal Poisson-disk sampling, high-quality rasterization of depth-of-field blur, and estimation of the probability of failure from a response surface for uncertainty quantification. In these applications, line darts achieve the same fidelity output as point darts in less time. We also demonstrate the accuracy of higher dimensional darts for a volume estimation problem. For Poisson-disk sampling, we use significantly less memory, enabling the generation of larger point clouds in higher dimensions. △ Less

Submitted 15 February, 2013; originally announced February 2013.

Comments: 19 pages 16 figures

ACM Class: I.3.5

Journal ref: Transactions on Graphics, vol. 33, no. 1 (Jan 2014) pp. 3:1--3:16

arXiv:1201.2936 [pdf, other]

Finding Convex Hulls Using Quickhull on the GPU

Authors: Stanley Tzeng, John D. Owens

Abstract: We present a convex hull algorithm that is accelerated on commodity graphics hardware. We analyze and identify the hurdles of writing a recursive divide and conquer algorithm on the GPU and divise a framework for representing this class of problems. Our framework transforms the recursive splitting step into a permutation step that is well-suited for graphics hardware. Our convex hull algorithm of… ▽ More We present a convex hull algorithm that is accelerated on commodity graphics hardware. We analyze and identify the hurdles of writing a recursive divide and conquer algorithm on the GPU and divise a framework for representing this class of problems. Our framework transforms the recursive splitting step into a permutation step that is well-suited for graphics hardware. Our convex hull algorithm of choice is Quickhull. Our parallel Quickhull implementation (for both 2D and 3D cases) achieves an order of magnitude speedup over standard computational geometry libraries. △ Less

Submitted 13 January, 2012; originally announced January 2012.

Comments: 11 pages

arXiv:1110.4623 [pdf, other]

Efficient Synchronization Primitives for GPUs

Authors: Jeff A. Stuart, John D. Owens

Abstract: In this paper, we revisit the design of synchronization primitives---specifically barriers, mutexes, and semaphores---and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of slee** on the G… ▽ More In this paper, we revisit the design of synchronization primitives---specifically barriers, mutexes, and semaphores---and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of slee** on the GPU, by running a set of memory-system benchmarks on two of the most common GPUs in use, the Tesla- and Fermi-class GPUs from NVIDIA. From our results we define higher-level principles that are valid for generic many-core processors, the most important of which is to limit the number of atomic accesses required for a synchronization operation because atomic accesses are slower than regular memory accesses. We use the results of the benchmarks to critique existing synchronization algorithms and guide our new implementations, and then define an abstraction of GPUs to classify any GPU based on the behavior of the memory system. We use this abstraction to create suitable implementations of the primitives specifically targeting the GPU, and analyze the performance of these algorithms on Tesla and Fermi. We then predict performance on future GPUs based on characteristics of the abstraction. We also examine the roles of spin waiting and sleep waiting in each primitive and how their performance varies based on the machine abstraction, then give a set of guidelines for when each strategy is useful based on the characteristics of the GPU and expected contention. △ Less

Submitted 20 October, 2011; originally announced October 2011.

Comments: 13 pages with appendix, several figures, plans to submit to CompSci conference in early 2012

ACM Class: D.4.1; I.3.2

Showing 1–36 of 36 results for author: Owens, J D