Showing 1–36 of 36 results for author: Owens, J D

  1. arXiv:2404.12674  [pdf, other]

    cs.DC cs.LG cs.PF

    Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

    Authors: Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

    Abstract: Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance i…

    Submitted 27 April, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

    Comments: 12 pages, 11 figures, 4 tables

  2. arXiv:2404.11591  [pdf, other]

    cs.DS

    The EDGE Language: Extended General Einsums for Graph Algorithms

    Authors: Toluwanimi O. Odemuyiwa, Joel S. Emer, John D. Owens

    Abstract: In this work, we propose a unified abstraction for graph algorithms: the Extended General Einsums language, or EDGE. The EDGE language expresses graph algorithms in the language of tensor algebra, providing a rigorous, succinct, and expressive mathematical framework. EDGE leverages two ideas: (1) the well-known foundations provided by the graph-matrix duality, where a graph is simply a 2D tensor,…

    Submitted 17 April, 2024; originally announced April 2024.

    Comments: 79 pages, 14 figures

  3. arXiv:2310.00496  [pdf, other]

    cs.CV cs.LG

    The Sparsity Roofline: Understanding the Hardware Limits of Sparse Neural Networks

    Authors: Cameron Shinn, Collin McCarthy, Saurav Muralidharan, Muhammad Osama, John D. Owens

    Abstract: We introduce the Sparsity Roofline, a visual performance model for evaluating sparsity in neural networks. The Sparsity Roofline jointly models network accuracy, sparsity, and theoretical inference speedup. Our approach does not require implementing and benchmarking optimized kernels, and the theoretical speedup becomes equal to the actual speedup when the corresponding dense and sparse kernels ar…

    Submitted 6 November, 2023; v1 submitted 30 September, 2023; originally announced October 2023.

  4. arXiv:2306.10410  [pdf, other]

    cs.DC

    BOBA: A Parallel Lightweight Graph Reordering Algorithm with Heavyweight Implications

    Authors: Matthew Drescher, Muhammad A. Awad, Serban D. Porumbescu, John D. Owens

    Abstract: We describe a simple parallel-friendly lightweight graph reordering algorithm for COO graphs (edge lists). Our "Batched Order By Attachment" (BOBA) algorithm is linear in the number of edges in terms of reads and linear in the number of vertices for writes through to main memory. It is highly parallelizable on GPUs. We show that, compared to a randomized baseline, the ordering produced gives…

    Submitted 21 June, 2023; v1 submitted 17 June, 2023; originally announced June 2023.

  5. A Programming Model for GPU Load Balancing

    Authors: Muhammad Osama, Serban D. Porumbescu, John D. Owens

    Abstract: We propose a GPU fine-grained load-balancing abstraction that decouples load balancing from work processing and aims to support both static and dynamic schedules with a programmable interface to implement new load-balancing schedules. Prior to our work, the only way to unleash the GPU's potential on irregular problems has been to workload-balance through application-specific, tightly coupled load-…

    Submitted 11 January, 2023; originally announced January 2023.

    Comments: This work previously appeared in the author's PhD dissertation, available at arXiv:2212.08964. Also published in the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP '23)

  6. arXiv:2301.03598  [pdf, other]

    cs.DS cs.DC

    Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU

    Authors: Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens

    Abstract: We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an even share of the aggregate inner loop iterations among physical processing elements. This provides a near-perfect utilization of computing resources, regardless o…

    Submitted 9 January, 2023; originally announced January 2023.

    Comments: This work previously appeared in the author's PhD dissertation, available at arXiv:2212.08964

  7. Essentials of Parallel Graph Analytics

    Authors: Muhammad Osama, Serban D. Porumbescu, John D. Owens

    Abstract: We identify the graph data structure, frontiers, operators, an iterative loop structure, and convergence conditions as essential components of graph analytics systems based on the native-graph approach. Using these essential components, we propose an abstraction that captures all the significant programming models within graph analytics, such as bulk-synchronous, asynchronous, shared-memory, messa…

    Submitted 15 December, 2022; originally announced December 2022.

    Comments: Proceedings of the Workshop on Graphs, Architectures, Programming, and Learning

  8. arXiv:2201.07821  [pdf, other]

    cs.LG cs.PF

    Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

    Authors: Zhongyi Lin, Louis Feng, Ehsan K. Ardestani, Jaewon Lee, John Lundell, Changkyu Kim, Arun Kejariwal, John D. Owens

    Abstract: We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-b…

    Submitted 16 November, 2022; v1 submitted 19 January, 2022; originally announced January 2022.

    Comments: 11 pages, 11 figures. Appears in the 29th IEEE International Conference on High-Performance Computing, Data, and Analytics (HiPC 2022)

  9. arXiv:2112.00132  [pdf, other]

    cs.DC

    Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations

    Authors: Yuxin Chen, Benjamin Brock, Serban Porumbescu, Aydın Buluç, Katherine Yelick, John D. Owens

    Abstract: We present Atos, a task-parallel GPU dynamic scheduling framework that is especially suited to dynamic irregular applications. Compared to the dominant Bulk Synchronous Parallel (BSP) frameworks, Atos exposes additional concurrency by supporting task-parallel formulations of applications with relaxed dependencies, achieving higher GPU utilization, which is particularly significant for problems wit…

    Submitted 30 November, 2021; originally announced December 2021.

    Comments: 12 pages, 4 figures

  10. arXiv:2109.14682  [pdf, other]

    cs.GR cs.PL

    Supporting Unified Shader Specialization by Co-opting C++ Features

    Authors: Kerry A. Seitz Jr., Theresa Foley, Serban D. Porumbescu, John D. Owens

    Abstract: Modern unified programming models (such as CUDA and SYCL) that combine host (CPU) code and GPU code into the same programming language, same file, and same lexical scope lack adequate support for GPU code specialization, which is a key optimization in real-time graphics. Furthermore, current methods used to implement specialization do not translate to a unified environment. In this paper, we creat…

    Submitted 16 July, 2022; v1 submitted 29 September, 2021; originally announced September 2021.

    Comments: 17 pages, 1 figure, 2 tables, 3 code listings. To be published in Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5, 3, Article 25 (July 2022)

    ACM Class: I.3.6; D.3.2; D.3.4

    Journal ref: Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5, 3, Article 25 (July 2022), 17 pages

  11. arXiv:2108.07232  [pdf, other]

    cs.DS

    Better GPU Hash Tables

    Authors: Muhammad A. Awad, Saman Ashkiani, Serban D. Porumbescu, Martín Farach-Colton, John D. Owens

    Abstract: We revisit the problem of building static hash tables on the GPU and design and build three bucketed hash tables that use different probing schemes. Our implementations are lock-free and offer efficient memory access patterns; thus, only the probing scheme is the factor affecting the performance of the hash table's different operations. Our results show that a bucketed cuckoo hash table that uses…

    Submitted 17 December, 2022; v1 submitted 16 August, 2021; originally announced August 2021.

    Comments: Our implementation is available at https://github.com/owensgroup/BGHT

  12. arXiv:2010.03759  [pdf, other]

    cs.LG cs.AI

    Energy-based Out-of-distribution Detection

    Authors: Weitang Liu, Xiaoyun Wang, John D. Owens, Yixuan Li

    Abstract: Determining whether inputs are out-of-distribution (OOD) is an essential building block for safely deploying machine learning models in the open world. However, previous methods relying on the softmax confidence score suffer from overconfident posterior distributions for OOD data. We propose a unified framework for OOD detection that uses an energy score. We show that energy scores better distingu…

    Submitted 26 April, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

  13. arXiv:2003.01527  [pdf, other]

    cs.DC

    Fast Gunrock Subgraph Matching (GSM) on GPUs

    Authors: Leyuan Wang, John D. Owens

    Abstract: In this paper, we propose a GPU-efficient subgraph isomorphism algorithm using the Gunrock graph analytic framework, GSM (Gunrock Subgraph Matching), to compute graph matching on GPUs. In contrast to previous approaches on the CPU which are based on depth-first traversal, GSM is BFS-based: possible matches are explored simultaneously in a breadth-first strategy. The advantage of using BFS-based tr…

    Submitted 11 March, 2020; v1 submitted 29 February, 2020; originally announced March 2020.

    Comments: arXiv admin note: text overlap with arXiv:1909.02127

  14. arXiv:1911.09228  [pdf, other]

    cs.CV cs.LG eess.IV

    Unsupervised Object Segmentation with Explicit Localization Module

    Authors: Weitang Liu, Lifeng Wei, James Sharpnack, John D. Owens

    Abstract: In this paper, we propose a novel architecture that iteratively discovers and segments out the objects of a scene based on the image reconstruction quality. Different from other approaches, our model uses an explicit localization module that localizes objects of the scene based on the pixel-level reconstruction qualities at each iteration, where simpler objects tend to be reconstructed better at e…

    Submitted 20 November, 2019; originally announced November 2019.

  15. arXiv:1910.02158  [pdf, other]

    cs.DC

    RDMA vs. RPC for Implementing Distributed Data Structures

    Authors: Benjamin Brock, Yuxin Chen, Jiakun Yan, John D. Owens, Aydın Buluç, Katherine Yelick

    Abstract: Distributed data structures are key to implementing scalable applications for scientific simulations and data analysis. In this paper we look at two implementation styles for distributed data structures: remote direct memory access (RDMA) and remote procedure call (RPC). We focus on operations that require individual accesses to remote portions of a distributed data structure, e.g., accessing a ha…

    Submitted 14 October, 2019; v1 submitted 4 October, 2019; originally announced October 2019.

  16. arXiv:1909.02127  [pdf, other]

    cs.DC

    Fast BFS-Based Triangle Counting on GPUs

    Authors: Leyuan Wang, John D. Owens

    Abstract: In this paper, we propose a novel method to compute triangle counting on GPUs. Unlike previous formulations of graph matching, our approach is BFS-based by traversing the graph in an all-source-BFS manner and thus can be mapped onto GPUs in a massively parallel fashion. Our implementation uses the Gunrock programming model and we evaluate our implementation in runtime and memory consumption compar…

    Submitted 4 September, 2019; originally announced September 2019.

  17. arXiv:1908.01407  [pdf, other]

    cs.DC cs.MS

    GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

    Authors: Carl Yang, Aydin Buluc, John D. Owens

    Abstract: High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph…

    Submitted 14 June, 2021; v1 submitted 4 August, 2019; originally announced August 2019.

    Comments: 50 pages, 14 figures, 14 tables, to appear in ACM Transactions on Mathematical Software

  18. arXiv:1902.08767  [pdf, other]

    cs.GR cs.CG

    VoroCrust: Voronoi Meshing Without Clipping

    Authors: Ahmed Abdelkader, Chandrajit L. Bajaj, Mohamed S. Ebeida, Ahmed H. Mahmoud, Scott A. Mitchell, John D. Owens, Ahmad A. Rushdi

    Abstract: Polyhedral meshes are increasingly becoming an attractive option with particular advantages over traditional meshes for certain applications. What has been missing is a robust polyhedral meshing algorithm that can handle broad classes of domains exhibiting arbitrarily curved boundaries and sharp features. In addition, the power of primal-dual mesh pairs, exemplified by Voronoi-Delaunay meshes, has…

    Submitted 22 November, 2023; v1 submitted 23 February, 2019; originally announced February 2019.

    Comments: 18 pages (including appendix), 18 figures. Version without compressed images available on https://www.sandia.gov/app/uploads/sites/217/2023/09/VoroCrust.pdf. Supplemental materials available on https://www.sandia.gov/app/uploads/sites/217/2023/09/VoroCrust_supplemental_materials.pdf

    ACM Class: I.3.5

    Journal ref: ACM Transactions on Graphics, Vol. 39, No. 3, Article No. 23 (May 2020)

  19. arXiv:1805.07706  [pdf, other]

    cs.CV

    Object Localization with a Weakly Supervised CapsNet

    Authors: Weitang Liu, Emad Barsoum, John D. Owens

    Abstract: Inspired by CapsNet's routing-by-agreement mechanism with its ability to learn object properties, we explore if those properties in turn can determine new properties of the objects, such as the locations. We then propose a CapsNet architecture with object coordinate atoms and a modified routing-by-agreement algorithm with unevenly distributed initial routing probabilities. The model is based on Ca…

    Submitted 2 December, 2019; v1 submitted 20 May, 2018; originally announced May 2018.

  20. A Comparative Study on Exact Triangle Counting Algorithms on the GPU

    Authors: Leyuan Wang, Yangzihao Wang, Carl Yang, John D. Owens

    Abstract: We implement exact triangle counting in graphs on the GPU using three different methodologies: subgraph matching to a triangle pattern; programmable graph analytics, with a set-intersection approach; and a matrix formulation based on sparse matrix-matrix multiplies. All three deliver best-of-class performance over CPU implementations and over comparable GPU implementations, with the graph-analytic…

    Submitted 18 April, 2018; originally announced April 2018.

    Comments: 7 pages, 6 figures and 2 tables

  21. arXiv:1804.03327  [pdf, other]

    cs.DC

    Implementing Push-Pull Efficiently in GraphBLAS

    Authors: Carl Yang, Aydin Buluc, John D. Owens

    Abstract: We factor Beamer's push-pull, also known as direction-optimized breadth-first-search (DOBFS), into 3 separable optimizations, and analyze them for generalizability, asymptotic speedup, and contribution to overall speedup. We demonstrate that masking is critical for high performance and can be generalized to all graph algorithms where the sparsity pattern of the output is known a priori. We show tha…

    Submitted 20 June, 2018; v1 submitted 9 April, 2018; originally announced April 2018.

    Comments: 11 pages, 7 figures, International Conference on Parallel Processing (ICPP) 2018

  22. arXiv:1803.08601  [pdf, other]

    cs.DC

    Design Principles for Sparse Matrix Multiplication on the GPU

    Authors: Carl Yang, Aydin Buluc, John D. Owens

    Abstract: We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancin…

    Submitted 12 June, 2018; v1 submitted 22 March, 2018; originally announced March 2018.

    Comments: 16 pages, 7 figures, International European Conference on Parallel and Distributed Computing (Euro-Par) 2018

  23. arXiv:1803.06078  [pdf, other]

    cs.CG

    Sampling Conditions for Conforming Voronoi Meshing by the VoroCrust Algorithm

    Authors: Ahmed Abdelkader, Chandrajit L. Bajaj, Mohamed S. Ebeida, Ahmed H. Mahmoud, Scott A. Mitchell, John D. Owens, Ahmad A. Rushdi

    Abstract: We study the problem of decomposing a volume bounded by a smooth surface into a collection of Voronoi cells. Unlike the dual problem of conforming Delaunay meshing, a principled solution to this problem for generic smooth surfaces remained elusive. VoroCrust leverages ideas from $α$-shapes and the power crust algorithm to produce unweighted Voronoi cells conforming to the surface, yielding the fir…

    Submitted 14 April, 2018; v1 submitted 16 March, 2018; originally announced March 2018.

    Comments: polished up version, results essentially unchanged

    ACM Class: I.3.5

  24. arXiv:1803.03922  [pdf, other]

    cs.DC cs.DS

    Scalable Breadth-First Search on a GPU Cluster

    Authors: Yuechao Pan, Roger Pearce, John D. Owens

    Abstract: On a GPU cluster, the ratio of high computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high and low out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction for high-d…

    Submitted 5 April, 2018; v1 submitted 11 March, 2018; originally announced March 2018.

    Comments: 12 pages, 13 figures. To appear at IPDPS 2018

  25. arXiv:1710.11246  [pdf, other]

    cs.DC

    A Dynamic Hash Table for the GPU

    Authors: Saman Ashkiani, Martin Farach-Colton, John D. Owens

    Abstract: We design and implement a fully concurrent dynamic hash table for GPUs with comparable performance to the state-of-the-art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional way of per-thread (or per-warp) work assignment and processing. By using this strategy, we build a dynamic non-blocki…

    Submitted 1 March, 2018; v1 submitted 30 October, 2017; originally announced October 2017.

    Comments: 11 pages, accepted to appear in the Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018)

  26. arXiv:1707.05354  [pdf, other]

    cs.DC

    GPU LSM: A Dynamic Dictionary Data Structure for the GPU

    Authors: Saman Ashkiani, Shengren Li, Martin Farach-Colton, Nina Amenta, John D. Owens

    Abstract: We develop a dynamic dictionary data structure for the GPU, supporting fast insertions and deletions, based on the Log Structured Merge tree (LSM). Our implementation on an NVIDIA K40c GPU has an average update (insertion or deletion) rate of 225 M elements/s, 13.5x faster than merging items into a sorted array. The GPU LSM supports the retrieval operations of lookup, count, and range query operat…

    Submitted 1 March, 2018; v1 submitted 17 July, 2017; originally announced July 2017.

    Comments: 11 pages, accepted to appear in the Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS'18)

  27. GPU Multisplit: an extended study of a parallel algorithm

    Authors: Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens

    Abstract: Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input…

    Submitted 18 May, 2017; v1 submitted 4 January, 2017; originally announced January 2017.

    Comments: 44 pages, to appear in ACM Transactions on Parallel Computing (TOPC): "Special Issue: invited papers from PPoPP 2016". This is an extended version of PPoPP'16 paper "GPU Multisplit"

    Journal ref: ACM Transactions on Parallel Computing (TOPC), Volume 4, Issue 1, Article No. 2, August 2017

  28. arXiv:1701.01170  [pdf, other]

    cs.DC

    Gunrock: GPU Graph Analytics

    Authors: Yangzihao Wang, Yuechao Pan, Andrew Davidson, Yuduo Wu, Carl Yang, Leyuan Wang, Muhammad Osama, Chenshan Yuan, Weitang Liu, Andy T. Riffel, John D. Owens

    Abstract: For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a…

    Submitted 4 January, 2017; originally announced January 2017.

    Comments: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU"

  29. arXiv:1606.05790  [pdf, other]

    cs.MS astro-ph.IM cs.DC cs.DS

    Mathematical Foundations of the GraphBLAS

    Authors: Jeremy Kepner, Peter Aaltonen, David Bader, Aydın Buluc, Franz Franchetti, John Gilbert, Dylan Hutchison, Manoj Kumar, Andrew Lumsdaine, Henning Meyerhenke, Scott McMillan, Jose Moreira, John D. Owens, Carl Yang, Marcin Zalewski, Timothy Mattson

    Abstract: The GraphBLAS standard (GraphBlas.org) is being developed to bring the potential of matrix based graph algorithms to the broadest possible audience. Mathematically the GraphBLAS defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the mathematics of th…

    Submitted 13 July, 2016; v1 submitted 18 June, 2016; originally announced June 2016.

    Comments: 9 pages; 11 figures; accepted to IEEE High Performance Extreme Computing (HPEC) conference 2016. arXiv admin note: text overlap with arXiv:1504.01039

  30. arXiv:1504.04804  [pdf, other]

    cs.DC

    Multi-GPU Graph Analytics

    Authors: Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, John D. Owens

    Abstract: We present a single-node, multi-GPU programmable graph processing library that allows programmers to easily extend single-GPU graph algorithms to achieve scalable performance on large graphs with billions of edges. Directly using the single-GPU implementations, our design only requires programmers to specify a few algorithm-dependent concerns, hiding most multi-GPU related implementation details.…

    Submitted 1 March, 2017; v1 submitted 19 April, 2015; originally announced April 2015.

    Comments: 12 pages. Final version submitted to IPDPS 2017

  31. Gunrock: A High-Performance Graph Processing Library on the GPU

    Authors: Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, John D. Owens

    Abstract: For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs have been two significant challenges for developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a verte…

    Submitted 22 February, 2016; v1 submitted 21 January, 2015; originally announced January 2015.

    Comments: 14 pages, accepted by PPoPP'16 (removed the text repetition in the previous version v5)

    ACM Class: D.1.3

  32. Piko: A Design Framework for Programmable Graphics Pipelines

    Authors: Anjul Patney, Stanley Tzeng, Kerry A. Seitz Jr., John D. Owens

    Abstract: We present Piko, a framework for designing, optimizing, and retargeting implementations of graphics pipelines on multiple architectures. Piko programmers express a graphics pipeline by organizing the computation within each stage into spatial bins and specifying a scheduling preference for these bins. Our compiler, Pikoc, compiles this input into an optimized implementation targeted to a massively…

    Submitted 29 January, 2015; v1 submitted 24 April, 2014; originally announced April 2014.

    Comments: 13 pages, updated for 2015

    ACM Class: I.3.1; I.3.2

    Journal ref: ACM Transactions on Graphics 34, 4 (July 2015), 147:1-147:13

  33. arXiv:1309.1230  [pdf, other]

    cs.DC

    A GPU Implementation for Two-Dimensional Shallow Water Modeling

    Authors: Kerry A. Seitz Jr., Alex Kennedy, Owen Ransom, Bassam A. Younis, John D. Owens

    Abstract: In this paper, we present a GPU implementation of a two-dimensional shallow water model. Water simulations are useful for modeling floods, river/reservoir behavior, and dam break scenarios. Our GPU implementation shows vast performance improvements over the original Fortran implementation. By taking advantage of the GPU, researchers and engineers will be able to study water systems more efficientl…

    Submitted 5 September, 2013; originally announced September 2013.

    Comments: 9 pages, 1 figure

  34. k-d Darts: Sampling by k-Dimensional Flat Searches

    Authors: Mohamed S. Ebeida, Anjul Patney, Scott A. Mitchell, Keith R. Dalbey, Andrew A. Davidson, John D. Owens

    Abstract: We formalize the notion of sampling a function using k-d darts. A k-d dart is a set of independent, mutually orthogonal, k-dimensional subspaces called k-d flats. Each dart has d choose k flats, aligned with the coordinate axes for efficiency. We show that k-d darts are useful for exploring a function's properties, such as estimating its integral, or finding an exemplar above a threshold. We descr…

    Submitted 15 February, 2013; originally announced February 2013.

    Comments: 19 pages, 16 figures

    ACM Class: I.3.5

    Journal ref: Transactions on Graphics, vol. 33, no. 1 (Jan 2014) pp. 3:1--3:16

  35. arXiv:1201.2936  [pdf, other]

    cs.CG cs.DS cs.GR

    Finding Convex Hulls Using Quickhull on the GPU

    Authors: Stanley Tzeng, John D. Owens

    Abstract: We present a convex hull algorithm that is accelerated on commodity graphics hardware. We analyze and identify the hurdles of writing a recursive divide and conquer algorithm on the GPU and devise a framework for representing this class of problems. Our framework transforms the recursive splitting step into a permutation step that is well-suited for graphics hardware. Our convex hull algorithm of…

    Submitted 13 January, 2012; originally announced January 2012.

    Comments: 11 pages

  36. arXiv:1110.4623  [pdf, other]

    cs.OS cs.DC cs.DS cs.GR

    Efficient Synchronization Primitives for GPUs

    Authors: Jeff A. Stuart, John D. Owens

    Abstract: In this paper, we revisit the design of synchronization primitives---specifically barriers, mutexes, and semaphores---and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of sleeping on the G…

    Submitted 20 October, 2011; originally announced October 2011.

    Comments: 13 pages with appendix, several figures, plans to submit to CompSci conference in early 2012

    ACM Class: D.4.1; I.3.2