Search | arXiv e-print repository

Nimble GNN Embedding with Tensor-Train Decomposition

Authors: Chunxing Yin, Da Zheng, Israt Nisa, Christos Faloutos, George Karypis, Richard Vuduc

Abstract: This paper describes a new method for representing embedding tables of graph neural networks (GNNs) more compactly via tensor-train (TT) decomposition. We consider the scenario where (a) the graph data that lack node features, thereby requiring the learning of embeddings during training; and (b) we wish to exploit GPU platforms, where smaller tables are needed to reduce host-to-GPU communication e… ▽ More This paper describes a new method for representing embedding tables of graph neural networks (GNNs) more compactly via tensor-train (TT) decomposition. We consider the scenario where (a) the graph data that lack node features, thereby requiring the learning of embeddings during training; and (b) we wish to exploit GPU platforms, where smaller tables are needed to reduce host-to-GPU communication even for large-memory GPUs. The use of TT enables a compact parameterization of the embedding, rendering it small enough to fit entirely on modern GPUs even for massive graphs. When combined with judicious schemes for initialization and hierarchical graph partitioning, this approach can reduce the size of node embedding vectors by 1,659 times to 81,362 times on large publicly available benchmark datasets, achieving comparable or better accuracy and significant speedups on multi-GPU systems. In some cases, our model without explicit node features on input can even match the accuracy of models that use node features. △ Less

Submitted 21 June, 2022; originally announced June 2022.

Comments: To appear in the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD 22)

arXiv:2204.05959 [pdf]

"Smarter" NICs for faster molecular dynamics: a case study

Authors: Sara Karamati, Clayton Hughes, K. Scott Hemmert, Ryan E. Grant, W. Whit Schonbein, Scott Levy, Thomas M. Conte, Jeffrey Young, Richard W. Vuduc

Abstract: This work evaluates the benefits of using a "smart" network interface card (SmartNIC) as a compute accelerator for the example of the MiniMD molecular dynamics proxy application. The accelerator is NVIDIA's BlueField-2 card, which includes an 8-core Arm processor along with a small amount of DRAM and storage. We test the networking and data movement performance of these cards compared to a standar… ▽ More This work evaluates the benefits of using a "smart" network interface card (SmartNIC) as a compute accelerator for the example of the MiniMD molecular dynamics proxy application. The accelerator is NVIDIA's BlueField-2 card, which includes an 8-core Arm processor along with a small amount of DRAM and storage. We test the networking and data movement performance of these cards compared to a standard Intel server host using microbenchmarks and MiniMD. In MiniMD, we identify two distinct classes of computation, namely core computation and maintenance computation, which are executed in sequence. We restructure the algorithm and code to weaken this dependence and increase task parallelism, thereby making it possible to increase utilization of the BlueField-2 concurrently with the host. We evaluate our implementation on a cluster consisting of 16 dual-socket Intel Broadwell host nodes with one BlueField-2 per host-node. Our results show that while the overall compute performance of BlueField-2 is limited, using them with a modified MiniMD algorithm allows for up to 20% speedup over the host CPU baseline with no loss in simulation accuracy. △ Less

Submitted 12 April, 2022; originally announced April 2022.

arXiv:2012.01571 [pdf, other]

Online Model Swap** in Architectural Simulation

Authors: Patrick Lavin, Jeffrey Young, Rich Vuduc, Jonathan Beard

Abstract: As systems and applications grow more complex, detailed simulation takes an ever increasing amount of time. The prospect of increased simulation time resulting in slower design iteration forces architects to use simpler models, such as spreadsheets, when they want to iterate quickly on a design. However, the task of migrating from a simple simulation to one with more detail often requires multiple… ▽ More As systems and applications grow more complex, detailed simulation takes an ever increasing amount of time. The prospect of increased simulation time resulting in slower design iteration forces architects to use simpler models, such as spreadsheets, when they want to iterate quickly on a design. However, the task of migrating from a simple simulation to one with more detail often requires multiple executions to find where simple models could be effective, which could be more expensive than running the detailed model in the first place. Also, architects must often rely on intuition to choose these simpler models, further complicating the problem. In this work, we present a method of bridging the gap between simple and detailed simulation by monitoring simulation behavior online and automatically swap** out detailed models with simpler statistical approximations. We demonstrate the potential of our methodology by implementing it in the open-source simulator SVE-Cachesim to swap out the level one data cache (L1D) within a memory hierarchy. This proof of concept demonstrates that our technique can handle a non-trivial use-case in not just approximation of local time-invariant statistics, but also those that vary with time (e.g., the L1D is a form of a time-series function), and downstream side-effects (e.g., the L1D filters accesses for the level two cache). Our simulation swaps out the built-in cache model with only an 8% error in the simulated cycle count while using the approximated cache models for over 90% of the simulation, and our simpler models require two to eight times less computation per "execution" of the model △ Less

Submitted 2 December, 2020; originally announced December 2020.

arXiv:1911.08630 [pdf, other]

CUP: Cluster Pruning for Compressing Deep Neural Networks

Authors: Rahul Duggal, Cao Xiao, Richard Vuduc, Jimeng Sun

Abstract: We propose Cluster Pruning (CUP) for compressing and accelerating deep neural networks. Our approach prunes similar filters by clustering them based on features derived from both the incoming and outgoing weight connections. With CUP, we overcome two limitations of prior work-(1) non-uniform pruning: CUP can efficiently determine the ideal number of filters to prune in each layer of a neural netwo… ▽ More We propose Cluster Pruning (CUP) for compressing and accelerating deep neural networks. Our approach prunes similar filters by clustering them based on features derived from both the incoming and outgoing weight connections. With CUP, we overcome two limitations of prior work-(1) non-uniform pruning: CUP can efficiently determine the ideal number of filters to prune in each layer of a neural network. This is in contrast to prior methods that either prune all layers uniformly or otherwise use resource-intensive methods such as manual sensitivity analysis or reinforcement learning to determine the ideal number. (2) Single-shot operation: We extend CUP to CUP-SS (for CUP single shot) whereby pruning is integrated into the initial training phase itself. This leads to large savings in training time compared to traditional pruning pipelines. Through extensive evaluation on multiple datasets (MNIST, CIFAR-10, and Imagenet) and models(VGG-16, Resnets-18/34/56) we show that CUP outperforms recent state of the art. Specifically, CUP-SS achieves 2.2x flops reduction for a Resnet-50 model trained on Imagenet while staying within 0.9% top-5 accuracy. It saves over 14 hours in training time with respect to the original Resnet-50. The code to reproduce results is available. △ Less

Submitted 19 November, 2019; originally announced November 2019.

arXiv:1904.03329 [pdf, other]

Load-Balanced Sparse MTTKRP on GPUs

Authors: Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, Richard Vuduc, P. Sadayappan

Abstract: Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most computationally expensive kernels in sparse tensor computations. This work focuses on optimizing the MTTKRP operation on GPUs, addressing both performance and storage requirements. We begin by identifying the performance bottlenecks in directly extending the state-of-the-art CSF (compressed sparse fiber) format from CPUs… ▽ More Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most computationally expensive kernels in sparse tensor computations. This work focuses on optimizing the MTTKRP operation on GPUs, addressing both performance and storage requirements. We begin by identifying the performance bottlenecks in directly extending the state-of-the-art CSF (compressed sparse fiber) format from CPUs to GPUs. A significant challenge with GPUs compared to multicore CPUs is that of utilizing the much greater degree of parallelism in a load-balanced fashion for irregular computations like sparse MTTKRP. To address this issue, we develop a new storage-efficient representation for tensors that enables high-performance, load-balanced execution of MTTKRP on GPUs. A GPU implementation of sparse MTTKRP using the new sparse tensor representation is shown to outperform all currently known parallel sparse CPU and GPU MTTKRP implementations. △ Less

Submitted 5 April, 2019; originally announced April 2019.

arXiv:1901.02775 [pdf, other]

Programming Strategies for Irregular Algorithms on the Emu Chick

Authors: Eric Hein, Srinivas Eswar, Abdurrahman Yaşar, Jiajia Li, Jeffrey S. Young, Thomas M. Conte, Ümit V. Çatalyürek, Rich Vuduc, Jason Riedy, Bora Uçar

Abstract: The Emu Chick prototype implements migratory memory-side processing in a novel hardware system. Rather than transferring large amounts of data across the system interconnect, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each remote memory read. Previous work has characterized the performance of the Chick prototype in terms of memory bandwidth and pro… ▽ More The Emu Chick prototype implements migratory memory-side processing in a novel hardware system. Rather than transferring large amounts of data across the system interconnect, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each remote memory read. Previous work has characterized the performance of the Chick prototype in terms of memory bandwidth and programming differences from more typical, non-migratory platforms, but there has not yet been an analysis of algorithms on this system. This work evaluates irregular algorithms that could benefit from the lightweight, memory-side processing of the Chick and demonstrates techniques and optimization strategies for achieving performance in sparse matrix-vector multiply operation (SpMV), breadth-first search (BFS), and graph alignment across up to eight distributed nodes encompassing 64 nodelets in the Chick system. We also define and justify relative metrics to compare prototype FPGA-based hardware with established ASIC architectures. The Chick currently supports up to 68x scaling for graph alignment, 80 MTEPS for BFS on balanced graphs, and 50\% of measured STREAM bandwidth for SpMV. △ Less

Submitted 3 December, 2018; originally announced January 2019.

arXiv:1811.03743 [pdf, other]

Spatter: A Tool for Evaluating Gather / Scatter Performance

Authors: Patrick Lavin, Jeffrey Young, Jason Riedy, Richard Vuduc, Aaron Vose, Dan Ernst

Abstract: This paper describes a new benchmark tool, Spatter, for assessing memory system architectures in the context of a specific category of indexed accesses known as gather and scatter. These types of operations are increasingly used to express sparse and irregular data access patterns, and they have widespread utility in many modern HPC applications including scientific simulations, data mining and an… ▽ More This paper describes a new benchmark tool, Spatter, for assessing memory system architectures in the context of a specific category of indexed accesses known as gather and scatter. These types of operations are increasingly used to express sparse and irregular data access patterns, and they have widespread utility in many modern HPC applications including scientific simulations, data mining and analysis computations, and graph processing. However, many traditional benchmarking tools like STREAM, STRIDE, and GUPS focus on characterizing only uniform stride or fully random accesses despite evidence that modern applications use varied sets of more complex access patterns. Spatter is an open-source benchmark that provides a tunable and configurable framework to benchmark a variety of indexed access patterns, including variations of gather/scatter that are seen in HPC mini-apps evaluated in this work. The design of Spatter includes tunable backends for OpenMP and CUDA, and experiments show how it can be used to evaluate 1) uniform access patterns for CPU and GPU, 2) prefetching regimes for gather/scatter, 3) compiler implementations of vectorization for gather/scatter, and 4) trace-driven "proxy patterns" that reflect the patterns found in multiple applications. The results from Spatter experiments show that GPUs typically outperform CPUs for these operations, and that Spatter can better represent the performance of some cache-dependent mini-apps than traditional STREAM bandwidth measurements. △ Less

Submitted 7 July, 2020; v1 submitted 8 November, 2018; originally announced November 2018.

Comments: Updated paper results and text to reflect longer conference submission limit

arXiv:1809.07696 [pdf, other]

doi 10.1016/j.parco.2019.04.012

A Microbenchmark Characterization of the Emu Chick

Authors: Jeffrey S. Young, Eric Hein, Srinivas Eswar, Patrick Lavin, Jiajia Li, Jason Riedy, Richard Vuduc, Thomas M. Conte

Abstract: The Emu Chick is a prototype system designed around the concept of migratory memory-side processing. Rather than transferring large amounts of data across power-hungry, high-latency interconnects, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each memory read. The current prototype hardware uses FPGAs to implement cache-less "Gossamer cores for doing… ▽ More The Emu Chick is a prototype system designed around the concept of migratory memory-side processing. Rather than transferring large amounts of data across power-hungry, high-latency interconnects, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each memory read. The current prototype hardware uses FPGAs to implement cache-less "Gossamer cores for doing computational work and a stationary core to run basic operating system functions and migrate threads between nodes. In this multi-node characterization of the Emu Chick, we extend an earlier single-node investigation (Hein, et al. AsHES 2018) of the the memory bandwidth characteristics of the system through benchmarks like STREAM, pointer chasing, and sparse matrix-vector multiplication. We compare the Emu Chick hardware to architectural simulation and an Intel Xeon-based platform. Our results demonstrate that for many basic operations the Emu Chick can use available memory bandwidth more efficiently than a more traditional, cache-based architecture although bandwidth usage suffers for computationally intensive workloads like SpMV. Moreover, the Emu Chick provides stable, predictable performance with up to 65% of the peak bandwidth utilization on a random-access pointer chasing benchmark with weak locality. △ Less

Submitted 31 May, 2019; v1 submitted 7 September, 2018; originally announced September 2018.

Journal ref: Parallel Computing, 2019, ISSN 0167-8191

arXiv:1808.07832 [pdf, ps, other]

A Simple Methodology for Computing Families of Algorithms

Authors: Devangi N. Parikh, Margaret E. Myers, Richard Vuduc, Robert A. van de Geijn

Abstract: Discovering "good" algorithms for an operation is often considered an art best left to experts. What if there is a simple methodology, an algorithm, for systematically deriving a family of algorithms as well as their cost analyses, so that the best algorithm can be chosen? We discuss such an approach for deriving loop-based algorithms. The example used to illustrate this methodology, evaluation of… ▽ More Discovering "good" algorithms for an operation is often considered an art best left to experts. What if there is a simple methodology, an algorithm, for systematically deriving a family of algorithms as well as their cost analyses, so that the best algorithm can be chosen? We discuss such an approach for deriving loop-based algorithms. The example used to illustrate this methodology, evaluation of a polynomial, is itself simple yet the best algorithm that results is surprising to a non-expert: Horner's rule. We finish by discussing recent advances that make this approach highly practical for the domain of high-performance linear algebra software libraries. △ Less

Submitted 20 August, 2018; originally announced August 2018.

Report number: FLAME Working Note #87, The University of Texas at Austin, Department of Computer Science, Technical Report TR-18-06

arXiv:1805.00569 [pdf, other]

doi 10.1145/3205289.3205290

Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems

Authors: Yang You, James Demmel, Cho-Jui Hsieh, Richard Vuduc

Abstract: We propose two new methods to address the weak scaling problems of KRR: the Balanced KRR (BKRR) and K-means KRR (KKRR). These methods consider alternative ways to partition the input dataset into p different parts, generating p different models, and then selecting the best model among them. Compared to a conventional implementation, KKRR2 (optimized version of KKRR) improves the weak scaling effic… ▽ More We propose two new methods to address the weak scaling problems of KRR: the Balanced KRR (BKRR) and K-means KRR (KKRR). These methods consider alternative ways to partition the input dataset into p different parts, generating p different models, and then selecting the best model among them. Compared to a conventional implementation, KKRR2 (optimized version of KKRR) improves the weak scaling efficiency from 0.32% to 38% and achieves a 591times speedup for getting the same accuracy by using the same data and the same hardware (1536 processors). BKRR2 (optimized version of BKRR) achieves a higher accuracy than the current fastest method using less training time for a variety of datasets. For the applications requiring only approximate solutions, BKRR2 improves the weak scaling efficiency to 92% and achieves 3505 times speedup (theoretical speedup: 4096 times). △ Less

Submitted 1 May, 2018; originally announced May 2018.

Comments: This paper has been accepted by ACM International Conference on Supercomputing (ICS) 2018

arXiv:1803.05473 [pdf, other]

SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenoty**

Authors: Ioakeim Perros, Evangelos E. Papalexakis, Haesun Park, Richard Vuduc, Xiaowei Yan, Christopher Defilippi, Walter F. Stewart, Jimeng Sun

Abstract: This paper presents a new method, which we call SUSTain, that extends real-valued matrix and tensor factorizations to data where values are integers. Such data are common when the values correspond to event counts or ordinal measures. The conventional approach is to treat integer data as real, and then apply real-valued factorizations. However, doing so fails to preserve important characteristics… ▽ More This paper presents a new method, which we call SUSTain, that extends real-valued matrix and tensor factorizations to data where values are integers. Such data are common when the values correspond to event counts or ordinal measures. The conventional approach is to treat integer data as real, and then apply real-valued factorizations. However, doing so fails to preserve important characteristics of the original data, thereby making it hard to interpret the results. Instead, our approach extracts factor values from integer datasets as scores that are constrained to take values from a small integer set. These scores are easy to interpret: a score of zero indicates no feature contribution and higher scores indicate distinct levels of feature importance. At its core, SUSTain relies on: a) a problem partitioning into integer-constrained subproblems, so that they can be optimally solved in an efficient manner; and b) organizing the order of the subproblems' solution, to promote reuse of shared intermediate results. We propose two variants, SUSTain_M and SUSTain_T, to handle both matrix and tensor inputs, respectively. We evaluate SUSTain against several state-of-the-art baselines on both synthetic and real Electronic Health Record (EHR) datasets. Comparing to those baselines, SUSTain shows either significantly better fit or orders of magnitude speedups that achieve a comparable fit (up to 425X faster). We apply SUSTain to EHR datasets to extract patient phenotypes (i.e., clinically meaningful patient clusters). Furthermore, 87% of them were validated as clinically meaningful phenotypes related to heart failure by a cardiologist. △ Less

Submitted 14 March, 2018; originally announced March 2018.

arXiv:1703.04219 [pdf, other]

SPARTan: Scalable PARAFAC2 for Large & Sparse Data

Authors: Ioakeim Perros, Evangelos E. Papalexakis, Fei Wang, Richard Vuduc, Elizabeth Searles, Michael Thompson, Jimeng Sun

Abstract: In exploratory tensor mining, a common problem is how to analyze a set of variables across a set of subjects whose observations do not align naturally. For example, when modeling medical features across a set of patients, the number and duration of treatments may vary widely in time, meaning there is no meaningful way to align their clinical records across time points for analysis purposes. To han… ▽ More In exploratory tensor mining, a common problem is how to analyze a set of variables across a set of subjects whose observations do not align naturally. For example, when modeling medical features across a set of patients, the number and duration of treatments may vary widely in time, meaning there is no meaningful way to align their clinical records across time points for analysis purposes. To handle such data, the state-of-the-art tensor model is the so-called PARAFAC2, which yields interpretable and robust output and can naturally handle sparse data. However, its main limitation up to now has been the lack of efficient algorithms that can handle large-scale datasets. In this work, we fill this gap by develo** a scalable method to compute the PARAFAC2 decomposition of large and sparse datasets, called SPARTan. Our method exploits special structure within PARAFAC2, leading to a novel algorithmic reformulation that is both fast (in absolute time) and more memory-efficient than prior work. We evaluate SPARTan on both synthetic and real datasets, showing 22X performance gains over the best previous implementation and also handling larger problem instances for which the baseline fails. Furthermore, we are able to apply SPARTan to the mining of temporally-evolving phenotypes on data taken from real and medically complex pediatric patients. The clinical meaningfulness of the phenotypes identified in this process, as well as their temporal evolution over time for several patients, have been endorsed by clinical experts. △ Less

Submitted 12 March, 2017; originally announced March 2017.

arXiv:1611.04255

Efficient Communications in Training Large Scale Neural Networks

Authors: Linnan Wang, Wei Wu, George Bosilca, Richard Vuduc, Zenglin Xu

Abstract: We consider the problem of how to reduce the cost of communication that is required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, like broadcasts of parameters or reductions for sub-gradient aggregations, which for large messages quickly dominates overal… ▽ More We consider the problem of how to reduce the cost of communication that is required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, like broadcasts of parameters or reductions for sub-gradient aggregations, which for large messages quickly dominates overall execution time and limits parallel scalability. To address this problem, we develop a new technique for collective operations, referred to as Linear Pipelining (LP). It is tuned to the message sizes that arise in BSP-SGD, and works effectively on multi-GPU systems. Theoretically, the cost of LP is invariant to $P$, where $P$ is the number of GPUs, while the cost of more conventional Minimum Spanning Tree (MST) scales like $O(\log P)$. LP also demonstrate up to 2x faster bandwidth than Bidirectional Exchange (BE) techniques that are widely adopted by current MPI implementations. We apply these collectives to BSP-SGD, showing that the proposed implementations reduce communication bottlenecks in practice while preserving the attractive convergence properties of BSP-SGD. △ Less

Submitted 15 April, 2017; v1 submitted 14 November, 2016; originally announced November 2016.

Comments: This paper has been withdrawn by the author due to a crucial sign error in equation 1

arXiv:1610.07722 [pdf, other]

Sparse Hierarchical Tucker Factorization and its Application to Healthcare

Authors: Ioakeim Perros, Robert Chen, Richard Vuduc, Jimeng Sun

Abstract: We propose a new tensor factorization method, called the Sparse Hierarchical-Tucker (Sparse H-Tucker), for sparse and high-order data tensors. Sparse H-Tucker is inspired by its namesake, the classical Hierarchical Tucker method, which aims to compute a tree-structured factorization of an input data set that may be readily interpreted by a domain expert. However, Sparse H-Tucker uses a nested samp… ▽ More We propose a new tensor factorization method, called the Sparse Hierarchical-Tucker (Sparse H-Tucker), for sparse and high-order data tensors. Sparse H-Tucker is inspired by its namesake, the classical Hierarchical Tucker method, which aims to compute a tree-structured factorization of an input data set that may be readily interpreted by a domain expert. However, Sparse H-Tucker uses a nested sampling technique to overcome a key scalability problem in Hierarchical Tucker, which is the creation of an unwieldy intermediate dense core tensor; the result of our approach is a faster, more space-efficient, and more accurate method. We extensively test our method on a real healthcare dataset, which is collected from 30K patients and results in an 18th order sparse data tensor. Unlike competing methods, Sparse H-Tucker can analyze the full data set on a single multi-threaded machine. It can also do so more accurately and in less time than the state-of-the-art: on a 12th order subset of the input data, Sparse H-Tucker is 18x more accurate and 7.5x faster than a previously state-of-the-art method. Even for analyzing low order tensors (e.g., 4-order), our method requires close to an order of magnitude less time and over two orders of magnitude less memory, as compared to traditional tensor factorization methods such as CP and Tucker. Moreover, we observe that Sparse H-Tucker scales nearly linearly in the number of non-zero tensor elements. The resulting model also provides an interpretable disease hierarchy, which is confirmed by a clinical expert. △ Less

Submitted 25 October, 2016; originally announced October 2016.

Comments: This is an extended version of a paper presented at the 15th IEEE International Conference on Data Mining (ICDM 2015)

arXiv:1603.00491 [pdf, other]

Wanted: Floating-Point Add Round-off Error instruction

Authors: Marat Dukhan, Richard Vuduc, Jason Riedy

Abstract: We propose a new instruction (FPADDRE) that computes the round-off error in floating-point addition. We explain how this instruction benefits high-precision arithmetic operations in applications where double precision is not sufficient. Performance estimates on Intel Haswell, Intel Skylake, and AMD Steamroller processors, as well as Intel Knights Corner co-processor, demonstrate that such an instr… ▽ More We propose a new instruction (FPADDRE) that computes the round-off error in floating-point addition. We explain how this instruction benefits high-precision arithmetic operations in applications where double precision is not sufficient. Performance estimates on Intel Haswell, Intel Skylake, and AMD Steamroller processors, as well as Intel Knights Corner co-processor, demonstrate that such an instruction would improve the latency of double-double addition by up to 55% and increase double-double addition throughput by up to 103%, with smaller, but non-negligible benefits for double-double multiplication. The new instruction delivers up to 2x speedups on three benchmarks that use high-precision floating-point arithmetic: double-double matrix-matrix multiplication, compensated dot product, and polynomial evaluation via the compensated Horner scheme. △ Less

Submitted 1 March, 2016; originally announced March 2016.

arXiv:1411.1460 [pdf, other]

Branch-Avoiding Graph Algorithms

Authors: Oded Green, Marat Dukhan, Richard Vuduc

Abstract: This paper quantifies the impact of branches and branch mispredictions on the single-core performance for two classes of graph problems. Specifically, we consider classical algorithms for computing connected components and breadth-first search (BFS). We show that branch mispredictions are costly and can reduce performance by as much as 30%-50%. This insight suggests that one should seek graph algo… ▽ More This paper quantifies the impact of branches and branch mispredictions on the single-core performance for two classes of graph problems. Specifically, we consider classical algorithms for computing connected components and breadth-first search (BFS). We show that branch mispredictions are costly and can reduce performance by as much as 30%-50%. This insight suggests that one should seek graph algorithms and implementations that avoid branches. As a proof-of-concept, we devise such implementations for both the classic top-down algorithm for BFS and the Shiloach-Vishkin algorithm for connected components. We evaluate these implementations on current x86 and ARM-based processors to show the efficacy of the approach. Our results suggest how both compiler writers and architects might exploit this insight to improve graph processing systems more broadly and create better systems for such problems. △ Less

Submitted 5 November, 2014; originally announced November 2014.

ACM Class: C.0; C.4; E.1

arXiv:1309.1828 [pdf]

Sustainable Software Development for Next-Gen Sequencing (NGS) Bioinformatics on Emerging Platforms

Authors: Shel Swenson, Yogesh Simmhan, Viktor Prasanna, Manish Parashar, Jason Riedy, David Bader, Richard Vuduc

Abstract: DNA sequence analysis is fundamental to life science research. The rapid development of next generation sequencing (NGS) technologies, and the richness and diversity of applications it makes feasible, have created an enormous gulf between the potential of this technology and the development of computational methods to realize this potential. Bridging this gap holds possibilities for broad impacts… ▽ More DNA sequence analysis is fundamental to life science research. The rapid development of next generation sequencing (NGS) technologies, and the richness and diversity of applications it makes feasible, have created an enormous gulf between the potential of this technology and the development of computational methods to realize this potential. Bridging this gap holds possibilities for broad impacts toward multiple grand challenges and offers unprecedented opportunities for software innovation and research. We argue that NGS-enabled applications need a critical mass of sustainable software to benefit from emerging computing platforms' transformative potential. Accumulating the necessary critical mass will require leaders in computational biology, bioinformatics, computer science, and computer engineering work together to identify core opportunity areas, critical software infrastructure, and software sustainability challenges. Furthermore, due to the quickly changing nature of both bioinformatics software and accelerator technology, we conclude that creating sustainable accelerated bioinformatics software means constructing a sustainable bridge between the two fields. In particular, sustained collaboration between domain developers and technology experts is needed to develop the accelerated kernels, libraries, frameworks and middleware that could provide the needed flexible link from NGS bioinformatics applications to emerging platforms. △ Less

Submitted 26 October, 2013; v1 submitted 7 September, 2013; originally announced September 2013.

Comments: 4 pages

Showing 1–17 of 17 results for author: Vuduc, R