Search | arXiv e-print repository

Code Generation for a Variety of Accelerators for a Graph DSL

Authors: Ashwina Kumar, M. Venkata Krishna, Prasanna Bartakke, Rahul Kumar, Rajesh Pandian M, Nibedita Behera, Rupesh Nasre

Abstract: Sparse graphs are ubiquitous in real and virtual worlds. With the phenomenal growth in semi-structured and unstructured data, sizes of the underlying graphs have witnessed a rapid growth over the years. Analyzing such large structures necessitates parallel processing, which is challenged by the intrinsic irregularity of sparse computation, memory access, and communication. It would be ideal if pro… ▽ More Sparse graphs are ubiquitous in real and virtual worlds. With the phenomenal growth in semi-structured and unstructured data, sizes of the underlying graphs have witnessed a rapid growth over the years. Analyzing such large structures necessitates parallel processing, which is challenged by the intrinsic irregularity of sparse computation, memory access, and communication. It would be ideal if programmers and domain-experts get to focus only on the sequential computation and a compiler takes care of auto-generating the parallel code. On the other side, there is a variety in the number of target hardware devices, and achieving optimal performance often demands coding in specific languages or frameworks. Our goal in this work is to focus on a graph DSL which allows the domain-experts to write almost-sequential code, and generate parallel code for different accelerators from the same algorithmic specification. In particular, we illustrate code generation from the StarPlat graph DSL for NVIDIA, AMD, and Intel GPUs using CUDA, OpenCL, SYCL, and OpenACC programming languages. Using a suite of ten large graphs and four popular algorithms, we present the efficacy of StarPlat's versatile code generator. △ Less

Submitted 4 January, 2024; originally announced January 2024.

Comments: arXiv admin note: text overlap with arXiv:2305.03317

arXiv:2306.08252 [pdf, other]

GraphVine: A Data Structure to Optimize Dynamic Graph Processing on GPUs

Authors: Rohith Krishnan S, Venkata Kalyan Tavva, Rupesh Nasre

Abstract: Graph processing on GPUs is gaining momentum due to the high throughputs observed compared to traditional CPUs, attributed to the vast number of processing cores on GPUs that can exploit parallelism in graph analytics. This paper discusses a graph data structure for dynamic graph processing on GPUs. Unlike static graphs, dynamic graphs mutate over their lifetime through vertex and/or edge batch up… ▽ More Graph processing on GPUs is gaining momentum due to the high throughputs observed compared to traditional CPUs, attributed to the vast number of processing cores on GPUs that can exploit parallelism in graph analytics. This paper discusses a graph data structure for dynamic graph processing on GPUs. Unlike static graphs, dynamic graphs mutate over their lifetime through vertex and/or edge batch updates. The proposed work aims to provide fast batch updates and graph querying without consuming too much GPU memory. Experimental results show improved initialization timings by 1968-1269024%, improved batch edge insert timings by 30-30047%, and improved batch edge delete timings by 50-25262% while consuming less memory when the batch size is large. △ Less

Submitted 26 July, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

arXiv:2305.17813 [pdf, ps, other]

Meerkat: A framework for Dynamic Graph Algorithms on GPUs

Authors: Kevin Jude Concessao, Unnikrishnan Cheramangalath, MJ Ricky Dev, Rupesh Nasre

Abstract: Graph algorithms are challenging to implement due to their varying topology and irregular access patterns. Real-world graphs are dynamic in nature and routinely undergo edge and vertex additions, as well as, deletions. Typical examples of dynamic graphs are social networks, collaboration networks, and road networks. Applying static algorithms repeatedly on dynamic graphs is inefficient. Unfortunat… ▽ More Graph algorithms are challenging to implement due to their varying topology and irregular access patterns. Real-world graphs are dynamic in nature and routinely undergo edge and vertex additions, as well as, deletions. Typical examples of dynamic graphs are social networks, collaboration networks, and road networks. Applying static algorithms repeatedly on dynamic graphs is inefficient. Unfortunately, we know little about how to efficiently process dynamic graphs on massively parallel architectures such as GPUs. Existing approaches to represent and process dynamic graphs are either not general or inefficient. In this work, we propose a library-based framework for dynamic graph algorithms that proposes a GPU-tailored graph representation and exploits the warp-cooperative execution model. The library, named Meerkat, builds upon a recently proposed dynamic graph representation on GPUs. This representation exploits a hashtable-based mechanism to store a vertex's neighborhood. Meerkat also enables fast iteration through a group of vertices, such as the whole set of vertices or the neighbors of a vertex. Based on the efficient iterative patterns encoded in Meerkat, we implement dynamic versions of the popular graph algorithms such as breadth-first search, single-source shortest paths, triangle counting, weakly connected components, and PageRank. Compared to the state-of-the-art dynamic graph analytics framework Hornet, Meerkat is $12.6\times$, $12.94\times$, and $6.1\times$ faster, for query, insert, and delete operations, respectively. Using a variety of real-world graphs, we observe that Meerkat significantly improves the efficiency of the underlying dynamic graph algorithm. Meerkat performs $1.17\times$ for BFS, $1.32\times$ for SSSP, $1.74\times$ for PageRank, and $6.08\times$ for WCC, better than Hornet on average. △ Less

Submitted 2 June, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

arXiv:2305.03317 [pdf, other]

StarPlat: A Versatile DSL for Graph Analytics

Authors: Nibedita Behera, Ashwina Kumar, Ebenezer Rajadurai T, Sai Nitish, Rajesh Pandian M, Rupesh Nasre

Abstract: Graphs model several real-world phenomena. With the growth of unstructured and semi-structured data, parallelization of graph algorithms is inevitable. Unfortunately, due to inherent irregularity of computation, memory access, and communication, graph algorithms are traditionally challenging to parallelize. To tame this challenge, several libraries, frameworks, and domain-specific languages (DSLs)… ▽ More Graphs model several real-world phenomena. With the growth of unstructured and semi-structured data, parallelization of graph algorithms is inevitable. Unfortunately, due to inherent irregularity of computation, memory access, and communication, graph algorithms are traditionally challenging to parallelize. To tame this challenge, several libraries, frameworks, and domain-specific languages (DSLs) have been proposed to reduce the parallel programming burden of the users, who are often domain experts. However, existing frameworks to model graph algorithms typically target a single architecture. In this paper, we present a graph DSL, named StarPlat, that allows programmers to specify graph algorithms in a high-level format, but generates code for three different backends from the same algorithmic specification. In particular, the DSL compiler generates OpenMP for multi-core, MPI for distributed, and CUDA for many-core GPUs. Since these three are completely different parallel programming paradigms, binding them together under the same language is challenging. We share our experience with the language design. Central to our compiler is an intermediate representation which allows a common representation of the high-level program, from which individual backend code generations begin. We demonstrate the expressiveness of StarPlat by specifying four graph algorithms: betweenness centrality computation, page rank computation, single-source shortest paths, and triangle counting. We illustrate the effectiveness of our approach by comparing the performance of the generated codes with that obtained with hand-crafted library codes. We find that the generated code is competitive to library-based codes in many cases. More importantly, we show the feasibility to generate efficient codes for different target architectures from the same algorithmic specification of graph algorithms. △ Less

Submitted 5 May, 2023; originally announced May 2023.

Comments: 30 pages, 21 figures

arXiv:2008.05718 [pdf, other]

A Fine-Grained Hybrid CPU-GPU Algorithm for Betweenness Centrality Computations

Authors: Ashirbad Mishra, Sathish Vadhiyar, Rupesh Nasre, Keshav **ali

Abstract: Betweenness centrality (BC) is an important graph analytical application for large-scale graphs. While there are many efforts for parallelizing betweenness centrality algorithms on multi-core CPUs and many-core GPUs, in this work, we propose a novel fine-grained CPU-GPU hybrid algorithm that partitions a graph into CPU and GPU partitions, and performs BC computations for the graph on both the CPU… ▽ More Betweenness centrality (BC) is an important graph analytical application for large-scale graphs. While there are many efforts for parallelizing betweenness centrality algorithms on multi-core CPUs and many-core GPUs, in this work, we propose a novel fine-grained CPU-GPU hybrid algorithm that partitions a graph into CPU and GPU partitions, and performs BC computations for the graph on both the CPU and GPU resources simultaneously with very small number of CPU-GPU communications. The forward phase in our hybrid BC algorithm leverages the multi-source property inherent in the BC problem. We also perform a novel hybrid and asynchronous backward phase that performs minimal CPU-GPU synchronizations. Evaluations using a large number of graphs with different characteristics show that our hybrid approach gives 80% improvement in performance, and 80-90% less CPU-GPU communications than an existing hybrid algorithm based on the popular Bulk Synchronous Paradigm (BSP) approach. △ Less

Submitted 13 August, 2020; originally announced August 2020.

arXiv:2004.11044 [pdf, ps, other]

BOLD: An Ontology-based Log Debugger for C Programs

Authors: Dileep Kumar P, Rupesh Nasre, Sreenivasa Kumar P

Abstract: The different activities related to debugging such as program instrumentation, representation of execution trace and analysis of trace are not typically performed in an unified framework. We propose \textit{BOLD}, an Ontology-based Log Debugger to unify and standardize the activities in debugging. The syntactical information of programs can be represented in the from of Resource Description Framew… ▽ More The different activities related to debugging such as program instrumentation, representation of execution trace and analysis of trace are not typically performed in an unified framework. We propose \textit{BOLD}, an Ontology-based Log Debugger to unify and standardize the activities in debugging. The syntactical information of programs can be represented in the from of Resource Description Framework (RDF) triples. Using the BOLD framework, the programs can be automatically instrumented by using declarative specifications over these triples. A salient feature of the framework is to store the execution trace of the program also as RDF triples called \textit{trace triples}. These triples can be queried to implement the common debug operations. The novelty of the framework is to abstract these triples as \textit{spans} for high-level reasoning. A span gives a way of examining the values of a particular variable over certain portion of the program execution. The properties of the spans are defined formally as a Web Ontology Language (OWL) ontology called \textit{Program Debug (PD) Ontology}. Using the span abstraction and PD ontology, end-users can debug a given buggy program in a standard manner. A notable feature of using ontology is that users can accurately debug in some cases of missing information, which can be practically useful. To demonstrate the feasibility of the proposed framework, we have debugged the programs in a standard bug benchmark suite Software-artifact Infrastructure Repository (SIR). Experiments show that the querying time is almost the same as in \texttt{gdb}. The reasoning time depends on the sub-language of OWL. We find that the expressibility offered by OWL-DL language is sufficient for the bugs in SIR programs; but to achieve scalability in reasoning, a restricted OWL-RL language is required. △ Less

Submitted 23 April, 2020; originally announced April 2020.

Comments: 16 pages, 4 tables, 4 figures

arXiv:1912.00966 [pdf, other]

GPU Algorithm for Earliest Arrival Time Problem in Public Transport Networks

Authors: Chirayu Anant Haryan, G. Ramakrishna, Rupesh Nasre, Allam Dinesh Reddy

Abstract: Given a temporal graph G, a source vertex s, and a departure time at source vertex t_s, the earliest arrival time problem EAT is to start from s on or after t_s and reach all the vertices in G as early as possible. Ni et al. have proposed a parallel algorithm for EAT and obtained a speedup up to 9.5 times on real-world graphs with respect to the connection-scan serial algorithm by using multi-core… ▽ More Given a temporal graph G, a source vertex s, and a departure time at source vertex t_s, the earliest arrival time problem EAT is to start from s on or after t_s and reach all the vertices in G as early as possible. Ni et al. have proposed a parallel algorithm for EAT and obtained a speedup up to 9.5 times on real-world graphs with respect to the connection-scan serial algorithm by using multi-core processors. We propose a topology-driven parallel algorithm for EAT on public transport networks and implement using general-purpose programming on the graphics processing unit GPU. A temporal edge or connection in a temporal graph for a public transport network is associated with a departure time and a duration time, and many connections exist from u to v for an edge (u,v). We propose two pruning techniques connection-type and clustering, and use arithmetic progression technique appropriately to process many connections of an edge, without scanning all of them. In the connection-type technique, the connections of an edge with the same duration are grouped together. In the clustering technique, we follow 24-hour format and the connections of an edge are partitioned into 24 clusters so that the departure time of connections in the i^{th} cluster is at least i-hour and at most i+1-hour. The arithmetic progression technique helps to store a sequence of departure times of various connections in a compact way. We propose a hybrid approach to combine the three techniques connection-type, clustering and arithmetic progression in an appropriate way. Our techniques achieve an average speedup up to 59.09 times when compared to the existing connection-scan serial algorithm running on CPU. Also, the average speedup of our algorithm is 12.48 times against the parallel edge-scan-dependency graph algorithm running on GPU. △ Less

Submitted 2 December, 2019; originally announced December 2019.

arXiv:1903.01665 [pdf, ps, other]

Custom Code Generation for a Graph DSL

Authors: Bikash Gogoi, Unnikrishnan Cheramangalath, Rupesh Nasre

Abstract: Graph algorithms are at the heart of several applications, and achieving high performance with them has become critical due to the tremendous growth of irregular data. However, irregular algorithms are quite challenging to parallelize automatically, due to access patterns influenced by the input graph, which is unavailable until execution. Former research has addressed this issue by designing doma… ▽ More Graph algorithms are at the heart of several applications, and achieving high performance with them has become critical due to the tremendous growth of irregular data. However, irregular algorithms are quite challenging to parallelize automatically, due to access patterns influenced by the input graph, which is unavailable until execution. Former research has addressed this issue by designing domain-specific languages (DSLs) for graph algorithms, which restrict generality but allow efficient code-generation for various backends. Such DSLs are, however, too rigid, and do not adapt to changes in backends or to input graph properties or to both. We narrate our experiences in making an existing DSL, named Falcon, adaptive. The biggest challenge in the process is to not change the DSL code for specifying the algorithm. We illustrate the effectiveness of our proposal by auto-generating codes for vertex-based versus edge-based graph processing, synchronous versus asynchronous execution, and CPU versus GPU backends from the same specification. △ Less

Submitted 4 March, 2019; originally announced March 2019.

Comments: submitted to Conference

MSC Class: 68-00

arXiv:1711.00231 [pdf, other]

Dynamic Load Balancing Strategies for Graph Applications on GPUs

Authors: Ananya Raval, Rupesh Nasre, Vivek Kumar, Vasudevan R, Sathish Vadhiyar, Keshav **ali

Abstract: Acceleration of graph applications on GPUs has found large interest due to the ubiquitous use of graph processing in various domains. The inherent \textit{irregularity} in graph applications leads to several challenges for parallelization. A key challenge, which we address in this paper, is that of load-imbalance. If the work-assignment to threads uses node-based graph partitioning, it can result… ▽ More Acceleration of graph applications on GPUs has found large interest due to the ubiquitous use of graph processing in various domains. The inherent \textit{irregularity} in graph applications leads to several challenges for parallelization. A key challenge, which we address in this paper, is that of load-imbalance. If the work-assignment to threads uses node-based graph partitioning, it can result in skewed task-distribution, leading to poor load-balance. In contrast, if the work-assignment uses edge-based graph partitioning, the load-balancing is better, but the memory requirement is relatively higher. This makes it unsuitable for large graphs. In this work, we propose three techniques for improved load-balancing of graph applications on GPUs. Each technique brings in unique advantages, and a user may have to employ a specific technique based on the requirement. Using Breadth First Search and Single Source Shortest Paths as our processing kernels, we illustrate the effectiveness of each of the proposed techniques in comparison to the existing node-based and edge-based mechanisms. △ Less

Submitted 1 November, 2017; originally announced November 2017.

Showing 1–9 of 9 results for author: Nasre, R