-
A Scalable and Effective Alternative to Graph Transformers
Authors:
Kaan Sancak,
Zhigang Hua,
** Fang,
Yan Xie,
Andrey Malevich,
Bo Long,
Muhammed Fatih Balin,
Ümit V. Çatalyürek
Abstract:
Graph Neural Networks (GNNs) have shown impressive performance in graph representation learning, but they face challenges in capturing long-range dependencies due to their limited expressive power. To address this, Graph Transformers (GTs) were introduced, utilizing self-attention mechanism to effectively model pairwise node relationships. Despite their advantages, GTs suffer from quadratic comple…
▽ More
Graph Neural Networks (GNNs) have shown impressive performance in graph representation learning, but they face challenges in capturing long-range dependencies due to their limited expressive power. To address this, Graph Transformers (GTs) were introduced, utilizing self-attention mechanism to effectively model pairwise node relationships. Despite their advantages, GTs suffer from quadratic complexity w.r.t. the number of nodes in the graph, hindering their applicability to large graphs. In this work, we present Graph-Enhanced Contextual Operator (GECO), a scalable and effective alternative to GTs that leverages neighborhood propagation and global convolutions to effectively capture local and global dependencies in quasilinear time. Our study on synthetic datasets reveals that GECO reaches 169x speedup on a graph with 2M nodes w.r.t. optimized attention. Further evaluations on diverse range of benchmarks showcase that GECO scales to large graphs where traditional GTs often face memory and time limitations. Notably, GECO consistently achieves comparable or superior quality compared to baselines, improving the SOTA up to 4.5%, and offering a scalable and effective solution for large-scale graph learning.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Cooperative Minibatching in Graph Neural Networks
Authors:
Muhammed Fatih Balin,
Dominique LaSalle,
Ümit V. Çatalyürek
Abstract:
Significant computational resources are required to train Graph Neural Networks (GNNs) at a large scale, and the process is highly data-intensive. One of the most effective ways to reduce resource requirements is minibatch training coupled with graph sampling. GNNs have the unique property that items in a minibatch have overlap** data. However, the commonly implemented Independent Minibatching a…
▽ More
Significant computational resources are required to train Graph Neural Networks (GNNs) at a large scale, and the process is highly data-intensive. One of the most effective ways to reduce resource requirements is minibatch training coupled with graph sampling. GNNs have the unique property that items in a minibatch have overlap** data. However, the commonly implemented Independent Minibatching approach assigns each Processing Element (PE) its own minibatch to process, leading to duplicated computations and input data access across PEs. This amplifies the Neighborhood Explosion Phenomenon (NEP), which is the main bottleneck limiting scaling. To reduce the effects of NEP in the multi-PE setting, we propose a new approach called Cooperative Minibatching. Our approach capitalizes on the fact that the size of the sampled subgraph is a concave function of the batch size, leading to significant reductions in the amount of work per seed vertex as batch sizes increase. Hence, it is favorable for processors equipped with a fast interconnect to work on a large minibatch together as a single larger processor, instead of working on separate smaller minibatches, even though global batch size is identical. We also show how to take advantage of the same phenomenon in serial execution by generating dependent consecutive minibatches. Our experimental evaluations show up to 4x bandwidth savings for fetching vertex embeddings, by simply increasing this dependency without harming model convergence. Combining our proposed approaches, we achieve up to 64% speedup over Independent Minibatching on single-node multi-GPU systems.
△ Less
Submitted 21 October, 2023; v1 submitted 18 October, 2023;
originally announced October 2023.
-
Open Problems in (Hyper)Graph Decomposition
Authors:
Deepak Ajwani,
Rob H. Bisseling,
Katrin Casel,
Ümit V. Çatalyürek,
Cédric Chevalier,
Florian Chudigiewitsch,
Marcelo Fonseca Faraj,
Michael Fellows,
Lars Gottesbüren,
Tobias Heuer,
George Karypis,
Kamer Kaya,
Jakub Lacki,
Johannes Langguth,
Xiaoye Sherry Li,
Ruben Mayer,
Johannes Meintrup,
Yosuke Mizutani,
François Pellegrini,
Fabrizio Petrini,
Frances Rosamond,
Ilya Safro,
Sebastian Schlag,
Christian Schulz,
Roohani Sharma
, et al. (4 additional authors not shown)
Abstract:
Large networks are useful in a wide range of applications. Sometimes problem instances are composed of billions of entities. Decomposing and analyzing these structures helps us gain new insights about our surroundings. Even if the final application concerns a different problem (such as traversal, finding paths, trees, and flows), decomposing large graphs is often an important subproblem for comple…
▽ More
Large networks are useful in a wide range of applications. Sometimes problem instances are composed of billions of entities. Decomposing and analyzing these structures helps us gain new insights about our surroundings. Even if the final application concerns a different problem (such as traversal, finding paths, trees, and flows), decomposing large graphs is often an important subproblem for complexity reduction or parallelization. This report is a summary of discussions that happened at Dagstuhl seminar 23331 on "Recent Trends in Graph Decomposition" and presents currently open problems and future directions in the area of (hyper)graph decomposition.
△ Less
Submitted 18 October, 2023;
originally announced October 2023.
-
SGORP: A Subgradient-based Method for d-Dimensional Rectilinear Partitioning
Authors:
Muhammed Fatih Balin,
Xiao**g An,
Abdurrahman Yaşar,
Ümit V. Çatalyürek
Abstract:
Partitioning for load balancing is a crucial first step to parallelize any type of computation. In this work, we propose SGORP, a new spatial partitioning method based on Subgradient Optimization, to solve the $d$-dimensional Rectilinear Partitioning Problem (RPP). Our proposed method allows the use of customizable objective functions as well as some user-specific constraints, such as symmetric pa…
▽ More
Partitioning for load balancing is a crucial first step to parallelize any type of computation. In this work, we propose SGORP, a new spatial partitioning method based on Subgradient Optimization, to solve the $d$-dimensional Rectilinear Partitioning Problem (RPP). Our proposed method allows the use of customizable objective functions as well as some user-specific constraints, such as symmetric partitioning on selected dimensions. Extensive experimental evaluation using over 600 test matrices shows that our algorithm achieves favorable performance against the state-of-the-art RPP and Symmetric RPP algorithms. Additionally, we show the effectiveness of our algorithm to do application-specific load balancing using two applications as motivation: Triangle Counting and Sparse Matrix Multiplication (SpGEMM), where we model their load-balancing problems as $3$-dimensional RPPs.
△ Less
Submitted 3 October, 2023;
originally announced October 2023.
-
Layer-Neighbor Sampling -- Defusing Neighborhood Explosion in GNNs
Authors:
Muhammed Fatih Balın,
Ümit V. Çatalyürek
Abstract:
Graph Neural Networks (GNNs) have received significant attention recently, but training them at a large scale remains a challenge. Mini-batch training coupled with sampling is used to alleviate this challenge. However, existing approaches either suffer from the neighborhood explosion phenomenon or have poor performance. To address these issues, we propose a new sampling algorithm called LAyer-neig…
▽ More
Graph Neural Networks (GNNs) have received significant attention recently, but training them at a large scale remains a challenge. Mini-batch training coupled with sampling is used to alleviate this challenge. However, existing approaches either suffer from the neighborhood explosion phenomenon or have poor performance. To address these issues, we propose a new sampling algorithm called LAyer-neighBOR sampling (LABOR). It is designed to be a direct replacement for Neighbor Sampling (NS) with the same fanout hyperparameter while sampling up to 7 times fewer vertices, without sacrificing quality. By design, the variance of the estimator of each vertex matches NS from the point of view of a single vertex. Moreover, under the same vertex sampling budget constraints, LABOR converges faster than existing layer sampling approaches and can use up to 112 times larger batch sizes compared to NS.
△ Less
Submitted 19 May, 2023; v1 submitted 24 October, 2022;
originally announced October 2022.
-
PGAbB: A Block-Based Graph Processing Framework for Heterogeneous Platforms
Authors:
Abdurrahman Yasar,
Sivasankaran Rajamanickam,
Jonathan W. Berry,
Umit V. Catalyurek
Abstract:
Designing flexible graph kernels that can run well on various platforms is a crucial research problem due to the frequent usage of graphs for modeling data and recent architectural advances and variety. In this work, we propose a novel graph processing framework, PGAbB (Parallel Graph Algorithms by Blocks), for modern shared-memory heterogeneous platforms. Our framework implements a block-based pr…
▽ More
Designing flexible graph kernels that can run well on various platforms is a crucial research problem due to the frequent usage of graphs for modeling data and recent architectural advances and variety. In this work, we propose a novel graph processing framework, PGAbB (Parallel Graph Algorithms by Blocks), for modern shared-memory heterogeneous platforms. Our framework implements a block-based programming model. This allows a user to express a graph algorithm using kernels that operate on subgraphs. PGAbB support graph computations that fit in host DRAM but not in GPU device memory, and provides simple but effective scheduling techniques to schedule computations to all available resources in a heterogeneous architecture. We have demonstrated that one can easily implement a diverse set of graph algorithms in our framework by develo** five algorithms. Our experimental results show that PGAbB implementations achieve better or competitive performance compared to hand-optimized implementations. Based on our experiments on five graph algorithms and forty-four graphs, in the median, PGAbB achieves 1.6, 1.6, 5.7, 3.4, 4.5, and 2.4 times better performance than GAPBS, Galois, Ligra, LAGraph Galois-GPU, and Gunrock graph processing systems, respectively.
△ Less
Submitted 9 September, 2022;
originally announced September 2022.
-
A Simple and Elegant Mathematical Formulation for the Acyclic DAG Partitioning Problem
Authors:
M. Yusuf Özkaya,
Ümit V. Çatalyürek
Abstract:
This work addresses the NP-Hard problem of acyclic directed acyclic graph (DAG) partitioning problem. The acyclic partitioning problem is defined as partitioning the vertex set of a given directed acyclic graph into disjoint and collectively exhaustive subsets (parts). Parts are to be assigned such that the total sum of the vertex weights within each part satisfies a common upper bound and the tot…
▽ More
This work addresses the NP-Hard problem of acyclic directed acyclic graph (DAG) partitioning problem. The acyclic partitioning problem is defined as partitioning the vertex set of a given directed acyclic graph into disjoint and collectively exhaustive subsets (parts). Parts are to be assigned such that the total sum of the vertex weights within each part satisfies a common upper bound and the total sum of the edge costs that connect nodes across different parts is minimized. Additionally, the quotient graph, i.e., the induced graph where all nodes that are assigned to the same part are contracted to a single node and edges of those are replaced with cumulative edges towards other nodes, is also a directed acyclic graph. That is, the quotient graph itself is also a graph that contains no cycles. Many computational and real-life applications such as in computational task scheduling, RTL simulations, scheduling of rail-rail transshipment tasks and Very Large Scale Integration (VLSI) design make use of acyclic DAG partitioning. We address the need for a simple and elegant mathematical formulation for the acyclic DAG partitioning problem that enables easier understanding, communication, implementation, and experimentation on the problem.
△ Less
Submitted 27 July, 2022;
originally announced July 2022.
-
More Recent Advances in (Hyper)Graph Partitioning
Authors:
Ümit V. Çatalyürek,
Karen D. Devine,
Marcelo Fonseca Faraj,
Lars Gottesbüren,
Tobias Heuer,
Henning Meyerhenke,
Peter Sanders,
Sebastian Schlag,
Christian Schulz,
Daniel Seemaier,
Dorothea Wagner
Abstract:
In recent years, significant advances have been made in the design and evaluation of balanced (hyper)graph partitioning algorithms. We survey trends of the last decade in practical algorithms for balanced (hyper)graph partitioning together with future research directions. Our work serves as an update to a previous survey on the topic. In particular, the survey extends the previous survey by also c…
▽ More
In recent years, significant advances have been made in the design and evaluation of balanced (hyper)graph partitioning algorithms. We survey trends of the last decade in practical algorithms for balanced (hyper)graph partitioning together with future research directions. Our work serves as an update to a previous survey on the topic. In particular, the survey extends the previous survey by also covering hypergraph partitioning and streaming algorithms, and has an additional focus on parallel algorithms.
△ Less
Submitted 30 June, 2022; v1 submitted 26 May, 2022;
originally announced May 2022.
-
Efficient Hierarchical State Vector Simulation of Quantum Circuits via Acyclic Graph Partitioning
Authors:
Bo Fang,
M. Yusuf Özkaya,
Ang Li,
Ümit V. Çatalyürek,
Sriram Krishnamoorthy
Abstract:
Early but promising results in quantum computing have been enabled by the concurrent development of quantum algorithms, devices, and materials. Classical simulation of quantum programs has enabled the design and analysis of algorithms and implementation strategies targeting current and anticipated quantum device architectures. In this paper, we present a graph-based approach to achieve efficient q…
▽ More
Early but promising results in quantum computing have been enabled by the concurrent development of quantum algorithms, devices, and materials. Classical simulation of quantum programs has enabled the design and analysis of algorithms and implementation strategies targeting current and anticipated quantum device architectures. In this paper, we present a graph-based approach to achieve efficient quantum circuit simulation. Our approach involves partitioning the graph representation of a given quantum circuit into acyclic sub-graphs/circuits that exhibit better data locality. Simulation of each sub-circuit is organized hierarchically, with the iterative construction and simulation of smaller state vectors, improving overall performance. Also, this partitioning reduces the number of passes through data, improving the total computation time. We present three partitioning strategies and observe that acyclic graph partitioning typically results in the best time-to-solution. In contrast, other strategies reduce the partitioning time at the expense of potentially increased simulation times. Experimental evaluation demonstrates the effectiveness of our approach.
△ Less
Submitted 14 May, 2022;
originally announced May 2022.
-
Batch Dynamic Algorithm to Find k-Cores and Hierarchies
Authors:
Kasimir Gabert,
Ali Pınar,
Ümit V. Çatalyürek
Abstract:
Finding $k$-cores in graphs is a valuable and effective strategy for extracting dense regions of otherwise sparse graphs. We focus on the important problem of maintaining cores on rapidly changing dynamic graphs, where batches of edge changes need to be processed quickly. Prior batch core algorithms have only addressed half the problem of maintaining cores, the problem of maintaining a core decomp…
▽ More
Finding $k$-cores in graphs is a valuable and effective strategy for extracting dense regions of otherwise sparse graphs. We focus on the important problem of maintaining cores on rapidly changing dynamic graphs, where batches of edge changes need to be processed quickly. Prior batch core algorithms have only addressed half the problem of maintaining cores, the problem of maintaining a core decomposition. This finds vertices that are dense, but not regions; it misses connectivity. To address this, we bring an efficient index from community search into the core domain, the Shell Tree Index. We develop a novel dynamic batch algorithm to maintain it that improves efficiency over processing edge-by-edge. We implement our algorithm and experimentally show that with it core queries can be returned on rapidly changing graphs quickly enough for interactive applications. For 1 million edge batches, on many graphs we run over $100\times$ faster than processing edge-by-edge while remaining under re-computing from scratch.
△ Less
Submitted 24 March, 2022;
originally announced March 2022.
-
MG-GCN: Scalable Multi-GPU GCN Training Framework
Authors:
Muhammed Fatih Balın,
Kaan Sancak,
Ümit V. Çatalyürek
Abstract:
Full batch training of Graph Convolutional Network (GCN) models is not feasible on a single GPU for large graphs containing tens of millions of vertices or more. Recent work has shown that, for the graphs used in the machine learning community, communication becomes a bottleneck and scaling is blocked outside of the single machine regime. Thus, we propose MG-GCN, a multi-GPU GCN training framework…
▽ More
Full batch training of Graph Convolutional Network (GCN) models is not feasible on a single GPU for large graphs containing tens of millions of vertices or more. Recent work has shown that, for the graphs used in the machine learning community, communication becomes a bottleneck and scaling is blocked outside of the single machine regime. Thus, we propose MG-GCN, a multi-GPU GCN training framework taking advantage of the high-speed communication links between the GPUs present in multi-GPU systems. MG-GCN employs multiple High-Performance Computing optimizations, including efficient re-use of memory buffers to reduce the memory footprint of training GNN models, as well as communication and computation overlap. These optimizations enable execution on larger datasets, that generally do not fit into memory of a single GPU in state-of-the-art implementations. Furthermore, they contribute to achieve superior speedup compared to the state-of-the-art. For example, MG-GCN achieves super-linear speedup with respect to DGL, on the Reddit graph on both DGX-1 (V100) and DGX-A100.
△ Less
Submitted 16 October, 2021;
originally announced October 2021.
-
A Block-Based Triangle Counting Algorithm on Heterogeneous Environments
Authors:
Abdurrahman Yaşar,
Sivasankaran Rajamanickam,
Jonathan Berry,
Ümit V. Çatalyürek
Abstract:
Triangle counting is a fundamental building block in graph algorithms. In this paper, we propose a block-based triangle counting algorithm to reduce data movement during both sequential and parallel execution. Our block-based formulation makes the algorithm naturally suitable for heterogeneous architectures. The problem of partitioning the adjacency matrix of a graph is well-studied. Our task deco…
▽ More
Triangle counting is a fundamental building block in graph algorithms. In this paper, we propose a block-based triangle counting algorithm to reduce data movement during both sequential and parallel execution. Our block-based formulation makes the algorithm naturally suitable for heterogeneous architectures. The problem of partitioning the adjacency matrix of a graph is well-studied. Our task decomposition goes one step further: it partitions the set of triangles in the graph. By streaming these small tasks to compute resources, we can solve problems that do not fit on a device. We demonstrate the effectiveness of our approach by providing an implementation on a compute node with multiple sockets, cores and GPUs. The current state-of-the-art in triangle enumeration processes the Friendster graph in 2.1 seconds, not including data copy time between CPU and GPU. Using that metric, our approach is 20 percent faster. When copy times are included, our algorithm takes 3.2 seconds. This is 5.6 times faster than the fastest published CPU-only time.
△ Less
Submitted 25 September, 2020;
originally announced September 2020.
-
On Symmetric Rectilinear Matrix Partitioning
Authors:
Abdurrahman Yaşar,
Muhammed Fatih Balin,
Xiao**g An,
Kaan Sancak,
Ümit V. Çatalyürek
Abstract:
Even distribution of irregular workload to processing units is crucial for efficient parallelization in many applications. In this work, we are concerned with a spatial partitioning called rectilinear partitioning (also known as generalized block distribution) of sparse matrices. More specifically, in this work, we address the problem of symmetric rectilinear partitioning of a square matrix. By sy…
▽ More
Even distribution of irregular workload to processing units is crucial for efficient parallelization in many applications. In this work, we are concerned with a spatial partitioning called rectilinear partitioning (also known as generalized block distribution) of sparse matrices. More specifically, in this work, we address the problem of symmetric rectilinear partitioning of a square matrix. By symmetric, we mean the rows and columns of the matrix are identically partitioned yielding a tiling where the diagonal tiles (blocks) will be squares. We first show that the optimal solution to this problem is NP-hard, and we propose four heuristics to solve two different variants of this problem. We present a thorough analysis of the computational complexities of those proposed heuristics. To make the proposed techniques more applicable in real life application scenarios, we further reduce their computational complexities by utilizing effective sparsification strategies together with an efficient sparse prefix-sum data structure. We experimentally show the proposed algorithms are efficient and effective on more than six hundred test matrices. With sparsification, our methods take less than 3 seconds in the Twitter graph on a modern 24 core system and output a solution whose load imbalance is no worse than 1%.
△ Less
Submitted 16 September, 2020;
originally announced September 2020.
-
Heuristics for Symmetric Rectilinear Matrix Partitioning
Authors:
Abdurrahman Yaşar,
Ümit V. Çatalyürek
Abstract:
Partitioning sparse matrices and graphs is a common and important problem in many scientific and graph analytics applications. In this work, we are concerned with a spatial partitioning called rectilinear partitioning (also known as generalized block distribution) of sparse matrices, which is needed for tiled (or {\em blocked}) execution of sparse matrix and graph analytics kernels. More specifica…
▽ More
Partitioning sparse matrices and graphs is a common and important problem in many scientific and graph analytics applications. In this work, we are concerned with a spatial partitioning called rectilinear partitioning (also known as generalized block distribution) of sparse matrices, which is needed for tiled (or {\em blocked}) execution of sparse matrix and graph analytics kernels. More specifically, in this work, we address the problem of symmetric rectilinear partitioning of square matrices. By symmetric, we mean having the same partition on rows and columns of the matrix, yielding a special tiling where the diagonal tiles (blocks) will be squares. We propose five heuristics to solve two different variants of this problem, and present a thorough experimental evaluation showing the effectiveness of the proposed algorithms.
△ Less
Submitted 29 December, 2019; v1 submitted 26 September, 2019;
originally announced September 2019.
-
Programming Strategies for Irregular Algorithms on the Emu Chick
Authors:
Eric Hein,
Srinivas Eswar,
Abdurrahman Yaşar,
Jiajia Li,
Jeffrey S. Young,
Thomas M. Conte,
Ümit V. Çatalyürek,
Rich Vuduc,
Jason Riedy,
Bora Uçar
Abstract:
The Emu Chick prototype implements migratory memory-side processing in a novel hardware system. Rather than transferring large amounts of data across the system interconnect, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each remote memory read. Previous work has characterized the performance of the Chick prototype in terms of memory bandwidth and pro…
▽ More
The Emu Chick prototype implements migratory memory-side processing in a novel hardware system. Rather than transferring large amounts of data across the system interconnect, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each remote memory read. Previous work has characterized the performance of the Chick prototype in terms of memory bandwidth and programming differences from more typical, non-migratory platforms, but there has not yet been an analysis of algorithms on this system.
This work evaluates irregular algorithms that could benefit from the lightweight, memory-side processing of the Chick and demonstrates techniques and optimization strategies for achieving performance in sparse matrix-vector multiply operation (SpMV), breadth-first search (BFS), and graph alignment across up to eight distributed nodes encompassing 64 nodelets in the Chick system. We also define and justify relative metrics to compare prototype FPGA-based hardware with established ASIC architectures. The Chick currently supports up to 68x scaling for graph alignment, 80 MTEPS for BFS on balanced graphs, and 50\% of measured STREAM bandwidth for SpMV.
△ Less
Submitted 3 December, 2018;
originally announced January 2019.
-
Geometric Partitioning and Ordering Strategies for Task Map** on Parallel Computers
Authors:
Mehmet Deveci,
Karen D. Devine,
Kevin Pedretti,
Mark A. Taylor,
Sivasankaran Rajamanickam,
Umit V. Catalyurek
Abstract:
We present a new method for map** applications' MPI tasks to cores of a parallel computer such that applications' communication time is reduced. We address the case of sparse node allocation, where the nodes assigned to a job are not necessarily located in a contiguous block nor within close proximity to each other in the network, although our methods generalize to contiguous allocations as well…
▽ More
We present a new method for map** applications' MPI tasks to cores of a parallel computer such that applications' communication time is reduced. We address the case of sparse node allocation, where the nodes assigned to a job are not necessarily located in a contiguous block nor within close proximity to each other in the network, although our methods generalize to contiguous allocations as well. The goal is to assign tasks to cores so that interdependent tasks are performed by "nearby" cores, thus lowering the distance messages must travel, the amount of congestion in the network, and the overall cost of communication. Our new method applies a geometric partitioning algorithm to both the tasks and the processors, and assigns task parts to the corresponding processor parts. We also present a number of algorithmic optimizations that exploit specific features of the network or application. We show that, for the structured finite difference mini-application MiniGhost, our map** methods reduced communication time up to 75% relative to MiniGhost's default map** on 128K cores of a Cray XK7 with sparse allocation. For the atmospheric modeling code E3SM/HOMME, our methods reduced communication time up to 31% on 32K cores of an IBM BlueGene/Q with contiguous allocation.
△ Less
Submitted 25 April, 2018;
originally announced April 2018.
-
An Iterative Global Structure-Assisted Labeled Network Aligner
Authors:
Abdurrahman Yaşar,
Ümit V. Çatalyürek
Abstract:
Integrating data from heterogeneous sources is often modeled as merging graphs. Given two or more 'compatible', but not-isomorphic graphs, the first step is to identify a graph alignment, where a potentially partial map** of vertices between two graphs is computed. A significant portion of the literature on this problem only takes the global structure of the input graphs into account. Only more…
▽ More
Integrating data from heterogeneous sources is often modeled as merging graphs. Given two or more 'compatible', but not-isomorphic graphs, the first step is to identify a graph alignment, where a potentially partial map** of vertices between two graphs is computed. A significant portion of the literature on this problem only takes the global structure of the input graphs into account. Only more recent ones additionally use vertex and edge attributes to achieve a more accurate alignment. However, these methods are not designed to scale to map large graphs arising in many modern applications. We propose a new iterative graph aligner, gsaNA, that uses the global structure of the graphs to significantly reduce the problem size and align large graphs with a minimal loss of information. Concretely, we show that our proposed technique is highly flexible, can be used to achieve higher recall, and it is orders of magnitudes faster than the current state of the art techniques.
△ Less
Submitted 10 March, 2018;
originally announced March 2018.
-
Active Betweenness Cardinality: Algorithms and Applications
Authors:
Yusuf Ozkaya,
A. Erdem Sariyuce,
Umit V. Catalyurek,
Ali Pinar
Abstract:
Centrality rankings such as degree, closeness, betweenness, Katz, PageRank, etc. are commonly used to identify critical nodes in a graph. These methods are based on two assumptions that restrict their wider applicability. First, they assume the exact topology of the network is available. Secondly, they do not take into account the activity over the network and only rely on its topology. However, i…
▽ More
Centrality rankings such as degree, closeness, betweenness, Katz, PageRank, etc. are commonly used to identify critical nodes in a graph. These methods are based on two assumptions that restrict their wider applicability. First, they assume the exact topology of the network is available. Secondly, they do not take into account the activity over the network and only rely on its topology. However, in many applications, the network is autonomous, vast, and distributed, and it is hard to collect the exact topology. At the same time, the underlying pairwise activity between node pairs is not uniform and node criticality strongly depends on the activity on the underlying network.
In this paper, we propose active betweenness cardinality, as a new measure, where the node criticalities are based on not the static structure, but the activity of the network. We show how this metric can be computed efficiently by using only local information for a given node and how we can find the most critical nodes starting from only a few nodes. We also show how this metric can be used to monitor a network and identify failed nodes.We present experimental results to show effectiveness by demonstrating how the failed nodes can be identified by measuring active betweenness cardinality of a few nodes in the system.
△ Less
Submitted 28 November, 2017;
originally announced November 2017.
-
Finding the Hierarchy of Dense Subgraphs using Nucleus Decompositions
Authors:
Ahmet Erdem Sariyuce,
C. Seshadhri,
Ali Pinar,
Umit V. Catalyurek
Abstract:
Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasiclique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand…
▽ More
Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasiclique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which enables discovering overlap** dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-truss decompositions. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
△ Less
Submitted 9 March, 2015; v1 submitted 12 November, 2014;
originally announced November 2014.
-
On Distributed Graph Coloring with Iterative Recoloring
Authors:
Ahmet Erdem Sarıyüce,
Erik Saule,
Ümit V. Çatalyürek
Abstract:
Identifying the sets of operations that can be executed simultaneously is an important problem appearing in many parallel applications. By modeling the operations and their interactions as a graph, one can identify the independent operations by solving a graph coloring problem. Many efficient sequential algorithms are known for this NP-Complete problem, but they are typically unsuitable when the o…
▽ More
Identifying the sets of operations that can be executed simultaneously is an important problem appearing in many parallel applications. By modeling the operations and their interactions as a graph, one can identify the independent operations by solving a graph coloring problem. Many efficient sequential algorithms are known for this NP-Complete problem, but they are typically unsuitable when the operations and their interactions are distributed in the memory of large parallel computers. On top of an existing distributed-memory graph coloring algorithm, we investigate two compatible techniques in this paper for fast and scalable distributed-memory graph coloring. First, we introduce an improvement for the distributed post-processing operation, called recoloring, which drastically improves the number of colors. We propose a novel and efficient communication scheme for recoloring which enables it to scale gracefully. Recoloring must be seeded with an existing coloring of the graph. Our second contribution is to introduce a randomized color selection strategy for initial coloring which quickly produces solutions of modest quality. We extensively evaluate the impact of our new techniques on existing distributed algorithms and show the time-quality tradeoffs. We show that combining an initial randomized coloring with multiple recoloring iterations yields better quality solutions with the smaller runtime at large scale.
△ Less
Submitted 24 July, 2014;
originally announced July 2014.
-
GPU accelerated maximum cardinality matching algorithms for bipartite graphs
Authors:
Mehmet Deveci,
Kamer Kaya,
Bora Ucar,
Umit V. Catalyurek
Abstract:
We design, implement, and evaluate GPU-based algorithms for the maximum cardinality matching problem in bipartite graphs. Such algorithms have a variety of applications in computer science, scientific computing, bioinformatics, and other areas. To the best of our knowledge, ours is the first study which focuses on GPU implementation of the maximum cardinality matching algorithms. We compare the pr…
▽ More
We design, implement, and evaluate GPU-based algorithms for the maximum cardinality matching problem in bipartite graphs. Such algorithms have a variety of applications in computer science, scientific computing, bioinformatics, and other areas. To the best of our knowledge, ours is the first study which focuses on GPU implementation of the maximum cardinality matching algorithms. We compare the proposed algorithms with serial and multicore implementations from the literature on a large set of real-life problems where in majority of the cases one of our GPU-accelerated algorithms is demonstrated to be faster than both the sequential and multicore implementations.
△ Less
Submitted 6 March, 2013;
originally announced March 2013.
-
Incremental Algorithms for Network Management and Analysis based on Closeness Centrality
Authors:
Ahmet Erdem Sariyuce,
Kamer Kaya,
Erik Saule,
Umit V. Catalyurek
Abstract:
Analyzing networks requires complex algorithms to extract meaningful information. Centrality metrics have shown to be correlated with the importance and loads of the nodes in network traffic. Here, we are interested in the problem of centrality-based network management. The problem has many applications such as verifying the robustness of the networks and controlling or improving the entity dissem…
▽ More
Analyzing networks requires complex algorithms to extract meaningful information. Centrality metrics have shown to be correlated with the importance and loads of the nodes in network traffic. Here, we are interested in the problem of centrality-based network management. The problem has many applications such as verifying the robustness of the networks and controlling or improving the entity dissemination. It can be defined as finding a small set of topological network modifications which yield a desired closeness centrality configuration. As a fundamental building block to tackle that problem, we propose incremental algorithms which efficiently update the closeness centrality values upon changes in network topology, i.e., edge insertions and deletions. Our algorithms are proven to be efficient on many real-life networks, especially on small-world networks, which have a small diameter and a spike-shaped shortest distance distribution. In addition to closeness centrality, they can also be a great arsenal for the shortest-path-based management and analysis of the networks. We experimentally validate the efficiency of our algorithms on large networks and show that they update the closeness centrality values of the temporal DBLP-coauthorship network of 1.2 million users 460 times faster than it would take to compute them from scratch. To the best of our knowledge, this is the first work which can yield practical large-scale network management based on closeness centrality values.
△ Less
Submitted 2 March, 2013;
originally announced March 2013.
-
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
Authors:
Erik Saule,
Kamer Kaya,
Umit V. Catalyurek
Abstract:
Intel Xeon Phi is a recently released high-performance coprocessor which features 61 cores each supporting 4 hardware threads with 512-bit wide SIMD registers achieving a peak theoretical performance of 1Tflop/s in double precision. Many scientific applications involve operations on large sparse matrices such as linear solvers, eigensolver, and graph mining algorithms. The core of most of these ap…
▽ More
Intel Xeon Phi is a recently released high-performance coprocessor which features 61 cores each supporting 4 hardware threads with 512-bit wide SIMD registers achieving a peak theoretical performance of 1Tflop/s in double precision. Many scientific applications involve operations on large sparse matrices such as linear solvers, eigensolver, and graph mining algorithms. The core of most of these applications involves the multiplication of a large, sparse matrix with a dense vector (SpMV). In this paper, we investigate the performance of the Xeon Phi coprocessor for SpMV. We first provide a comprehensive introduction to this new architecture and analyze its peak performance with a number of micro benchmarks. Although the design of a Xeon Phi core is not much different than those of the cores in modern processors, its large number of cores and hyperthreading capability allow many application to saturate the available memory bandwidth, which is not the case for many cutting-edge processors. Yet, our performance studies show that it is the memory latency not the bandwidth which creates a bottleneck for SpMV on this architecture. Finally, our experiments show that Xeon Phi's sparse kernel performance is very promising and even better than that of cutting-edge general purpose processors and GPUs.
△ Less
Submitted 5 February, 2013;
originally announced February 2013.
-
Shattering and Compressing Networks for Centrality Analysis
Authors:
Ahmet Erdem Sarıyüce,
Erik Saule,
Kamer Kaya,
Ümit V. Çatalyürek
Abstract:
Who is more important in a network? Who controls the flow between the nodes or whose contribution is significant for connections? Centrality metrics play an important role while answering these questions. The betweenness metric is useful for network analysis and implemented in various tools. Since it is one of the most computationally expensive kernels in graph mining, several techniques have been…
▽ More
Who is more important in a network? Who controls the flow between the nodes or whose contribution is significant for connections? Centrality metrics play an important role while answering these questions. The betweenness metric is useful for network analysis and implemented in various tools. Since it is one of the most computationally expensive kernels in graph mining, several techniques have been proposed for fast computation of betweenness centrality. In this work, we propose and investigate techniques which compress a network and shatter it into pieces so that the rest of the computation can be handled independently for each piece. Although we designed and tuned the shattering process for betweenness, it can be adapted for other centrality metrics in a straightforward manner. Experimental results show that the proposed techniques can be a great arsenal to reduce the centrality computation time for various types of networks.
△ Less
Submitted 26 September, 2012;
originally announced September 2012.
-
Diversifying Citation Recommendations
Authors:
Onur Küçüktunç,
Erik Saule,
Kamer Kaya,
Ümit V. Çatalyürek
Abstract:
Literature search is arguably one of the most important phases of the academic and non-academic research. The increase in the number of published papers each year makes manual search inefficient and furthermore insufficient. Hence, automatized methods such as search engines have been of interest in the last thirty years. Unfortunately, these traditional engines use keyword-based approaches to solv…
▽ More
Literature search is arguably one of the most important phases of the academic and non-academic research. The increase in the number of published papers each year makes manual search inefficient and furthermore insufficient. Hence, automatized methods such as search engines have been of interest in the last thirty years. Unfortunately, these traditional engines use keyword-based approaches to solve the search problem, but these approaches are prone to ambiguity and synonymy. On the other hand, bibliographic search techniques based only on the citation information are not prone to these problems since they do not consider textual similarity. For many particular research areas and topics, the amount of knowledge to humankind is immense, and obtaining the desired information is as hard as looking for a needle in a haystack. Furthermore, sometimes, what we are looking for is a set of documents where each one is different than the others, but at the same time, as a whole we want them to cover all the important parts of the literature relevant to our search. This paper targets the problem of result diversification in citation-based bibliographic search. It surveys a set of techniques which aim to find a set of papers with satisfactory quality and diversity. We enhance these algorithms with a direction-awareness functionality to allow the users to reach either old, well-cited, well-known research papers or recent, less-known ones. We also propose a set of novel techniques for a better diversification of the results. All the techniques considered are compared by performing a rigorous experimentation. The results show that some of the proposed techniques are very successful in practice while performing a search in a bibliographic database.
△ Less
Submitted 25 September, 2012;
originally announced September 2012.
-
Recommendation on Academic Networks using Direction Aware Citation Analysis
Authors:
Onur Küçüktunç,
Erik Saule,
Kamer Kaya,
Ümit V. Çatalyürek
Abstract:
The literature search has always been an important part of an academic research. It greatly helps to improve the quality of the research process and output, and increase the efficiency of the researchers in terms of their novel contribution to science. As the number of published papers increases every year, a manual search becomes more exhaustive even with the help of today's search engines since…
▽ More
The literature search has always been an important part of an academic research. It greatly helps to improve the quality of the research process and output, and increase the efficiency of the researchers in terms of their novel contribution to science. As the number of published papers increases every year, a manual search becomes more exhaustive even with the help of today's search engines since they are not specialized for this task. In academics, two relevant papers do not always have to share keywords, cite one another, or even be in the same field. Although a well-known paper is usually an easy pray in such a hunt, relevant papers using a different terminology, especially recent ones, are not obvious to the eye.
In this work, we propose paper recommendation algorithms by using the citation information among papers. The proposed algorithms are direction aware in the sense that they can be tuned to find either recent or traditional papers. The algorithms require a set of papers as input and recommend a set of related ones. If the user wants to give negative or positive feedback on the suggested paper set, the recommendation is refined. The search process can be easily guided in that sense by relevance feedback. We show that this slight guidance helps the user to reach a desired paper in a more efficient way. We adapt our models and algorithms also for the venue and reviewer recommendation tasks. Accuracy of the models and algorithms is thoroughly evaluated by comparison with multiple baselines and algorithms from the literature in terms of several objectives specific to citation, venue, and reviewer recommendation tasks. All of these algorithms are implemented within a publicly available web-service framework (http://theadvisor.osu.edu/) which currently uses the data from DBLP and CiteSeer to construct the proposed citation graph.
△ Less
Submitted 5 May, 2012;
originally announced May 2012.
-
Load-Balancing Spatially Located Computations using Rectangular Partitions
Authors:
Erik Saule,
Erdeniz Ö. Baş,
Ümit V. Çatalyürek
Abstract:
Distributing spatially located heterogeneous workloads is an important problem in parallel scientific computing. We investigate the problem of partitioning such workloads (represented as a matrix of non-negative integers) into rectangles, such that the load of the most loaded rectangle (processor) is minimized. Since finding the optimal arbitrary rectangle-based partition is an NP-hard problem, we…
▽ More
Distributing spatially located heterogeneous workloads is an important problem in parallel scientific computing. We investigate the problem of partitioning such workloads (represented as a matrix of non-negative integers) into rectangles, such that the load of the most loaded rectangle (processor) is minimized. Since finding the optimal arbitrary rectangle-based partition is an NP-hard problem, we investigate particular classes of solutions: rectilinear, jagged and hierarchical. We present a new class of solutions called m-way jagged partitions, propose new optimal algorithms for m-way jagged partitions and hierarchical partitions, propose new heuristic algorithms, and provide worst case performance analyses for some existing and new heuristics. Moreover, the algorithms are tested in simulation on a wide set of instances. Results show that two of the algorithms we introduce lead to a much better load balance than the state-of-the-art algorithms. We also show how to design a two-phase algorithm that reaches different time/quality tradeoff.
△ Less
Submitted 13 April, 2011;
originally announced April 2011.
-
Hypergraph Partitioning through Vertex Separators on Graphs
Authors:
Enver Kayaaslan,
Ali Pinar,
Umit V. Catalyurek,
Cevdet Aykanat
Abstract:
The modeling flexibility provided by hypergraphs has drawn a lot of interest from the combinatorial scientific community, leading to novel models and algorithms, their applications, and development of associated tools. Hypergraphs are now a standard tool in combinatorial scientific computing. The modeling flexibility of hypergraphs however, comes at a cost: algorithms on hypergraphs are inherently…
▽ More
The modeling flexibility provided by hypergraphs has drawn a lot of interest from the combinatorial scientific community, leading to novel models and algorithms, their applications, and development of associated tools. Hypergraphs are now a standard tool in combinatorial scientific computing. The modeling flexibility of hypergraphs however, comes at a cost: algorithms on hypergraphs are inherently more complicated than those on graphs, which sometimes translate to nontrivial increases in processing times. Neither the modeling flexibility of hypergraphs, nor the runtime efficiency of graph algorithms can be overlooked. Therefore, the new research thrust should be how to cleverly trade-off between the two. This work addresses one method for this trade-off by solving the hypergraph partitioning problem by finding vertex separators on graphs. Specifically, we investigate how to solve the hypergraph partitioning problem by seeking a vertex separator on its net intersection graph (NIG), where each net of the hypergraph is represented by a vertex, and two vertices share an edge if their nets have a common vertex. We propose a vertex-weighting scheme to attain good node-balanced hypergraphs, since NIG model cannot preserve node balancing information. Vertex-removal and vertex-splitting techniques are described to optimize cutnet and connectivity metrics, respectively, under the recursive bipartitioning paradigm. We also developed an implementation for our GPVS-based HP formulations by adopting and modifying a state-of-the-art GPVS tool onmetis. Experiments conducted on a large collection of sparse matrices confirmed the validity of our proposed techniques.
△ Less
Submitted 1 March, 2011;
originally announced March 2011.