Search | arXiv e-print repository

Algorithms for Large-scale Network Analysis and the NetworKit Toolkit

Authors: Eugenio Angriman, Alexander van der Grinten, Michael Hamann, Henning Meyerhenke, Manuel Penschuck

Abstract: The abundance of massive network data in a plethora of applications makes scalable analysis algorithms and software tools necessary to generate knowledge from such data in reasonable time. Addressing scalability as well as other requirements such as good usability and a rich feature set, the open-source software NetworKit has established itself as a popular tool for large-scale network analysis. This chapter provides a brief overview of the contributions to NetworKit made by the DFG Priority Programme SPP 1736 Algorithms for Big Data. Algorithmic contributions in the areas of centrality computations, community detection, and sparsification are in the focus, but we also mention several other aspects -- such as current software engineering principles of the project and ways to visualize network data within a NetworKit-based workflow.

Submitted 20 September, 2022; originally announced September 2022.

arXiv:2203.01263 [pdf, other]

Interactive Visualization of Protein RINs using NetworKit in the Cloud

Authors: Eugenio Angriman, Fabian Brandt-Tumescheit, Leon Franke, Alexander van der Grinten, Henning Meyerhenke

Abstract: Network analysis has been applied in diverse application domains. In this paper, we consider an example from protein dynamics, specifically residue interaction networks (RINs). In this context, we use NetworKit -- an established package for network analysis -- to build a cloud-based environment that enables domain scientists to run their visualization and analysis workflows on large compute servers, without requiring extensive programming and/or system administration knowledge. To demonstrate the versatility of this approach, we use it to build a custom Jupyter-based widget for RIN visualization. In contrast to existing RIN visualization approaches, our widget can easily be customized through simple modifications of Python code, while both supporting a good feature set and providing near real-time speed. It is also easily integrated into analysis pipelines (e.g., that use Python to feed RIN data into downstream machine learning tasks).

Submitted 2 March, 2022; originally announced March 2022.

arXiv:2202.08808 [pdf, other]

Fast Dynamic Updates and Dynamic SpGEMM on MPI-Distributed Graphs

Authors: Alexander van der Grinten, Geert Custers, Duy Le Thanh, Henning Meyerhenke

Abstract: Sparse matrix multiplication (SpGEMM) is a fundamental kernel used in many diverse application areas, both numerical and discrete. For example, many algebraic graph algorithms rely on SpGEMM in the tropical semiring to compute shortest paths in graphs. Recently, SpGEMM has received growing attention regarding implementations for specific (parallel) architectures. Yet, this concerns only the static problem, where both input matrices do not change. In many applications, however, matrices (or their corresponding graphs) change over time. Although recomputing from scratch is very expensive, we are not aware of any dynamic SpGEMM algorithms in the literature. In this paper, we thus propose a batch-dynamic algorithm for MPI-based parallel computing. Building on top of a distributed graph/matrix data structure that allows for fast updates, our dynamic SpGEMM reduces the communication volume significantly. It does so by exploiting that updates change far fewer matrix entries than there are non-zeros in the input operands. Our experiments with popular benchmark graphs show that our approach pays off. For batches of insertions or removals of matrix entries, our dynamic SpGEMM is substantially faster than the static algorithms in the state-of-the-art competitors CombBLAS, CTF and PETSc.

Submitted 31 May, 2022; v1 submitted 17 February, 2022; originally announced February 2022.

Comments: various updates
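
The abstract above mentions SpGEMM in the tropical (min-plus) semiring as a building block for shortest paths. As a rough illustration only, the following sketch multiplies two sparse matrices stored as nested dictionaries, replacing "multiply" with addition and "add" with minimum. The function name `minplus_spgemm` and the dict-of-dicts layout are assumptions made for this example; it is a static, sequential sketch and not the batch-dynamic MPI algorithm the paper proposes.

```python
import math


def minplus_spgemm(A, B):
    """Sparse product C = A (x) B in the min-plus semiring.

    A and B map row index -> {column index -> weight}; missing entries
    are treated as +infinity (the semiring's zero element).
    """
    C = {}
    for i, row in A.items():
        out = {}
        for k, a in row.items():
            # Combine entry A[i][k] with every stored entry of row k of B.
            for j, b in B.get(k, {}).items():
                w = a + b  # semiring "multiplication"
                if w < out.get(j, math.inf):  # semiring "addition" is min
                    out[j] = w
        if out:
            C[i] = out
    return C


# Hypothetical weighted path 0 -> 1 -> 2 as an adjacency matrix.
A = {0: {1: 2.0}, 1: {2: 3.0}}
two_hop = minplus_spgemm(A, A)  # cheapest walks using exactly two edges
```

Squaring the adjacency matrix this way yields, for each vertex pair, the cheapest walk with exactly two edges; here `two_hop` holds the single entry for the pair (0, 2).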

arXiv:2101.06192 [pdf, other]

New Approximation Algorithms for Forest Closeness Centrality -- for Individual Vertices and Vertex Groups

Authors: Alexander van der Grinten, Eugenio Angriman, Maria Predari, Henning Meyerhenke

Abstract: The emergence of massive graph data sets requires fast mining algorithms. Centrality measures to identify important vertices belong to the most popular analysis methods in graph mining. A measure that is gaining attention is forest closeness centrality; it is closely related to electrical measures using current flow but can also handle disconnected graphs. Recently, [** et al., ICDM'19] proposed an algorithm to approximate this measure probabilistically. Their algorithm processes small inputs quickly, but does not scale well beyond hundreds of thousands of vertices. In this paper, we first propose a different approximation algorithm; it is up to two orders of magnitude faster and more accurate in practice. Our method exploits the strong connection between uniform spanning trees and forest distances by adapting and extending recent approximation algorithms for related single-vertex problems. This results in a nearly-linear time algorithm with an absolute probabilistic error guarantee. In addition, we are the first to consider the problem of finding an optimal group of vertices w.r.t. forest closeness. We prove that this latter problem is NP-hard; to approximate it, we adapt a greedy algorithm by [Li et al., WWW'19], which is based on (partial) matrix inversion. Moreover, our experiments show that on disconnected graphs, group forest closeness outperforms existing centrality measures in the context of semi-supervised vertex classification.

Submitted 15 January, 2021; originally announced January 2021.

arXiv:2010.15435 [pdf, other]

Group-Harmonic and Group-Closeness Maximization -- Approximation and Engineering

Authors: Eugenio Angriman, Ruben Becker, Gianlorenzo D'Angelo, Hugo Gilbert, Alexander van der Grinten, Henning Meyerhenke

Abstract: Centrality measures characterize important nodes in networks. Efficiently computing such nodes has received a lot of attention. When considering the generalization of computing central groups of nodes, challenging optimization problems occur. In this work, we study two such problems, group-harmonic maximization and group-closeness maximization both from a theoretical and from an algorithm engineering perspective. On the theoretical side, we obtain the following results. For group-harmonic maximization, unless $P=NP$, there is no polynomial-time algorithm that achieves an approximation factor better than $1-1/e$ (directed) and $1-1/(4e)$ (undirected), even for unweighted graphs. On the positive side, we show that a greedy algorithm achieves an approximation factor of $λ(1-2/e)$ (directed) and $λ(1-1/e)/2$ (undirected), where $λ$ is the ratio of minimal and maximal edge weights. For group-closeness maximization, the undirected case is $NP$-hard to be approximated to within a factor better than $1-1/(e+1)$ and a constant approximation factor is achieved by a local-search algorithm. For the directed case, however, we show that, for any $ε<1/2$, the problem is $NP$-hard to be approximated within a factor of $4|V|^{-ε}$. From the algorithm engineering perspective, we provide efficient implementations of the above greedy and local search algorithms. In our experimental study we show that, on small instances where an optimum solution can be computed in reasonable time, the quality of both the greedy and the local search algorithms comes very close to the optimum. On larger instances, our local search algorithms yield results with superior quality compared to existing greedy and local search solutions, at the cost of additional running time. We thus advocate local search for scenarios where solution quality is of highest concern.

Submitted 29 October, 2020; originally announced October 2020.
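
To make the greedy strategy named in the abstract above more concrete, here is a minimal sketch for unweighted, undirected graphs. It assumes the group-harmonic score of a group S is the sum of 1/dist(S, v) over all vertices v outside S, and it picks, in each round, the vertex that gives the highest score. The helper names (`bfs_distances`, `harmonic_score`, `greedy_group_harmonic`) are made up for this example, and the all-pairs BFS is a simplification that would not scale; the paper's engineered implementation is not reproduced here.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source in an unweighted graph (adjacency lists)."""
    dist = [math.inf] * len(adj)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] == math.inf:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def harmonic_score(dist_to_group):
    """Sum of 1/d over vertices at finite, non-zero distance from the group."""
    return sum(1.0 / d for d in dist_to_group if 0 < d < math.inf)


def greedy_group_harmonic(adj, k):
    """Greedily build a group of up to k vertices (naive sketch)."""
    n = len(adj)
    all_dist = [bfs_distances(adj, s) for s in range(n)]  # O(n^2) memory
    best = [math.inf] * n  # distance of every vertex to the current group
    group, score = [], 0.0
    for _ in range(min(k, n)):
        choice, choice_score, choice_dist = None, -1.0, None
        for c in range(n):
            if c in group:
                continue
            cand = [min(a, b) for a, b in zip(best, all_dist[c])]
            s = harmonic_score(cand)
            if s > choice_score:
                choice, choice_score, choice_dist = c, s, cand
        group.append(choice)
        best, score = choice_dist, choice_score
    return group, score


# Hypothetical path graph 0 - 1 - 2 - 3 - 4.
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
group, score = greedy_group_harmonic(path, 1)
```

On the five-vertex path, a group of size one is best placed at the middle vertex, which is what the sketch selects.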

arXiv:2006.13679 [pdf, other]

Approximation of the Diagonal of a Laplacian's Pseudoinverse for Complex Network Analysis

Authors: Eugenio Angriman, Maria Predari, Alexander van der Grinten, Henning Meyerhenke

Abstract: The ubiquity of massive graph data sets in numerous applications requires fast algorithms for extracting knowledge from these data. We are motivated here by three electrical measures for the analysis of large small-world graphs $G = (V, E)$ -- i.e., graphs with diameter in $O(\log |V|)$, which are abundant in complex network analysis. From a computational point of view, the three measures have in common that their crucial component is the diagonal of the graph Laplacian's pseudoinverse, $L^\dagger$. Computing diag$(L^\dagger)$ exactly by pseudoinversion, however, is as expensive as dense matrix multiplication -- and the standard tools in practice even require cubic time. Moreover, the pseudoinverse requires quadratic space -- hardly feasible for large graphs. Resorting to approximation by, e.g., using the Johnson-Lindenstrauss transform, requires the solution of $O(\log |V| / ε^2)$ Laplacian linear systems to guarantee a relative error, which is still very expensive for large inputs. In this paper, we present a novel approximation algorithm that requires the solution of only one Laplacian linear system. The remaining parts are purely combinatorial -- mainly sampling uniform spanning trees, which we relate to diag$(L^\dagger)$ via effective resistances. For small-world networks, our algorithm obtains a $\pm ε$-approximation with high probability, in a time that is nearly-linear in $|E|$ and quadratic in $1 / ε$. Another positive aspect of our algorithm is its parallel nature due to independent sampling. We thus provide two parallel implementations of our algorithm: one using OpenMP, one MPI + OpenMP. In our experiments against the state of the art, our algorithm (i) yields more accurate results, (ii) is much faster and more memory-efficient, and (iii) obtains good parallel speedups, in particular in the distributed setting.

Submitted 8 February, 2021; v1 submitted 24 June, 2020; originally announced June 2020.
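
The abstract above relies on sampling uniform spanning trees. One standard way to draw such a tree is Wilson's algorithm, which runs loop-erased random walks toward the growing tree; the abstract itself does not say which sampler is used, so treat the choice as an assumption of this sketch. The function name `wilson_ust` and the adjacency-list input are likewise illustrative, and the graph is assumed to be connected and undirected.

```python
import random


def wilson_ust(adj, root=0, rng=None):
    """Sample a uniform spanning tree of a connected undirected graph.

    Returns a parent array: parent[root] is None, and every other vertex
    points to its neighbour on the tree path toward the root.
    """
    rng = rng or random.Random()
    n = len(adj)
    parent = [None] * n
    in_tree = [False] * n
    in_tree[root] = True
    for start in range(n):
        # Random walk from start until the current tree is hit.
        # Overwriting parent[u] on every revisit erases loops implicitly.
        u = start
        while not in_tree[u]:
            parent[u] = rng.choice(adj[u])
            u = parent[u]
        # Retrace the loop-erased path and attach it to the tree.
        u = start
        while not in_tree[u]:
            in_tree[u] = True
            u = parent[u]
    return parent


# Hypothetical 4-cycle 0 - 1 - 2 - 3 - 0.
cycle = [[1, 3], [0, 2], [1, 3], [2, 0]]
tree = wilson_ust(cycle, root=0, rng=random.Random(1))
```

Because each call draws an independent tree, many samples can be generated in parallel, which matches the abstract's remark about the parallel nature of independent sampling.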

arXiv:2001.07134 [pdf, other]

High-Quality Hierarchical Process Mapping

Authors: Marcelo Fonseca Faraj, Alexander van der Grinten, Henning Meyerhenke, Jesper Larsson Träff, Christian Schulz

Abstract: Partitioning graphs into blocks of roughly equal size such that few edges run between blocks is a frequently needed operation when processing graphs on a parallel computer. When a topology of a distributed system is known an important task is then to map the blocks of the partition onto the processors such that the overall communication cost is reduced. We present novel multilevel algorithms that integrate graph partitioning and process mapping. Important ingredients of our algorithm include fast label propagation, more localized local search, initial partitioning, as well as a compressed data structure to compute processor distances without storing a distance matrix. Experiments indicate that our algorithms speed up the overall mapping process and, due to the integrated multilevel approach, also find much better solutions in practice. For example, one configuration of our algorithm yields better solutions than the previous state-of-the-art in terms of mapping quality while being a factor 62 faster. Compared to the currently fastest iterated multilevel mapping algorithm Scotch, we obtain 16% better solutions while investing slightly more running time.

Submitted 22 January, 2020; v1 submitted 20 January, 2020; originally announced January 2020.

arXiv:1911.03360 [pdf, other]

Local Search for Group Closeness Maximization on Big Graphs

Authors: Eugenio Angriman, Alexander van der Grinten, Henning Meyerhenke

Abstract: In network analysis and graph mining, closeness centrality is a popular measure to infer the importance of a vertex. Computing closeness efficiently for individual vertices received considerable attention. The NP-hard problem of group closeness maximization, in turn, is more challenging: the objective is to find a vertex group that is central as a whole and state-of-the-art heuristics for it do not scale to very big graphs yet. In this paper, we present new local search heuristics for group closeness maximization. By using randomized approximation techniques and dynamic data structures, our algorithms are often able to perform locally optimal decisions efficiently. The final result is a group with high (but not optimal) closeness centrality. We compare our algorithms to the current state-of-the-art greedy heuristic both on weighted and on unweighted real-world graphs. For graphs with hundreds of millions of edges, our local search algorithms take only around ten minutes, while greedy requires more than ten hours. Overall, our new algorithms are between one and two orders of magnitude faster, depending on the desired group size and solution quality. For example, on weighted graphs and $k = 10$, our algorithms yield solutions of $12.4\%$ higher quality, while also being $793.6\times$ faster. For unweighted graphs and $k = 10$, we achieve solutions within $99.4\%$ of the state-of-the-art quality while being $127.8\times$ faster.

Submitted 8 November, 2019; originally announced November 2019.

arXiv:1910.13874 [pdf, other]

Group Centrality Maximization for Large-scale Graphs

Authors: Eugenio Angriman, Alexander van der Grinten, Aleksandar Bojchevski, Daniel Zügner, Stephan Günnemann, Henning Meyerhenke

Abstract: The study of vertex centrality measures is a key aspect of network analysis. Naturally, such centrality measures have been generalized to groups of vertices; for popular measures it was shown that the problem of finding the most central group is $\mathcal{NP}$-hard. As a result, approximation algorithms to maximize group centralities were introduced recently. Despite a nearly-linear running time, approximation algorithms for group betweenness and (to a lesser extent) group closeness are rather slow on large networks due to high constant overheads. That is why we introduce GED-Walk centrality, a new submodular group centrality measure inspired by Katz centrality. In contrast to closeness and betweenness, it considers walks of any length rather than shortest paths, with shorter walks having a higher contribution. We define algorithms that (i) efficiently approximate the GED-Walk score of a given group and (ii) efficiently approximate the (proved to be $\mathcal{NP}$-hard) problem of finding a group with highest GED-Walk score. Experiments on several real-world datasets show that scores obtained by GED-Walk improve performance on common graph mining tasks such as collective classification and graph-level classification. An evaluation of empirical running times demonstrates that maximizing GED-Walk (in approximation) is two orders of magnitude faster compared to group betweenness approximation and for group sizes $\leq 100$ one to two orders faster than group closeness approximation. For graphs with tens of millions of edges, approximate GED-Walk maximization typically needs less than one minute. Furthermore, our experiments suggest that the maximization algorithms scale linearly with the size of the input graph and the size of the group.

Submitted 30 October, 2019; originally announced October 2019.

arXiv:1910.11039 [pdf, other]

Scaling Betweenness Approximation to Billions of Edges by MPI-based Adaptive Sampling

Authors: Alexander van der Grinten, Henning Meyerhenke

Abstract: Betweenness centrality is one of the most popular vertex centrality measures in network analysis. Hence, many (sequential and parallel) algorithms to compute or approximate betweenness have been devised. Recent algorithmic advances have made it possible to approximate betweenness very efficiently on shared-memory architectures. Yet, the best shared-memory algorithms can still take hours of running time for large graphs, especially for graphs with a high diameter or when a small relative error is required. In this work, we present an MPI-based generalization of the state-of-the-art shared-memory algorithm for betweenness approximation. This algorithm is based on adaptive sampling; our parallelization strategy can be applied in the same manner to adaptive sampling algorithms for other problems. In experiments on a 16-node cluster, our MPI-based implementation is by a factor of 16.1x faster than the state-of-the-art shared-memory implementation when considering our parallelization focus -- the adaptive sampling phase -- only. For the complete algorithm, we obtain an average (geom. mean) speedup factor of 7.4x over the state of the art. For some previously very challenging inputs, this speedup is much higher. As a result, our algorithm is the first to approximate betweenness centrality on graphs with several billion edges in less than ten minutes with high accuracy.

Submitted 24 October, 2019; originally announced October 2019.

arXiv:1904.04690 [pdf, other]

Guidelines for Experimental Algorithmics in Network Analysis

Authors: Eugenio Angriman, Alexander van der Grinten, Moritz von Looz, Henning Meyerhenke, Martin Nöllenburg, Maria Predari, Charilaos Tzovas

Abstract: The field of network science is a highly interdisciplinary area; for the empirical analysis of network data, it draws algorithmic methodologies from several research fields. Hence, research procedures and descriptions of the technical results often differ, sometimes widely. In this paper we focus on methodologies for the experimental part of algorithm engineering for network analysis -- an important ingredient for a research area with empirical focus. More precisely, we unify and adapt existing recommendations from different fields and propose universal guidelines -- including statistical analyses -- for the systematic evaluation of network analysis algorithms. This way, the behavior of newly proposed algorithms can be properly assessed and comparisons to existing solutions become meaningful. Moreover, as the main technical contribution, we provide SimexPal, a highly automated tool to perform and analyze experiments following our guidelines. To illustrate the merits of SimexPal and our guidelines, we apply them in a case study: we design, perform, visualize and evaluate experiments of a recent algorithm for approximating betweenness centrality, an important problem in network analysis. In summary, both our guidelines and SimexPal shall modernize and complement previous efforts in experimental algorithmics; they are not only useful for network analysis, but also in related contexts.

Submitted 25 March, 2019; originally announced April 2019.

arXiv:1903.09422 [pdf, other]

Parallel Adaptive Sampling with almost no Synchronization

Authors: Alexander van der Grinten, Eugenio Angriman, Henning Meyerhenke

Abstract: Approximation via sampling is a widespread technique whenever exact solutions are too expensive. In this paper, we present techniques for an efficient parallelization of adaptive (a. k. a. progressive) sampling algorithms on multi-threaded shared-memory machines. Our basic algorithmic technique requires no synchronization except for atomic load-acquire and store-release operations. It does, however, require O(n) memory per thread, where n is the size of the sampling state. We present variants of the algorithm that either reduce this memory consumption to O(1) or ensure that deterministic results are obtained. Using the KADABRA algorithm for betweenness centrality (a popular measure in network analysis) approximation as a case study, we demonstrate the empirical performance of our techniques. In particular, on a 32-core machine, our best algorithm is 2.9x faster than what we could achieve using a straightforward OpenMP-based parallelization and 65.3x faster than the existing implementation of KADABRA.

Submitted 22 March, 2019; originally announced March 2019.

arXiv:1807.03847 [pdf, other]

Scalable Katz Ranking Computation in Large Static and Dynamic Graphs

Authors: Alexander van der Grinten, Elisabetta Bergamini, Oded Green, David A. Bader, Henning Meyerhenke

Abstract: Network analysis defines a number of centrality measures to identify the most central nodes in a network. Fast computation of those measures is a major challenge in algorithmic network analysis. Aside from closeness and betweenness, Katz centrality is one of the established centrality measures. In this paper, we consider the problem of computing rankings for Katz centrality. In particular, we propose upper and lower bounds on the Katz score of a given node. While previous approaches relied on numerical approximation or heuristics to compute Katz centrality rankings, we construct an algorithm that iteratively improves those upper and lower bounds until a correct Katz ranking is obtained. We extend our algorithm to dynamic graphs while maintaining its correctness guarantees. Experiments demonstrate that our static graph algorithm outperforms both numerical approaches and heuristics with speedups between 1.5x and 3.5x, depending on the desired quality guarantees. Our dynamic graph algorithm improves upon the static algorithm for update batches of less than 10000 edges. We provide efficient parallel CPU and GPU implementations of our algorithms that enable near real-time Katz centrality computation for graphs with hundreds of millions of nodes in fractions of seconds.

Submitted 10 July, 2018; originally announced July 2018.

Comments: Published at ESA'18
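
The idea of tightening upper and lower bounds until a ranking is certain, as described in the abstract above, can be sketched for undirected, unweighted graphs as follows. The sketch assumes the Katz score of v is the sum over r >= 1 of alpha^r times the number of walks of length r starting at v, and it uses a simple tail bound based on the maximum degree (each walk can be extended in at most max_deg ways). The function name `katz_ranking`, the tolerance `eps`, and this particular bound are assumptions for illustration and may differ from the bounds derived in the paper.

```python
def katz_ranking(adj, alpha, eps=1e-9, max_rounds=1000):
    """Rank vertices by Katz score using iteratively refined bounds.

    adj: adjacency lists of an undirected graph.
    Requires alpha * max_degree < 1 so that the tail bound converges.
    Returns (order, lower, upper) with lower[v] <= Katz(v) <= upper[v].
    """
    n = len(adj)
    max_deg = max((len(nbrs) for nbrs in adj), default=0)
    if alpha * max_deg >= 1:
        raise ValueError("need alpha * max_degree < 1")
    # Geometric tail: sum over j >= 1 of (alpha * max_deg)^j.
    tail = alpha * max_deg / (1 - alpha * max_deg)
    walks = [1.0] * n            # number of walks of length 0 from each vertex
    lower = [0.0] * n
    upper = [float("inf")] * n
    order = list(range(n))
    factor = 1.0                 # alpha^r for the current round r
    for _ in range(max_rounds):
        # Walks of length r from v extend walks of length r-1 from its neighbours.
        walks = [sum(walks[u] for u in adj[v]) for v in range(n)]
        factor *= alpha
        lower = [lower[v] + factor * walks[v] for v in range(n)]
        upper = [lower[v] + factor * walks[v] * tail for v in range(n)]
        order = sorted(range(n), key=lambda v: -lower[v])
        # Stop once neighbouring ranks are separated up to the tolerance eps.
        if all(lower[a] >= upper[b] - eps for a, b in zip(order, order[1:])):
            break
    return order, lower, upper


# Hypothetical star graph: centre 0 with three leaves.
star = [[1, 2, 3], [0], [0], [0]]
order, lower, upper = katz_ranking(star, alpha=0.2)
```

The tolerance matters because vertices with identical scores (the three leaves here) can never be strictly separated; with `eps > 0` the loop ends once their intervals are narrow enough.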

arXiv:1310.4756 [pdf, other]

Effectiveness of pre- and inprocessing for CDCL-based SAT solving

Authors: Andreas Wotzlaw, Alexander van der Grinten, Ewald Speckenmeyer

Abstract: Applying pre- and inprocessing techniques to simplify CNF formulas both before and during search can considerably improve the performance of modern SAT solvers. These algorithms mostly aim at reducing the number of clauses, literals, and variables in the formula. However, to be worthwhile, it is necessary that their additional runtime does not exceed the runtime saved during the subsequent SAT solver execution. In this paper we investigate the efficiency and the practicability of selected simplification algorithms for CDCL-based SAT solving. We first analyze them by means of their expected impact on the CNF formula and on SAT solving in general. While testing them on real-world and combinatorial SAT instances, we show which techniques and combinations of them yield a desirable speedup and which ones should be avoided.

Submitted 17 October, 2013; originally announced October 2013.

Comments: 9 pages, 4 figures

Showing 1–14 of 14 results for author: van der Grinten, A