-
Mining Maximal Induced Bicliques using Odd Cycle Transversals
Authors:
Kyle Kloster,
Blair D. Sullivan,
Andrew van der Poel
Abstract:
Many common graph data mining tasks take the form of identifying dense subgraphs (e.g., clustering and clique-finding). In biological applications, the natural model for these dense substructures is often a complete bipartite graph (biclique), and the problem requires enumerating all maximal bicliques (instead of just identifying the largest or densest). The best known algorithm in general graphs is due to Dias et al., and runs in time O(M |V|^4), where M is the number of maximal induced bicliques (MIBs) in the graph. When the graph being searched is itself bipartite, Zhang et al. give a faster algorithm where the time per MIB depends on the number of edges in the graph. In this work, we present a new algorithm for enumerating MIBs in general graphs, whose run time depends on how "close to bipartite" the input is. Specifically, the runtime is parameterized by the size k of an odd cycle transversal (OCT), a vertex set whose deletion results in a bipartite graph. Our algorithm runs in time O(M |V||E| k^2 3^(k/3)), which is an improvement on Dias et al. whenever k <= 3 log_3(|V|). We implement our algorithm alongside a variant of Dias et al.'s in open-source C++ code, and experimentally verify that the OCT-based approach is faster in practice on graphs with a wide variety of sizes, densities, and OCT decompositions.
Submitted 24 January, 2019; v1 submitted 26 October, 2018;
originally announced October 2018.
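The abstract above defines an odd cycle transversal (OCT) as a vertex set whose deletion leaves a bipartite graph. As a minimal sketch of that definition only (not the authors' enumeration algorithm or their C++ code), the check below 2-colors the remaining graph with breadth-first search. The function names `is_bipartite` and `is_oct`, the adjacency-dictionary input format, and the example graph are all assumptions made for illustration.

```python
from collections import deque


def is_bipartite(adj, removed=frozenset()):
    """Return True if the graph induced on vertices outside `removed` is 2-colorable.

    `adj` is a hypothetical input format: a dict mapping each vertex to a
    list of its neighbors in an undirected graph.
    """
    color = {}
    for start in adj:
        if start in removed or start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in removed:
                    continue
                if v not in color:
                    # Give each newly reached vertex the opposite color.
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    # Two adjacent vertices share a color: an odd cycle exists.
                    return False
    return True


def is_oct(adj, candidate):
    """Check the definition: deleting `candidate` must leave a bipartite graph."""
    return is_bipartite(adj, frozenset(candidate))


# Illustrative graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(is_oct(graph, {0}))    # deleting a triangle vertex breaks the odd cycle
print(is_oct(graph, {3}))    # deleting the pendant vertex leaves the triangle
```

Under these assumptions, any single triangle vertex is an OCT of size k = 1 for the example graph, while the pendant vertex alone is not.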
-
Structural Rounding: Approximation Algorithms for Graphs Near an Algorithmically Tractable Class
Authors:
Erik D. Demaine,
Timothy D. Goodrich,
Kyle Kloster,
Brian Lavallee,
Quanquan C. Liu,
Blair D. Sullivan,
Ali Vakilian,
Andrew van der Poel
Abstract:
We develop a new framework for generalizing approximation algorithms from the structural graph algorithm literature so that they apply to graphs somewhat close to that class (a scenario we expect is common when working with real-world networks) while still guaranteeing approximation ratios. The idea is to $\textit{edit}$ a given graph via vertex- or edge-deletions to put the graph into an algorithmically tractable class, apply known approximation algorithms for that class, and then $\textit{lift}$ the solution to apply to the original graph. We give a general characterization of when an optimization problem is amenable to this approach, and show that it includes many well-studied graph problems, such as Independent Set, Vertex Cover, Feedback Vertex Set, Minimum Maximal Matching, Chromatic Number, ($\ell$-)Dominating Set, Edge ($\ell$-)Dominating Set, and Connected Dominating Set.
To enable this framework, we develop new editing algorithms that find the approximately-fewest edits required to bring a given graph into one of several important graph classes (in some cases, also approximating the target parameter of the family). For bounded degeneracy, we obtain a bicriteria $(4,4)$-approximation which also extends to a smoother bicriteria trade-off. For bounded treewidth, we obtain a bicriteria $(O(\log^{1.5} n), O(\sqrt{\log w}))$-approximation, and for bounded pathwidth, we obtain a bicriteria $(O(\log^{1.5} n), O(\sqrt{\log w} \cdot \log n))$-approximation. For treedepth $2$ (also related to bounded expansion), we obtain a $4$-approximation. We also prove complementary hardness-of-approximation results assuming $\mathrm{P} \neq \mathrm{NP}$: in particular, these problems are all log-factor inapproximable, except the last which is not approximable below some constant factor ($2$ assuming UGC).
Submitted 9 December, 2018; v1 submitted 7 June, 2018;
originally announced June 2018.
-
Subgraph centrality and walk-regularity
Authors:
Eric Horton,
Kyle Kloster,
Blair D. Sullivan
Abstract:
Matrix-based centrality measures have enjoyed significant popularity in network analysis, in no small part due to our ability to rigorously analyze their behavior as parameters vary. Recent work has considered the relationship between subgraph centrality, which is defined using the matrix exponential $f(x) = \exp(x)$, and the walk structure of a network. In a walk-regular graph, the number of closed walks of each length must be the same for all nodes, implying uniform $f$-subgraph centralities for any $f$ (or maximum $f$-$\textit{walk entropy}$). We consider when non--walk-regular graphs can achieve maximum entropy, calling such graphs $\textit{entropic}$. For parameterized measures, we are also interested in which values of the parameter witness this uniformity. To date, only one entropic graph has been identified, with only two witnessing parameter values, raising the question of how many such graphs and parameters exist. We resolve these questions by constructing infinite families of entropic graphs, as well as a family of witnessing parameters with a limit point at zero.
Submitted 4 February, 2019; v1 submitted 16 April, 2018;
originally announced April 2018.
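The abstract above notes that in a walk-regular graph every node has the same number of closed walks of each length, so subgraph centrality (the diagonal of $\exp(A)$ for adjacency matrix $A$) is uniform. The sketch below illustrates that statement on two small graphs. It assumes a truncated Taylor series is accurate enough for matrices this small, and the helper names (`mat_exp`, `subgraph_centrality`, `cycle_adjacency`, `path_adjacency`) are hypothetical, not taken from the paper.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(a, terms=30):
    """Approximate exp(A) by the truncated Taylor series sum_{k < terms} A^k / k!."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds A^k / k!, starting at k = 0
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def subgraph_centrality(a):
    """Diagonal of exp(A): a weighted count of closed walks at each node."""
    e = mat_exp(a)
    return [e[i][i] for i in range(len(a))]


def cycle_adjacency(n):
    """Adjacency matrix of the n-cycle, which is walk-regular."""
    return [[1.0 if (i - j) % n in (1, n - 1) else 0.0 for j in range(n)]
            for i in range(n)]


def path_adjacency(n):
    """Adjacency matrix of the n-vertex path, which is not walk-regular."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
            for i in range(n)]


cycle_scores = subgraph_centrality(cycle_adjacency(5))
path_scores = subgraph_centrality(path_adjacency(4))
print(cycle_scores)  # equal up to floating-point rounding
print(path_scores)   # endpoints score lower than interior nodes
```

The path is included only as a contrast case: it is neither walk-regular nor entropic, so its centralities differ across nodes.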
-
A practical fpt algorithm for Flow Decomposition and transcript assembly
Authors:
Kyle Kloster,
Philipp Kuinke,
Michael P. O'Brien,
Felix Reidl,
Fernando Sánchez Villaamil,
Blair D. Sullivan,
Andrew van der Poel
Abstract:
The Flow Decomposition problem, which asks for the smallest set of weighted paths that "covers" a flow on a DAG, has recently been used as an important computational step in transcript assembly. We prove the problem is in FPT when parameterized by the number of paths by giving a practical linear fpt algorithm. Further, we implement and engineer a Flow Decomposition solver based on this algorithm, and evaluate its performance on RNA-sequence data. Crucially, our solver finds exact solutions while achieving runtimes competitive with a state-of-the-art heuristic. Finally, we contextualize our design choices with two hardness results related to preprocessing and weight recovery. Specifically, $k$-Flow Decomposition does not admit polynomial kernels under standard complexity assumptions, and the related problem of assigning (known) weights to a given set of paths is NP-hard.
Submitted 30 August, 2017; v1 submitted 23 June, 2017;
originally announced June 2017.
-
Scalable and Robust Local Community Detection via Adaptive Subgraph Extraction and Diffusions
Authors:
Kyle Kloster,
Yixuan Li
Abstract:
Local community detection, the problem of identifying a set of relevant nodes nearby a small set of input seed nodes, is an important graph primitive with a wealth of applications and research activity. Recent approaches include using local spectral information, graph diffusions, and random walks to determine a community from input seeds. As networks grow to billions of nodes and exhibit diverse structures, it is important that community detection algorithms are not only efficient, but also robust to different structural features.
Toward this goal, we explore pre-processing techniques and modifications to existing local methods aimed at improving the scalability and robustness of algorithms related to community detection. Experiments show that our modifications improve both speed and quality of existing methods for locating ground truth communities, and are more robust across graphs and communities of varying sizes, densities, and diameters. Our subgraph extraction method uses adaptively selected PageRank parameters to improve on the recall and runtime of a walk-based pre-processing technique of Li et al. for extracting subgraphs before searching for a community. We then use this technique to enable the first scalable implementation of the recent Local Fiedler method of Mahoney et al. Our experimental evaluation shows our pre-processed version of Local Fiedler, as well as our novel simplification of the LEMON community detection framework of Li et al., offer significant speedups over their predecessors and obtain cluster quality competitive with the state of the art.
Submitted 16 November, 2016;
originally announced November 2016.
-
AptRank: An Adaptive PageRank Model for Protein Function Prediction on Bi-relational Graphs
Authors:
Biaobin Jiang,
Kyle Kloster,
David F. Gleich,
Michael Gribskov
Abstract:
Diffusion-based network models are widely used for protein function prediction using protein network data and have been shown to outperform neighborhood- and module-based methods. Recent studies have shown that integrating the hierarchical structure of the Gene Ontology (GO) data dramatically improves prediction accuracy. However, previous methods usually either used the GO hierarchy to refine the prediction results of multiple classifiers, or flattened the hierarchy into a function-function similarity kernel. No study has taken the GO hierarchy into account together with the protein network as a two-layer network model.
We first construct a Bi-relational graph (Birg) model comprised of both protein-protein association and function-function hierarchical networks. We then propose two diffusion-based methods, BirgRank and AptRank, both of which use PageRank to diffuse information on this two-layer graph model. BirgRank is an application of traditional PageRank with fixed decay parameters. In contrast, AptRank uses an adaptive mechanism to improve the performance of BirgRank. We evaluate both methods in predicting protein function on yeast, fly, and human datasets, and compare with four previous methods: GeneMANIA, TMC, ProteinRank and clusDCA. We design three validation strategies: missing function prediction, de novo function prediction, and guided function prediction to comprehensively evaluate all six methods. We find that both BirgRank and AptRank outperform the others, especially in missing function prediction when using only 10% of the data for training.
AptRank combines protein-protein associations and the GO function-function hierarchy into a two-layer network model without flattening the hierarchy into a similarity kernel. Introducing an adaptive mechanism to the traditional, fixed-parameter model of PageRank greatly improves the accuracy of protein function prediction.
Submitted 22 May, 2016; v1 submitted 20 January, 2016;
originally announced January 2016.
-
Localization in Seeded PageRank
Authors:
David F. Gleich,
Kyle Kloster,
Huda Nassar
Abstract:
Seeded PageRank is an important network analysis tool for identifying and studying regions near a given set of nodes, which are called seeds. The seeded PageRank vector is the stationary distribution of a random walk that randomly resets at the seed nodes. Intuitively, this vector is concentrated near the given seeds, but is mathematically non-zero for all nodes in a connected graph. We study this concentration, or localization, and show a sublinear upper bound on the number of entries required to approximate seeded PageRank on all graphs with a natural type of skewed-degree sequence, similar to those that arise in many real-world networks. Experiments with both real-world and synthetic graphs give further evidence for the idea that the degree sequence of a graph has a major influence on the localization behavior of seeded PageRank. Moreover, we establish that this localization is non-trivial by showing that complete bipartite graphs produce seeded PageRank vectors that cannot be approximated with a sublinear number of non-zeros.
Submitted 22 May, 2017; v1 submitted 31 August, 2015;
originally announced September 2015.
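The abstract above describes the seeded PageRank vector as the stationary distribution of a random walk that resets at the seed nodes, and notes that it is non-zero on every node of a connected graph. A minimal sketch of that object follows, using plain power iteration on $x = \alpha P x + (1-\alpha) s$. Power iteration is a swapped-in textbook method, not the paper's analysis or any localized algorithm. The function name `seeded_pagerank`, the adjacency-dictionary format, the value `alpha=0.85`, and the example path graph are assumptions for illustration.

```python
def seeded_pagerank(adj, seeds, alpha=0.85, iters=200):
    """Power iteration for x = alpha * P x + (1 - alpha) * s.

    Assumptions: `adj` maps each vertex to a non-empty neighbor list of an
    undirected graph, P is the uniform random-walk transition matrix, and
    s is the uniform distribution over `seeds`.
    """
    s = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    x = dict(s)
    for _ in range(iters):
        # Reset step: a (1 - alpha) fraction of the mass returns to the seeds.
        nxt = {v: (1.0 - alpha) * s[v] for v in adj}
        # Walk step: each node splits alpha * x[u] evenly among its neighbors.
        for u in adj:
            share = alpha * x[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        x = nxt
    return x


# Illustrative graph: a path 0-1-2-3-4-5 with the single seed at vertex 0.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
ppr = seeded_pagerank(path, {0})
for v in sorted(ppr):
    print(v, ppr[v])
```

On this example every entry is positive, as the abstract states for connected graphs, while most of the mass stays within a few steps of the seed.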
-
Seeded PageRank Solution Paths
Authors:
Kyle Kloster,
David F. Gleich
Abstract:
We study the behavior of network diffusions based on the PageRank random walk from a set of seed nodes. These diffusions are known to reveal small, localized clusters (or communities) and also large macro-scale clusters by varying a parameter that has a dual-interpretation as an accuracy bound and as a regularization level. We propose a new method that quickly approximates the result of the diffusion for all values of this parameter. Our method efficiently generates an approximate $\textit{solution path}$ or $\textit{regularization path}$ associated with a PageRank diffusion, and it reveals cluster structures at multiple size-scales between small and large. We formally prove a runtime bound on this method that is independent of the size of the network, and we investigate multiple optimizations to our method that can be more practical in some settings. We demonstrate that these methods identify refined clustering structure on a number of real-world networks with up to 2 billion edges.
Submitted 11 December, 2015; v1 submitted 1 March, 2015;
originally announced March 2015.
-
Heat kernel based community detection
Authors:
Kyle Kloster,
David F. Gleich
Abstract:
The heat kernel is a particular type of graph diffusion that, like the much-used personalized PageRank diffusion, is useful in identifying a community nearby a starting seed node. We present the first deterministic, local algorithm to compute this diffusion and use that algorithm to study the communities that it produces. Our algorithm is formally a relaxation method for solving a linear system to estimate the matrix exponential in a degree-weighted norm. We prove that this algorithm stays localized in a large graph and has a worst-case constant runtime that depends only on the parameters of the diffusion, not the size of the graph. Our experiments on real-world networks indicate that the communities produced by this method have better conductance than those produced by PageRank, although they take slightly longer to compute on large graphs. On a real-world community identification task, the heat kernel communities perform better than those from the PageRank diffusion.
Submitted 15 November, 2016; v1 submitted 12 March, 2014;
originally announced March 2014.
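The abstract above treats the heat kernel as a graph diffusion built from the matrix exponential. For reference, the sketch below evaluates the common form $h = e^{-t} \sum_{k \ge 0} \frac{t^k}{k!} P^k s$ with a truncated Taylor series on a tiny graph. This is a dense, non-local stand-in chosen for clarity; it is not the deterministic local relaxation method the abstract describes. The function name `heat_kernel`, the choice `t=3.0`, the truncation at 40 terms, and the example graph are assumptions.

```python
import math


def heat_kernel(adj, seed, t=3.0, terms=40):
    """Truncated Taylor series for h = exp(-t) * sum_k (t^k / k!) * P^k s.

    Assumptions: `adj` maps each vertex to a non-empty neighbor list of an
    undirected graph, P is the uniform random-walk transition matrix, and
    s is the indicator vector of `seed`.
    """
    walk = {v: 0.0 for v in adj}   # holds P^k s, starting at k = 0
    walk[seed] = 1.0
    weight = math.exp(-t)          # holds exp(-t) * t^k / k!
    h = {v: weight * walk[v] for v in adj}
    for k in range(1, terms):
        nxt = {v: 0.0 for v in adj}
        for u in adj:
            if walk[u] == 0.0:
                continue
            share = walk[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        walk = nxt
        weight *= t / k
        for v in adj:
            h[v] += weight * walk[v]
    return h


# Illustrative graph: a path 0-1-2-3-4-5 with the seed at vertex 0.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
hk = heat_kernel(path, 0)
for v in sorted(hk):
    print(v, hk[v])
```

The Poisson weights $e^{-t} t^k / k!$ decay quickly once $k$ exceeds $t$, which is why a short truncation suffices here; the appropriate number of terms for other values of $t$ is left as an assumption of the sketch.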
-
Sublinear Column-wise Actions of the Matrix Exponential on Social Networks
Authors:
Kyle Kloster,
David F. Gleich
Abstract:
We consider stochastic transition matrices from large social and information networks. For these matrices, we describe and evaluate three fast methods to estimate one column of the matrix exponential. The methods are designed to exploit the properties inherent in social networks, such as a power-law degree distribution. Using only this property, we prove that one of our algorithms has a sublinear runtime. We present further experimental evidence showing that all of them run quickly on social networks with billions of edges and accurately identify the largest elements of the column.
Submitted 1 March, 2015; v1 submitted 12 October, 2013;
originally announced October 2013.