A Sublinear Algorithm for Approximate Shortest Paths in Large Networks

Authors: Sabyasachi Basu, Nadia Kōshima, Talya Eden, Omri Ben-Eliezer, C. Seshadhri

Abstract: Computing distances and finding shortest paths in massive real-world networks is a fundamental algorithmic task in network analysis. There are two main approaches to solving this task. On one hand are traversal-based algorithms like bidirectional breadth-first search (BiBFS) with no preprocessing step and slow individual distance inquiries. On the other hand are indexing-based approaches, which maintain a large index. This allows for answering individual inquiries very fast; however, index creation is prohibitively expensive. We seek to bridge these two extremes: quickly answer distance inquiries without the need for costly preprocessing. In this work, we propose a new algorithm and data structure, WormHole, for approximate shortest path computations. WormHole leverages structural properties of social networks to build a sublinearly sized index, drawing upon the explicit core-periphery decomposition of Ben-Eliezer et al. Empirically, the preprocessing time of WormHole improves upon index-based solutions by orders of magnitude, and individual inquiries are consistently much faster than in BiBFS. The acceleration comes at the cost of a minor accuracy trade-off. Nonetheless, our empirical evidence demonstrates that WormHole accurately answers essentially all inquiries within a maximum additive error of 2. We complement these empirical results with provable theoretical guarantees, showing that WormHole requires $n^{o(1)}$ node queries per distance inquiry in random power-law networks. In contrast, any approach without a preprocessing step requires $n^{\Omega(1)}$ queries for the same task. WormHole does not require reading the whole graph. Unlike the vast majority of index-based algorithms, it returns paths, not just distances. For faster inquiry times, it can be combined effectively with other index-based solutions, by running them only on the sublinear core.
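
For readers unfamiliar with the traversal-based baseline named in the abstract above, bidirectional BFS can be sketched as follows. This is a generic textbook version, not the WormHole algorithm and not the authors' implementation; the function name `bibfs_distance` and the adjacency-dictionary input `adj` are assumptions made for illustration.

```python
def bibfs_distance(adj, s, t):
    """Hop distance between s and t in an undirected, unweighted graph.

    adj maps each vertex to an iterable of its neighbours (hypothetical
    input format). Returns None when s and t are not connected.
    """
    if s == t:
        return 0
    dist_s, dist_t = {s: 0}, {t: 0}
    frontier_s, frontier_t = [s], [t]
    while frontier_s and frontier_t:
        # Always grow the smaller frontier to keep the explored region small.
        grow_s = len(frontier_s) <= len(frontier_t)
        if grow_s:
            frontier, dist, other = frontier_s, dist_s, dist_t
        else:
            frontier, dist, other = frontier_t, dist_t, dist_s
        best = None
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v in other:
                    # The two searches meet across edge (u, v).
                    cand = dist[u] + 1 + other[v]
                    if best is None or cand < best:
                        best = cand
                if v not in dist:
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        if best is not None:
            # Finish the whole level before answering, so the minimum is exact.
            return best
        if grow_s:
            frontier_s = next_frontier
        else:
            frontier_t = next_frontier
    return None
```

Each call starts from scratch and may touch a large part of the graph, which is the slow per-inquiry cost the abstract contrasts with index-based methods.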

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2405.00262 [pdf, other]

Improved Massively Parallel Triangle Counting in $O(1)$ Rounds

Authors: Quanquan C. Liu, C. Seshadhri

Abstract: In this short note, we give a novel algorithm for $O(1)$ round triangle counting in bounded arboricity graphs. Counting triangles in $O(1)$ rounds (exactly) is listed as one of the interesting remaining open problems in the recent survey of Im et al. [IKLMV23]. The previous paper of Biswas et al. [BELMR20], which achieved the best bounds under this setting, used $O(\log \log n)$ rounds in sublinear space per machine and $O(m\alpha)$ total space where $\alpha$ is the arboricity of the graph and $n$ and $m$ are the number of vertices and edges in the graph, respectively. Our new algorithm is very simple, achieves the optimal $O(1)$ rounds without increasing the space per machine and the total space, and has the potential of being easily implementable in practice.

Submitted 30 April, 2024; originally announced May 2024.

Comments: To appear in PODC 2024

arXiv:2311.09584 [pdf, other]

A Dichotomy Hierarchy Characterizing Linear Time Subgraph Counting in Bounded Degeneracy Graphs

Authors: Daniel Paul-Pena, C. Seshadhri

Abstract: Subgraph and homomorphism counting are fundamental algorithmic problems. Given a constant-sized pattern graph $H$ and a large input graph $G$, we wish to count the number of $H$-homomorphisms/subgraphs in $G$. Given the massive sizes of real-world graphs and the practical importance of counting problems, we focus on when (near) linear time algorithms are possible. The seminal work of Chiba-Nishizeki (SICOMP 1985) shows that for bounded degeneracy graphs $G$, clique and $4$-cycle counting can be done in linear time. Recent works (Bera et al., SODA 2021, JACM 2022) show a dichotomy theorem characterizing the patterns $H$ for which $H$-homomorphism counting is possible in linear time, for bounded degeneracy inputs $G$. At the other end, Nešetřil and Ossona de Mendez used their deep theory of "sparsity" to define bounded expansion graphs. They prove that, for all $H$, $H$-homomorphism counting can be done in linear time for bounded expansion inputs. What lies between? For a specific $H$, can we characterize input classes where $H$-homomorphism counting is possible in linear time? We discover a hierarchy of dichotomy theorems that precisely answer the above questions. We show the existence of an infinite sequence of graph classes $\mathcal{G}_0 \supseteq \mathcal{G}_1 \supseteq \cdots \supseteq \mathcal{G}_\infty$ where $\mathcal{G}_0$ is the class of bounded degeneracy graphs, and $\mathcal{G}_\infty$ is the class of bounded expansion graphs. Fix any constant sized pattern graph $H$. Let $LICL(H)$ denote the length of the longest induced cycle in $H$. We prove the following. If $LICL(H) < 3(r+2)$, then $H$-homomorphisms can be counted in linear time for inputs in $\mathcal{G}_r$. If $LICL(H) \geq 3(r+2)$, then $H$-homomorphism counting on inputs from $\mathcal{G}_r$ takes $\Omega(m^{1+\gamma})$ time. We prove similar dichotomy theorems for subgraph counting.

Submitted 16 November, 2023; originally announced November 2023.

arXiv:2304.01416 [pdf, other]

A $d^{1/2+o(1)}$ Monotonicity Tester for Boolean Functions on $d$-Dimensional Hypergrids

Authors: Hadley Black, Deeparnab Chakrabarty, C. Seshadhri

Abstract: Monotonicity testing of Boolean functions on the hypergrid, $f:[n]^d \to \{0,1\}$, is a classic topic in property testing. Determining the non-adaptive complexity of this problem is an important open question. For arbitrary $n$, [Black-Chakrabarty-Seshadhri, SODA 2020] describe a tester with query complexity $\widetilde{O}(\varepsilon^{-4/3}d^{5/6})$. This complexity is independent of $n$, but has a suboptimal dependence on $d$. Recently, [Braverman-Khot-Kindler-Minzer, ITCS 2023] and [Black-Chakrabarty-Seshadhri, STOC 2023] describe $\widetilde{O}(\varepsilon^{-2} n^3\sqrt{d})$ and $\widetilde{O}(\varepsilon^{-2} n\sqrt{d})$-query testers, respectively. These testers have an almost optimal dependence on $d$, but a suboptimal polynomial dependence on $n$. In this paper, we describe a non-adaptive, one-sided monotonicity tester with query complexity $O(\varepsilon^{-2} d^{1/2 + o(1)})$, independent of $n$. Up to the $d^{o(1)}$-factors, our result resolves the non-adaptive complexity of monotonicity testing for Boolean functions on hypergrids. The independence of $n$ yields a non-adaptive, one-sided $O(\varepsilon^{-2} d^{1/2 + o(1)})$-query monotonicity tester for Boolean functions $f:\mathbb{R}^d \to \{0,1\}$ associated with an arbitrary product measure.

Submitted 3 April, 2023; originally announced April 2023.

arXiv:2303.14550 [pdf, other]

Theoretical bounds on the network community profile from low-rank semi-definite programming

Authors: Yufan Huang, C. Seshadhri, David F. Gleich

Abstract: We study a new connection between a technical measure called $\mu$-conductance that arises in the study of Markov chains for sampling convex bodies and the network community profile that characterizes size-resolved properties of clusters and communities in social and information networks. The idea of $\mu$-conductance is similar to the traditional graph conductance, but disregards sets with small volume. We derive a sequence of optimization problems including a low-rank semi-definite program from which we can derive a lower bound on the optimal $\mu$-conductance value. These ideas give the first theoretically sound bound on the behavior of the network community profile for a wide range of cluster sizes. The algorithm scales up to graphs with hundreds of thousands of nodes and we demonstrate how our framework validates the predicted structures of real-world graphs.

Submitted 25 March, 2023; originally announced March 2023.

arXiv:2211.08605 [pdf, other]

A Dichotomy Theorem for Linear Time Homomorphism Orbit Counting in Bounded Degeneracy Graphs

Authors: Daniel Paul-Pena, C. Seshadhri

Abstract: Counting the number of homomorphisms of a pattern graph H in a large input graph G is a fundamental problem in computer science. There are myriad applications of this problem in databases, graph algorithms, and network science. Often, we need more than just the total count. Especially in large network analysis, we wish to compute, for each vertex v of G, the number of H-homomorphisms that v participates in. This problem is referred to as homomorphism orbit counting, as it relates to the orbits of vertices of H under its automorphisms. Given the need for fast algorithms for this problem, we study when near-linear time algorithms are possible. A natural restriction is to assume that the input graph G has bounded degeneracy, a commonly observed property in modern massive networks. Can we characterize the patterns H for which homomorphism orbit counting can be done in linear time? We discover a dichotomy theorem that resolves this problem. For pattern H, let l be the length of the longest induced path between any two vertices of the same orbit (under the automorphisms of H). If l <= 5, then H-homomorphism orbit counting can be done in linear time for bounded degeneracy graphs. If l > 5, then (assuming fine-grained complexity conjectures) there is no near-linear time algorithm for this problem. We build on existing work on dichotomy theorems for counting the total H-homomorphism count. Somewhat surprisingly, there exist (and we characterize) patterns H for which the total homomorphism count can be computed in linear time, but the corresponding orbit counting problem cannot be done in near-linear time.

Submitted 15 November, 2022; originally announced November 2022.

arXiv:2211.06352 [pdf, other]

Spectral Triadic Decompositions of Real-World Networks

Authors: Sabyasachi Basu, Suman Kalyan Bera, C. Seshadhri

Abstract: A fundamental problem in mathematics and network analysis is to find conditions under which a graph can be partitioned into smaller pieces. The most important tool for this partitioning is the Fiedler vector or discrete Cheeger inequality. These results relate the graph spectrum (eigenvalues of the normalized adjacency matrix) to the ability to break a graph into two pieces, with few edge deletions. An entire subfield of mathematics, called spectral graph theory, has emerged from these results. Yet these results do not say anything about the rich community structure exhibited by real-world networks, which typically have a significant fraction of edges contained in numerous densely clustered blocks. Inspired by the properties of real-world networks, we discover a new spectral condition that relates eigenvalue powers to a network decomposition into densely clustered blocks. We call this the spectral triadic decomposition. Our relationship exactly predicts the existence of community structure, as commonly seen in real networked data. Our proof provides an efficient algorithm to produce the spectral triadic decomposition. We observe on numerous social, coauthorship, and citation network datasets that these decompositions have significant correlation with semantically meaningful communities.

Submitted 8 May, 2024; v1 submitted 11 November, 2022; originally announced November 2022.

arXiv:2211.05281 [pdf, other]

Directed Isoperimetric Theorems for Boolean Functions on the Hypergrid and an $\widetilde{O}(n\sqrt{d})$ Monotonicity Tester

Authors: Hadley Black, Deeparnab Chakrabarty, C. Seshadhri

Abstract: The problem of testing monotonicity for Boolean functions on the hypergrid, $f:[n]^d \to \{0,1\}$, is a classic topic in property testing. When $n=2$, the domain is the hypercube. For the hypercube case, a breakthrough result of Khot-Minzer-Safra (FOCS 2015) gave a non-adaptive, one-sided tester making $\widetilde{O}(\varepsilon^{-2}\sqrt{d})$ queries. Up to polylog $d$ and $\varepsilon$ factors, this bound matches the $\widetilde{\Omega}(\sqrt{d})$-query non-adaptive lower bound (Chen-De-Servedio-Tan (STOC 2015), Chen-Waingarten-Xie (STOC 2017)). For any $n > 2$, the optimal non-adaptive complexity was unknown. A previous result of the authors achieves a $\widetilde{O}(d^{5/6})$-query upper bound (SODA 2020), quite far from the $\sqrt{d}$ bound for the hypercube. In this paper, we resolve the non-adaptive complexity of monotonicity testing for all constant $n$, up to $\text{poly}(\varepsilon^{-1}\log d)$ factors. Specifically, we give a non-adaptive, one-sided monotonicity tester making $\widetilde{O}(\varepsilon^{-2}n\sqrt{d})$ queries. From a technical standpoint, we prove new directed isoperimetric theorems over the hypergrid $[n]^d$. These results generalize the celebrated directed Talagrand inequalities that were only known for the hypercube.

Submitted 9 November, 2022; originally announced November 2022.

arXiv:2201.08481 [pdf, other]

Classic Graph Structural Features Outperform Factorization-Based Graph Embedding Methods on Community Labeling

Authors: Andrew Stolman, Caleb Levy, C. Seshadhri, Aneesh Sharma

Abstract: Graph representation learning (also called graph embeddings) is a popular technique for incorporating network structure into machine learning models. Unsupervised graph embedding methods aim to capture graph structure by learning a low-dimensional vector representation (the embedding) for each node. Despite the widespread use of these embeddings for a variety of downstream transductive machine learning tasks, there is little principled analysis of the effectiveness of this approach for common tasks. In this work, we provide an empirical and theoretical analysis for the performance of a class of embeddings on the common task of pairwise community labeling. This is a binary variant of the classic community detection problem, which seeks to build a classifier to determine whether a pair of vertices participate in a community. In line with our goal of foundational understanding, we focus on a popular class of unsupervised embedding techniques that learn low rank factorizations of a vertex proximity matrix (this class includes methods like GraRep, DeepWalk, node2vec, NetMF). We perform detailed empirical analysis for community labeling over a variety of real and synthetic graphs with ground truth. In all cases we studied, the models trained from embedding features perform poorly on community labeling. In contrast, a simple logistic model with classic graph structural features handily outperforms the embedding models. For a more principled understanding, we provide a theoretical analysis for the (in)effectiveness of these embeddings in capturing the community structure. We formally prove that popular low-dimensional factorization methods either cannot produce community structure, or can only produce "unstable" communities. These communities are inherently unstable under small perturbations.

Submitted 20 January, 2022; originally announced January 2022.

arXiv:2108.10547 [pdf, ps, other]

doi 10.1137/1.9781611977073.69

The complexity of testing all properties of planar graphs, and the role of isomorphism

Authors: Sabyasachi Basu, Akash Kumar, C. Seshadhri

Abstract: Consider property testing on bounded degree graphs and let $\varepsilon>0$ denote the proximity parameter. A remarkable theorem of Newman-Sohler (SICOMP 2013) asserts that all properties of planar graphs (more generally hyperfinite) are testable with query complexity only depending on $\varepsilon$. Recent advances in testing minor-freeness have proven that all additive and monotone properties of planar graphs can be tested in $poly(\varepsilon^{-1})$ queries. Some properties falling outside this class, such as Hamiltonicity, also have a similar complexity for planar graphs. Motivated by these results, we ask: can all properties of planar graphs be tested in $poly(\varepsilon^{-1})$ queries? Is there a uniform query complexity upper bound for all planar properties, and what is the "hardest" such property to test? We discover a surprisingly clean and optimal answer. Any property of bounded degree planar graphs can be tested in $\exp(O(\varepsilon^{-2}))$ queries. Moreover, there is a matching lower bound, up to constant factors in the exponent. The natural property of testing isomorphism to a fixed graph needs $\exp(\Omega(\varepsilon^{-2}))$ queries, thereby showing that (up to polynomial dependencies) isomorphism to an explicit fixed graph is the hardest property of planar graphs. The upper bound is a straightforward adaptation of the Newman-Sohler analysis that tracks dependencies on $\varepsilon$ carefully. The main technical contribution is the lower bound construction, which is achieved by a special family of planar graphs that are all mutually far from each other. We can also apply our techniques to get analogous results for bounded treewidth graphs. We prove that all properties of bounded treewidth graphs can be tested in $\exp(O(\varepsilon^{-1}\log \varepsilon^{-1}))$ queries. Moreover, testing isomorphism to a fixed forest requires $\exp(\Omega(\varepsilon^{-1}))$ queries.

Submitted 25 August, 2021; v1 submitted 24 August, 2021; originally announced August 2021.

Journal ref: Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 1702-1714

arXiv:2106.02762 [pdf, other]

Faster and Generalized Temporal Triangle Counting, via Degeneracy Ordering

Authors: Noujan Pashanasangi, C. Seshadhri

Abstract: Triangle counting is a fundamental technique in network analysis that has received much attention in various input models. The vast majority of triangle counting algorithms are targeted to static graphs. Yet, many real-world graphs are directed and temporal, where edges come with timestamps. Temporal triangles yield much more information, since they account for both the graph topology and the timestamps. Temporal triangle counting has seen a few recent results, but there are varying definitions of temporal triangles. In all cases, temporal triangle patterns enforce constraints on the time interval between edges (in the triangle). We define a general notion of $(\delta_{1,3}, \delta_{1,2}, \delta_{2,3})$-temporal triangles that allows for separate time constraints for all pairs of edges. Our main result is a new algorithm, DOTTT (Degeneracy Oriented Temporal Triangle Totaler), that exactly counts all directed variants of $(\delta_{1,3}, \delta_{1,2}, \delta_{2,3})$-temporal triangles. Using the classic idea of degeneracy ordering with careful combinatorial arguments, we can prove that DOTTT runs in $O(m\kappa\log m)$ time, where $m$ is the number of (temporal) edges and $\kappa$ is the graph degeneracy (max core number). Up to log factors, this matches the running time of the best static triangle counters. Moreover, this running time is better than that of existing temporal triangle counters. DOTTT has excellent practical behavior and runs twice as fast as existing state-of-the-art temporal triangle counters (and is also more general). For example, DOTTT computes all types of temporal queries in a Bitcoin temporal network with half a billion edges in less than an hour on a commodity machine.
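
The running-time bound in the abstract above is stated in terms of the degeneracy (max core number) of the graph. As a minimal sketch of that one ingredient only, the standard peeling procedure below repeatedly removes a minimum-degree vertex; the largest degree seen at removal time is the degeneracy. This is not DOTTT, and the function name `degeneracy_ordering` and the input format (a dict mapping each vertex to the set of its neighbours in the underlying simple undirected graph, with comparable vertex labels) are assumptions for illustration.

```python
import heapq

def degeneracy_ordering(adj):
    """Return (order, degeneracy) for a simple undirected graph.

    adj: dict mapping every vertex to the set of its neighbours
    (hypothetical input format). Uses a heap with lazy deletion.
    """
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed = set()
    order = []
    degeneracy = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry left over from an earlier degree
        removed.add(v)
        order.append(v)
        degeneracy = max(degeneracy, d)
        for w in adj[v]:
            if w not in removed:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], w))
    return order, degeneracy
```

Orienting every edge from the earlier to the later vertex in `order` gives each vertex at most `degeneracy` out-neighbours, which is the property that degeneracy-based counting arguments rely on.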

Submitted 4 June, 2021; originally announced June 2021.

Comments: To be published in KDD 2021

arXiv:2104.11079 [pdf, other]

doi 10.2172/1807223

Randomized Algorithms for Scientific Computing (RASC)

Authors: Aydin Buluc, Tamara G. Kolda, Stefan M. Wild, Mihai Anitescu, Anthony DeGennaro, John Jakeman, Chandrika Kamath, Ramakrishnan Kannan, Miles E. Lopes, Per-Gunnar Martinsson, Kary Myers, Jelani Nelson, Juan M. Restrepo, C. Seshadhri, Draguna Vrabie, Brendt Wohlberg, Stephen J. Wright, Chao Yang, Peter Zwart

Abstract: Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and scalability. This report summarizes the outcomes of the workshop "Randomized Algorithms for Scientific Computing (RASC)," held virtually across four days in December 2020 and January 2021.

Submitted 21 March, 2022; v1 submitted 19 April, 2021; originally announced April 2021.

arXiv:2102.00556 [pdf, ps, other]

Random walks and forbidden minors III: poly(d/ε)-time partition oracles for minor-free graph classes

Authors: Akash Kumar, C. Seshadhri, Andrew Stolman

Abstract: Consider the family of bounded degree graphs in any minor-closed family (such as planar graphs). Let d be the degree bound and n be the number of vertices of such a graph. Graphs in these classes have hyperfinite decompositions, where, for a sufficiently small ε > 0, one removes εdn edges to get connected components of size independent of n. An important tool for sublinear algorithms and property testing for such classes is the partition oracle, introduced by the seminal work of Hassidim-Kelner-Nguyen-Onak (FOCS 2009). A partition oracle is a local procedure that gives consistent access to a hyperfinite decomposition, without any preprocessing. Given a query vertex v, the partition oracle outputs the component containing v in time independent of n. All the answers are consistent with a single hyperfinite decomposition. The partition oracle of Hassidim et al. runs in time d^{poly(d/ε)} per query. They pose the open problem of whether poly(d/ε)-time partition oracles exist. Levi-Ron (ICALP 2013) give a refinement of the previous approach, to get a partition oracle that runs in time d^{log(d/ε)} per query. In this paper, we resolve this open problem and give poly(d/ε)-time partition oracles for bounded degree graphs in any minor-closed family. Unlike the previous line of work based on combinatorial methods, we employ techniques from spectral graph theory. We build on a recent spectral graph theoretical toolkit for minor-closed graph families, introduced by the authors to develop efficient property testers. A consequence of our result is a poly(d/ε)-query tester for any monotone and additive property of minor-closed families (such as bipartite planar graphs). Our result also gives poly(d/ε)-query algorithms for additive εn-approximations for problems such as maximum matching, minimum vertex cover, maximum independent set, and minimum dominating set for these graph families.

Submitted 2 May, 2021; v1 submitted 31 January, 2021; originally announced February 2021.

Comments: 31 pages

arXiv:2010.08083 [pdf, ps, other]

Near-Linear Time Homomorphism Counting in Bounded Degeneracy Graphs: The Barrier of Long Induced Cycles

Authors: Suman K. Bera, Noujan Pashanasangi, C. Seshadhri

Abstract: Counting homomorphisms of a constant sized pattern graph $H$ in an input graph $G$ is a fundamental computational problem. There is a rich history of studying the complexity of this problem, under various constraints on the input $G$ and the pattern $H$. Given the significance of this problem and the large sizes of modern inputs, we investigate when near-linear time algorithms are possible. We focus on the case when the input graph has bounded degeneracy, a commonly studied and practically relevant class for homomorphism counting. It is known from previous work that for certain classes of $H$, $H$-homomorphisms can be counted exactly in near-linear time in bounded degeneracy graphs. Can we precisely characterize the patterns $H$ for which near-linear time algorithms are possible? We completely resolve this problem, discovering a clean dichotomy using fine-grained complexity. Let $m$ denote the number of edges in $G$. We prove the following: if the largest induced cycle in $H$ has length at most $5$, then there is an $O(m\log m)$ algorithm for counting $H$-homomorphisms in bounded degeneracy graphs. If the largest induced cycle in $H$ has length at least $6$, then (assuming standard fine-grained complexity conjectures) there is a constant $\gamma > 0$, such that there is no $o(m^{1+\gamma})$ time algorithm for counting $H$-homomorphisms.
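
The dichotomy in the abstract above depends only on the length of the largest induced cycle in the pattern $H$. Because $H$ has constant size, that quantity can be found by brute force. The sketch below is an illustration under stated assumptions, not code from the paper: the function name `longest_induced_cycle` and the input format (vertex count `n` plus an edge list over vertices `0..n-1` of a simple graph) are hypothetical. A vertex subset induces a cycle exactly when the induced subgraph is connected and every vertex has induced degree 2.

```python
from itertools import combinations

def longest_induced_cycle(n, edges):
    """Length of the longest induced cycle in a simple graph, or 0 if acyclic.

    Exponential in n, so intended only for small pattern graphs.
    """
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for k in range(n, 2, -1):  # try the largest subsets first
        for subset in combinations(range(n), k):
            chosen = set(subset)
            # Every vertex must have exactly two neighbours inside the subset.
            if any(len(adj[v] & chosen) != 2 for v in subset):
                continue
            # A connected 2-regular graph is a single cycle.
            seen = {subset[0]}
            stack = [subset[0]]
            while stack:
                u = stack.pop()
                for w in adj[u] & chosen:
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
            if len(seen) == k:
                return k
    return 0
```

Under the stated dichotomy, a pattern with `longest_induced_cycle(n, edges) <= 5` falls on the near-linear side for bounded degeneracy inputs, while a value of 6 or more falls on the conditionally hard side.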

Submitted 18 November, 2020; v1 submitted 15 October, 2020; originally announced October 2020.

Comments: To be published in Symposium on Discrete Algorithms (SODA) 2021 Added conclusion section in the new version

arXiv:2010.05998 [pdf, ps, other]

Counting Subgraphs in Degenerate Graphs

Authors: Suman K. Bera, Lior Gishboliner, Yevgeny Levanzov, C. Seshadhri, Asaf Shapira

Abstract: We consider the problem of counting the number of copies of a fixed graph $H$ within an input graph $G$. This is one of the most well-studied algorithmic graph problems, with many theoretical and practical applications. We focus on solving this problem when the input $G$ has bounded degeneracy. This is a rich family of graphs, containing all graphs without a fixed minor (e.g. planar graphs), as well as graphs generated by various random processes (e.g. preferential attachment graphs). We say that $H$ is easy if there is a linear-time algorithm for counting the number of copies of $H$ in an input $G$ of bounded degeneracy. A seminal result of Chiba and Nishizeki from '85 states that every $H$ on at most 4 vertices is easy. Bera, Pashanasangi, and Seshadhri recently extended this to all $H$ on 5 vertices, and further proved that for every $k > 5$ there is a $k$-vertex $H$ which is not easy. They left open the natural problem of characterizing all easy graphs $H$. Bressan has recently introduced a framework for counting subgraphs in degenerate graphs, from which one can extract a sufficient condition for a graph $H$ to be easy. Here we show that this sufficient condition is also necessary, thus fully answering the Bera-Pashanasangi-Seshadhri problem. We further resolve two closely related problems; namely characterizing the graphs that are easy with respect to counting induced copies, and with respect to counting homomorphisms.

Submitted 9 December, 2021; v1 submitted 12 October, 2020; originally announced October 2020.

arXiv:2007.15743 [pdf, ps, other]

Distribution-Free Models of Social Networks

Authors: Tim Roughgarden, C. Seshadhri

Abstract: The structure of large-scale social networks has predominantly been articulated using generative models, a form of average-case analysis. This chapter surveys recent proposals of more robust models of such networks. These models posit deterministic and empirically supported combinatorial structure rather than a specific probability distribution. We discuss the formal definitions of these models and how they relate to empirical observations in social networks, as well as the known structural and algorithmic results for the corresponding graph classes.

Submitted 30 July, 2020; originally announced July 2020.

Comments: Chapter 28 of the book Beyond the Worst-Case Analysis of Algorithms, edited by Tim Roughgarden and published by Cambridge University Press (2020)

arXiv:2007.09768 [pdf, other]

FPT Algorithms for Finding Near-Cliques in $c$-Closed Graphs

Authors: Balaram Behera, Edin Husić, Shweta Jain, Tim Roughgarden, C. Seshadhri

Abstract: Finding large cliques or cliques missing a few edges is a fundamental algorithmic task in the study of real-world graphs, with applications in community detection, pattern recognition, and clustering. A number of effective backtracking-based heuristics for these problems have emerged from recent empirical work in social network analysis. Given the NP-hardness of variants of clique counting, these results raise a challenge for beyond worst-case analysis of these problems. Inspired by the triadic closure of real-world graphs, Fox et al. (SICOMP 2020) introduced the notion of $c$-closed graphs and proved that maximal clique enumeration is fixed-parameter tractable with respect to $c$. In practice, due to noise in data, one wishes to actually discover "near-cliques", which can be characterized as cliques with a sparse subgraph removed. In this work, we prove that many different kinds of maximal near-cliques can be enumerated in polynomial time (and FPT in $c$) for $c$-closed graphs. We study various established notions of such substructures, including $k$-plexes, complements of bounded-degeneracy and bounded-treewidth graphs. Interestingly, our algorithms follow relatively simple backtracking procedures, analogous to what is done in practice. Our results underscore the significance of the $c$-closed graph class for theoretical understanding of social network analysis.

Submitted 19 November, 2021; v1 submitted 19 July, 2020; originally announced July 2020.

Comments: Accepted to ITCS 2022

MSC Class: 68W01; 68R10; 05C85

arXiv:2006.13483 [pdf, other]

doi 10.1145/3366423.3380264

Provably and Efficiently Approximating Near-cliques using the Turán Shadow: PEANUTS

Authors: Shweta Jain, C. Seshadhri

Abstract: Clique and near-clique counts are important graph properties with applications in graph generation, graph modeling, graph analytics, and community detection, among others. They are the archetypal examples of dense subgraphs. While there are several different definitions of near-cliques, most of them share the attribute that they are cliques that are missing a small number of edges. Clique counting is itself considered a challenging problem. Counting near-cliques is significantly harder, more so since the search space for near-cliques is orders of magnitude larger than that of cliques. We give a formulation of a near-clique as a clique that is missing a constant number of edges. We exploit the fact that a near-clique contains a smaller clique, and use techniques for clique sampling to count near-cliques. This method allows us to count near-cliques with 1 or 2 missing edges, in graphs with tens of millions of edges. To the best of our knowledge, there was no known efficient method for this problem, and we obtain a 10x - 100x speedup over existing algorithms for counting near-cliques. Our main technique is a space-efficient adaptation of the Turán Shadow sampling approach, recently introduced by Jain and Seshadhri (WWW 2017). This approach constructs a large recursion tree (called the Turán Shadow) that represents cliques in a graph. We design a novel algorithm that builds an estimator for near-cliques, using an online, compact construction of the Turán Shadow.

Submitted 24 June, 2020; originally announced June 2020.

Comments: The Web Conference, 2020 (WWW)

arXiv:2006.11947 [pdf, other]

How to Count Triangles, without Seeing the Whole Graph

Authors: Suman K. Bera, C. Seshadhri

Abstract: Triangle counting is a fundamental problem in the analysis of large graphs. There is a rich body of work on this problem, in varying streaming and distributed models, yet all these algorithms require reading the whole input graph. In many scenarios, we do not have access to the whole graph, and can only sample a small portion of the graph (typically through crawling). In such a setting, how can we accurately estimate the triangle count of the graph? We formally study triangle counting in the {\em random walk} access model introduced by Dasgupta et al (WWW '14) and Chierichetti et al (WWW '16). We have access to an arbitrary seed vertex of the graph, and can only perform random walks. This model is restrictive in access and captures the challenges of collecting real-world graphs. Even sampling a uniform random vertex is a hard task in this model. Despite these challenges, we design a provable and practical algorithm, TETRIS, for triangle counting in this model. TETRIS is the first provably sublinear algorithm (for most natural parameter settings) that approximates the triangle count in the random walk model, for graphs with low mixing time. Our result builds on recent advances in the theory of sublinear algorithms. The final sample built by TETRIS is a careful mix of random walks and degree-biased sampling of neighborhoods. Empirically, TETRIS accurately counts triangles on a variety of large graphs, getting estimates within 5\% relative error by looking at 3\% of the number of edges.

Submitted 21 June, 2020; originally announced June 2020.

Comments: Accepted for publication in KDD 2020

arXiv:2003.13151 [pdf, ps, other]

How the Degeneracy Helps for Triangle Counting in Graph Streams

Authors: Suman K. Bera, C. Seshadhri

Abstract: We revisit the well-studied problem of triangle count estimation in graph streams. Given a graph represented as a stream of $m$ edges, our aim is to compute a $(1\pm\varepsilon)$-approximation to the triangle count $T$, using a small space algorithm. For arbitrary order and a constant number of passes, the space complexity is known to be essentially $Θ(\min(m^{3/2}/T, m/\sqrt{T}))$ (McGregor et al., PODS 2016, Bera et al., STACS 2017). We give a (constant pass, arbitrary order) streaming algorithm that can circumvent this lower bound for \emph{low degeneracy graphs}. The degeneracy, $κ$, is a nuanced measure of density, and the class of constant degeneracy graphs is immensely rich (containing planar graphs, minor-closed families, and preferential attachment graphs). We design a streaming algorithm with space complexity $\widetilde{O}(mκ/T)$. For constant degeneracy graphs, this bound is $\widetilde{O}(m/T)$, which is significantly smaller than both $m^{3/2}/T$ and $m/\sqrt{T}$. We complement our algorithmic result with a nearly matching lower bound of $Ω(mκ/T)$.
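
To make the degeneracy parameter $κ$ from the abstract above concrete, here is a minimal sketch of the standard peeling definition: repeatedly delete a vertex of minimum remaining degree, and report the largest degree seen at deletion time. This is not the paper's streaming algorithm. The function name `degeneracy` and the dict-of-sets graph representation are assumptions made for this illustration, and the loop is a simple quadratic-time version rather than the linear-time bucket implementation.

```python
def degeneracy(adj):
    """Return the degeneracy of an undirected graph.

    adj: dict mapping each vertex to the set of its neighbors
         (hypothetical representation chosen for this sketch).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    kappa = 0
    while remaining:
        # Peel a vertex of minimum degree in the remaining subgraph.
        v = min(remaining, key=degree.__getitem__)
        kappa = max(kappa, degree[v])
        remaining.remove(v)
        # Removing v lowers the degree of its surviving neighbors.
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return kappa


# A star has one high-degree hub but degeneracy 1; a 4-clique has degeneracy 3.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
print(degeneracy(star), degeneracy(k4))
```

The star example shows why degeneracy is a finer measure than maximum degree: the hub has degree 5, yet every subgraph contains a vertex of degree at most 1.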

Submitted 29 March, 2020; originally announced March 2020.

Comments: Accepted for publication in PODS'2020

arXiv:2003.12635 [pdf, other]

doi 10.1073/pnas.1911030117

The impossibility of low rank representations for triangle-rich complex networks

Authors: C. Seshadhri, Aneesh Sharma, Andrew Stolman, Ashish Goel

Abstract: The study of complex networks is a significant development in modern science, and has enriched the social sciences, biology, physics, and computer science. Models and algorithms for such networks are pervasive in our society, and impact human behavior via social networks, search engines, and recommender systems, to name a few. A widely used algorithmic technique for modeling such complex networks is to construct a low-dimensional Euclidean embedding of the vertices of the network, where proximity of vertices is interpreted as the likelihood of an edge. Contrary to the common view, we argue that such graph embeddings do not capture salient properties of complex networks. The two properties we focus on are low degree and large clustering coefficients, which have been widely established to be empirically true for real-world networks. We mathematically prove that any embedding (that uses dot products to measure similarity) that can successfully create these two properties must have rank nearly linear in the number of vertices. Among other implications, this establishes that popular embedding techniques such as Singular Value Decomposition and node2vec fail to capture significant structural aspects of real-world complex networks. Furthermore, we empirically study a number of different embedding techniques based on dot product, and show that they all fail to capture the triangle structure.

Submitted 27 March, 2020; originally announced March 2020.

Journal ref: PNAS, March 2020

arXiv:2001.06784 [pdf, other]

doi 10.1145/3336191.3371839

The Power of Pivoting for Exact Clique Counting

Authors: Shweta Jain, C. Seshadhri

Abstract: Clique counting is a fundamental task in network analysis, and even the simplest setting of $3$-cliques (triangles) has been the center of much recent research. Getting the count of $k$-cliques for larger $k$ is algorithmically challenging, due to the exponential blowup in the search space of large cliques. But a number of recent applications (especially for community detection or clustering) use larger clique counts. Moreover, one often desires \textit{local} counts, the number of $k$-cliques per vertex/edge. Our main result is Pivoter, an algorithm that exactly counts the number of $k$-cliques, \textit{for all values of $k$}. It is surprisingly effective in practice, and is able to get clique counts of graphs that were beyond the reach of previous work. For example, Pivoter gets all clique counts in a social network with 100M edges within two hours on a commodity machine. Previous parallel algorithms do not terminate in days. Pivoter can also feasibly get local per-vertex and per-edge $k$-clique counts (for all $k$) for many public data sets with tens of millions of edges. To the best of our knowledge, this is the first algorithm that achieves such results. The main insight is the construction of a Succinct Clique Tree (SCT) that stores a compressed unique representation of all cliques in an input graph. It is built using a technique called \textit{pivoting}, a classic approach by Bron-Kerbosch to reduce the recursion tree of backtracking algorithms for maximal cliques. Remarkably, the SCT can be built without actually enumerating all cliques, and provides a succinct data structure from which exact clique statistics ($k$-clique counts, local counts) can be read off efficiently.

Submitted 19 January, 2020; originally announced January 2020.

Comments: 10 pages, WSDM 2020

arXiv:1911.10616 [pdf, other]

Efficiently Counting Vertex Orbits of All 5-vertex Subgraphs, by EVOKE

Authors: Noujan Pashanasangi, C. Seshadhri

Abstract: Subgraph counting is a fundamental task in network analysis. Typically, algorithmic work is on total counting, where we wish to count the total frequency of a (small) pattern subgraph in a large input data set. But many applications require local counts (also called vertex orbit counts) wherein, for every vertex $v$ of the input graph, one needs the count of the pattern subgraph involving $v$. This provides a rich set of vertex features that can be used in machine learning tasks, especially classification and clustering. But getting local counts is extremely challenging. Even the easier problem of getting total counts has received much research attention. Local counts require algorithms that get much finer grained information, and the sheer output size makes it difficult to design scalable algorithms. We present EVOKE, a scalable algorithm that can determine vertex orbit counts for all 5-vertex pattern subgraphs. In other words, EVOKE exactly determines, for every vertex $v$ of the input graph and every 5-vertex subgraph $H$, the number of copies of $H$ that $v$ participates in. EVOKE can process graphs with tens of millions of edges, within an hour on a commodity machine. EVOKE is typically hundreds of times faster than previous state of the art algorithms, and gets results on datasets beyond the reach of previous methods. Theoretically, we generalize a recent "graph cutting" framework to get vertex orbit counts. This framework generates a collection of polynomial equations relating vertex orbit counts of larger subgraphs to those of smaller subgraphs. EVOKE carefully exploits the structure among these equations to rapidly count. We prove and empirically validate that EVOKE only has a small constant factor overhead over the best (total) 5-vertex subgraph counter.

Submitted 12 December, 2019; v1 submitted 24 November, 2019; originally announced November 2019.

Comments: We replaced the previous version with the full version

arXiv:1911.05896 [pdf, ps, other]

Linear Time Subgraph Counting, Graph Degeneracy, and the Chasm at Size Six

Authors: Suman K. Bera, Noujan Pashanasangi, C. Seshadhri

Abstract: We consider the problem of counting all $k$-vertex subgraphs in an input graph, for any constant $k$. This problem (denoted sub-cnt$_k$) has been studied extensively in both theory and practice. In a classic result, Chiba and Nishizeki (SICOMP 85) gave linear time algorithms for clique and 4-cycle counting for bounded degeneracy graphs. This is a rich class of sparse graphs that contains, for example, all minor-free families and preferential attachment graphs. The techniques from this result have inspired a number of recent practical algorithms for sub-cnt$_k$. Towards a better understanding of the limits of these techniques, we ask: for what values of $k$ can sub-cnt$_k$ be solved in linear time? We discover a chasm at $k=6$. Specifically, we prove that for $k < 6$, sub-cnt$_k$ can be solved in linear time. Assuming a standard conjecture in fine-grained complexity, we prove that for all $k \geq 6$, sub-cnt$_k$ cannot be solved even in near-linear time.

Submitted 27 November, 2019; v1 submitted 13 November, 2019; originally announced November 2019.

Comments: The previous version did not handle the case of k=8. We corrected that in this version

arXiv:1904.01055 [pdf, ps, other]

Random walks and forbidden minors II: A $\text{poly}(d\varepsilon^{-1})$-query tester for minor-closed properties of bounded-degree graphs

Authors: Akash Kumar, C. Seshadhri, Andrew Stolman

Abstract: Let $G$ be a graph with $n$ vertices and maximum degree $d$. Fix some minor-closed property $\mathcal{P}$ (such as planarity). We say that $G$ is $\varepsilon$-far from $\mathcal{P}$ if one has to remove $\varepsilon dn$ edges to make it have $\mathcal{P}$. The problem of property testing $\mathcal{P}$ was introduced in the seminal work of Benjamini-Schramm-Shapira (STOC 2008) that gave a tester with query complexity triply exponential in $\varepsilon^{-1}$. Levi-Ron (TALG 2015) have given the best tester to date, with a quasipolynomial (in $\varepsilon^{-1}$) query complexity. It is an open problem to get property testers whose query complexity is $\text{poly}(d\varepsilon^{-1})$, even for planarity. In this paper, we resolve this open question. For any minor-closed property, we give a tester with query complexity $d\cdot \text{poly}(\varepsilon^{-1})$. The previous line of work on (independent of $n$, two-sided) testers is primarily combinatorial. Our work, on the other hand, employs techniques from spectral graph theory. This paper is a continuation of recent work of the authors (FOCS 2018) analyzing random walk algorithms that find forbidden minors.

Submitted 1 April, 2019; originally announced April 2019.

arXiv:1811.04425 [pdf, ps, other]

Faster sublinear approximations of $k$-cliques for low arboricity graphs

Authors: Talya Eden, Dana Ron, C. Seshadhri

Abstract: Given query access to an undirected graph $G$, we consider the problem of computing a $(1\pmε)$-approximation of the number of $k$-cliques in $G$. The standard query model for general graphs allows for degree queries, neighbor queries, and pair queries. Let $n$ be the number of vertices, $m$ be the number of edges, and $n_k$ be the number of $k$-cliques. Previous work by Eden, Ron and Seshadhri (STOC 2018) gives an $O^*(\frac{n}{n^{1/k}_k} + \frac{m^{k/2}}{n_k})$-time algorithm for this problem (we use $O^*(\cdot)$ to suppress $\text{poly}(\log n, 1/ε, k^k)$ dependencies). Moreover, this bound is nearly optimal when the expression is sublinear in the size of the graph. Our motivation is to circumvent this lower bound, by parameterizing the complexity in terms of \emph{graph arboricity}. The arboricity of $G$ is a measure for the graph density "everywhere". We design an algorithm for the class of graphs with arboricity at most $α$, whose running time is $O^*(\min\{\frac{nα^{k-1}}{n_k},\, \frac{n}{n_k^{1/k}}+\frac{m α^{k-2}}{n_k} \})$. We also prove a nearly matching lower bound. For all graphs, the arboricity is $O(\sqrt m)$, so this bound subsumes all previous results on sublinear clique approximation. As a special case of interest, consider minor-closed families of graphs, which have constant arboricity. Our result implies that for any minor-closed family of graphs, there is a $(1\pmε)$-approximation algorithm for $n_k$ that has running time $O^*(\frac{n}{n_k})$. Such a bound was not known even for the special (classic) case of triangle counting in planar graphs.

Submitted 11 November, 2018; originally announced November 2018.

arXiv:1811.01427 [pdf, other]

Domain Reduction for Monotonicity Testing: A $o(d)$ Tester for Boolean Functions in $d$-Dimensions

Authors: Hadley Black, Deeparnab Chakrabarty, C. Seshadhri

Abstract: We describe a $\tilde{O}(d^{5/6})$-query monotonicity tester for Boolean functions $f:[n]^d \to \{0,1\}$ on the $n$-hypergrid. This is the first $o(d)$ monotonicity tester with query complexity independent of $n$. Motivated by this independence of $n$, we initiate the study of monotonicity testing of measurable Boolean functions $f:\mathbb{R}^d \to \{0,1\}$ over the continuous domain, where the distance is measured with respect to a product distribution over $\mathbb{R}^d$. We give a $\tilde{O}(d^{5/6})$-query monotonicity tester for such functions. Our main technical result is a domain reduction theorem for monotonicity. For any function $f:[n]^d \to \{0,1\}$, let $ε_f$ be its distance to monotonicity. Consider the restriction $\hat{f}$ of the function on a random $[k]^d$ sub-hypergrid of the original domain. We show that for $k = \text{poly}(d/ε)$, the expected distance of the restriction is $\mathbb{E}[ε_{\hat{f}}] = Ω(ε_f)$. Previously, such a result was only known for $d=1$ (Berman-Raskhodnikova-Yaroslavtsev, STOC 2014). Our result for testing Boolean functions over $[n]^d$ then follows by applying the $d^{5/6}\cdot \text{poly}(1/ε,\log n, \log d)$-query hypergrid tester of Black-Chakrabarty-Seshadhri (SODA 2018). To obtain the result for testing Boolean functions over $\mathbb{R}^d$, we use standard measure theoretic tools to reduce monotonicity testing of a measurable function $f$ to monotonicity testing of a discretized version of $f$ over a hypergrid domain $[N]^d$ for large, but finite, $N$ (that may depend on $f$). The independence of $N$ in the hypergrid tester is crucial to getting the final tester over $\mathbb{R}^d$.

Submitted 9 December, 2019; v1 submitted 4 November, 2018; originally announced November 2018.

arXiv:1805.08187 [pdf, ps, other]

Finding forbidden minors in sublinear time: a $n^{1/2+o(1)}$-query one-sided tester for minor closed properties on bounded degree graphs

Authors: Akash Kumar, C. Seshadhri, Andrew Stolman

Abstract: Let $G$ be an undirected, bounded degree graph with $n$ vertices. Fix a finite graph $H$, and suppose one must remove $\varepsilon n$ edges from $G$ to make it $H$-minor free (for some small constant $\varepsilon > 0$). We give an $n^{1/2+o(1)}$-time randomized procedure that, with high probability, finds an $H$-minor in such a graph. As an application, suppose one must remove $\varepsilon n$ edges from a bounded degree graph $G$ to make it planar. This result implies an algorithm, with the same running time, that produces a $K_{3,3}$ or $K_5$ minor in $G$. No prior sublinear time bound was known for this problem. By the graph minor theorem, we get an analogous result for any minor-closed property. Up to $n^{o(1)}$ factors, this resolves a conjecture of Benjamini-Schramm-Shapira (STOC 2008) on the existence of one-sided property testers for minor-closed properties. Furthermore, our algorithm is nearly optimal, by an $Ω(\sqrt{n})$ lower bound of Czumaj et al (RSA 2014). Prior to this work, the only graphs $H$ for which non-trivial one-sided property testers were known for $H$-minor freeness are the following: $H$ being a forest or a cycle (Czumaj et al, RSA 2014), $K_{2,k}$, $(k\times 2)$-grid, and the $k$-circus (Fichtenberger et al, Arxiv 2017).

Submitted 27 August, 2018; v1 submitted 21 May, 2018; originally announced May 2018.

Comments: 31 pages

arXiv:1804.07431 [pdf, other]

Finding Cliques in Social Networks: A New Distribution-Free Model

Authors: Jacob Fox, Tim Roughgarden, C. Seshadhri, Fan Wei, Nicole Wein

Abstract: We propose a new distribution-free model of social networks. Our definitions are motivated by one of the most universal signatures of social networks, triadic closure---the property that pairs of vertices with common neighbors tend to be adjacent. Our most basic definition is that of a "$c$-closed" graph, where for every pair of vertices $u,v$ with at least $c$ common neighbors, $u$ and $v$ are adjacent. We study the classic problem of enumerating all maximal cliques, an important task in social network analysis. We prove that this problem is fixed-parameter tractable with respect to $c$ on $c$-closed graphs. Our results carry over to "weakly $c$-closed graphs", which only require a vertex deletion ordering that avoids pairs of non-adjacent vertices with $c$ common neighbors. Numerical experiments show that well-studied social networks tend to be weakly $c$-closed for modest values of $c$.
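
The "$c$-closed" definition in the abstract above can be checked directly on small graphs. The following is a minimal brute-force sketch, assuming an undirected graph stored as a dict of neighbor sets; the function names `is_c_closed` and `closure_number` are hypothetical and chosen for this illustration, and the code is not the clique-enumeration algorithm of the paper. It examines every vertex pair, so it is only suitable for small inputs.

```python
from itertools import combinations


def is_c_closed(adj, c):
    """True if every pair of distinct vertices with >= c common neighbors is adjacent.

    adj: dict mapping each vertex to the set of its neighbors
         (hypothetical representation chosen for this sketch).
    """
    for u, v in combinations(adj, 2):
        if v not in adj[u] and len(adj[u] & adj[v]) >= c:
            return False  # a non-adjacent pair with too many common neighbors
    return True


def closure_number(adj):
    """Smallest c for which the graph is c-closed:
    one more than the largest common-neighborhood of a non-adjacent pair."""
    worst = 0
    for u, v in combinations(adj, 2):
        if v not in adj[u]:
            worst = max(worst, len(adj[u] & adj[v]))
    return worst + 1


# In a 4-cycle, opposite vertices are non-adjacent and share 2 neighbors,
# so the graph is 3-closed but not 2-closed.
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(is_c_closed(cycle4, 2), is_c_closed(cycle4, 3), closure_number(cycle4))
```

Smaller values of `closure_number` correspond to stronger triadic closure; a complete graph has no non-adjacent pairs and is therefore 1-closed under this definition.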

Submitted 19 April, 2018; originally announced April 2018.

Comments: main text 13 pages; 2 figures; appendix 9 pages

MSC Class: 68W01; 68R10; 05C85; 05D99

arXiv:1801.02816 [pdf, ps, other]

Adaptive Boolean Monotonicity Testing in Total Influence Time

Authors: Deeparnab Chakrabarty, C. Seshadhri

Abstract: The problem of testing monotonicity of a Boolean function $f:\{0,1\}^n \to \{0,1\}$ has received much attention recently. Denoting the proximity parameter by $\varepsilon$, the best tester is the non-adaptive $\widetilde{O}(\sqrt{n}/\varepsilon^2)$ tester of Khot-Minzer-Safra (FOCS 2015). Let $I(f)$ denote the total influence of $f$. We give an adaptive tester whose running time is $I(f)\cdot\text{poly}(\varepsilon^{-1}\log n)$.

Submitted 9 January, 2018; originally announced January 2018.

arXiv:1710.10545 [pdf, other]

A $o(d) \cdot \text{polylog}~n$ Monotonicity Tester for Boolean Functions over the Hypergrid $[n]^d$

Authors: Hadley Black, Deeparnab Chakrabarty, C. Seshadhri

Abstract: We study monotonicity testing of Boolean functions over the hypergrid $[n]^d$ and design a non-adaptive tester with $1$-sided error whose query complexity is $\tilde{O}(d^{5/6})\cdot \text{poly}(\log n,1/ε)$. Prior to our work, the best known testers had query complexity linear in $d$ but independent of $n$. We improve upon these testers as long as $n = 2^{d^{o(1)}}$. To obtain our results, we work with what we call the augmented hypergrid, which adds extra edges to the hypergrid. Our main technical contribution is a Margulis-style isoperimetric result for the augmented hypergrid, and our tester, like previous testers for the hypercube domain, performs directed random walks on this structure.

Submitted 28 October, 2017; originally announced October 2017.

arXiv:1710.08607 [pdf, other]

Provable and practical approximations for the degree distribution using sublinear graph samples

Authors: Talya Eden, Shweta Jain, Ali Pinar, Dana Ron, C. Seshadhri

Abstract: The degree distribution is one of the most fundamental properties used in the analysis of massive graphs. There is a large literature on graph sampling, where the goal is to estimate properties (especially the degree distribution) of a large graph through a small, random sample. The degree distribution estimation poses a significant challenge, due to its heavy-tailed nature and the large variance in degrees. We design a new algorithm, SADDLES, for this problem, using recent mathematical techniques from the field of sublinear algorithms. The SADDLES algorithm gives provably accurate outputs for all values of the degree distribution. For the analysis, we define two fatness measures of the degree distribution, called the $h$-index and the $z$-index. We prove that SADDLES is sublinear in the graph size when these indices are large. A corollary of this result is a provably sublinear algorithm for any degree distribution bounded below by a power law. We deploy our new algorithm on a variety of real datasets and demonstrate its excellent empirical behavior. In all instances, we get extremely accurate approximations for all values in the degree distribution by observing at most $1\%$ of the vertices. This is a major improvement over the state-of-the-art sampling algorithms, which typically sample more than $10\%$ of the vertices to give comparable results. We also observe that the $h$ and $z$-indices of real graphs are large, validating our theoretical analysis.

Submitted 28 August, 2018; v1 submitted 24 October, 2017; originally announced October 2017.

Comments: Longer version of the WWW 2018 submission

arXiv:1707.04858 [pdf, ps, other]

On Approximating the Number of $k$-cliques in Sublinear Time

Authors: Talya Eden, Dana Ron, C. Seshadhri

Abstract: We study the problem of approximating the number of $k$-cliques in a graph when given query access to the graph. We consider the standard query model for general graphs via (1) degree queries, (2) neighbor queries and (3) pair queries. Let $n$ denote the number of vertices in the graph, $m$ the number of edges, and $C_k$ the number of $k$-cliques. We design an algorithm that outputs a $(1+\varepsilon)$-approximation (with high probability) for $C_k$, whose expected query complexity and running time are $O\left(\frac{n}{C_k^{1/k}}+\frac{m^{k/2}}{C_k}\right)\mathrm{poly}(\log n,1/\varepsilon,k)$. Hence, the complexity of the algorithm is sublinear in the size of the graph for $C_k = ω(m^{k/2-1})$. Furthermore, we prove a lower bound showing that the query complexity of our algorithm is essentially optimal (up to the dependence on $\log n$, $1/\varepsilon$ and $k$). The previous results in this vein are by Feige (SICOMP 06) and by Goldreich and Ron (RSA 08) for edge counting ($k=2$) and by Eden et al. (FOCS 2015) for triangle counting ($k=3$). Our result matches the complexities of these results. The previous result by Eden et al. hinges on a certain amortization technique that works only for triangle counting, and does not generalize for larger cliques. We obtain a general algorithm that works for any $k\geq 3$ by designing a procedure that samples each $k$-clique incident to a given set $S$ of vertices with approximately equal probability.
The primary difficulty is in finding cliques incident to purely high-degree vertices, since random sampling within neighbors has a low success probability. This is achieved by an algorithm that samples uniform random high degree vertices and a careful tradeoff between estimating cliques incident purely to high-degree vertices and those that include a low-degree vertex.

Submitted 12 March, 2018; v1 submitted 16 July, 2017; originally announced July 2017.

arXiv:1706.00053 [pdf, ps, other]

A Lower Bound for Nonadaptive, One-Sided Error Testing of Unateness of Boolean Functions over the Hypercube

Authors: Roksana Baleshzar, Deeparnab Chakrabarty, Ramesh Krishnan S. Pallavoor, Sofya Raskhodnikova, C. Seshadhri

Abstract: A Boolean function $f:\{0,1\}^d \mapsto \{0,1\}$ is unate if, along each coordinate, the function is either nondecreasing or nonincreasing. In this note, we prove that any nonadaptive, one-sided error unateness tester must make $Ω(\frac{d}{\log d})$ queries. This result improves upon the $Ω(\frac{d}{\log^2 d})$ lower bound for the same class of testers due to Chen et al. (STOC, 2017).

Submitted 31 May, 2017; originally announced June 2017.

arXiv:1704.00386 [pdf, other]

doi 10.14778/3275536.3275540

Local Algorithms for Hierarchical Dense Subgraph Discovery

Authors: Ahmet Erdem Sariyuce, C. Seshadhri, Ali Pinar

Abstract: Finding the dense regions of a graph and relations among them is a fundamental problem in network analysis. Core and truss decompositions reveal dense subgraphs with hierarchical relations. The incremental nature of algorithms for computing these decompositions and the need for global information at each step of the algorithm hinder scalable parallelization and approximations, since the densest regions are not revealed until the end. In a previous work, Lu et al. proposed to iteratively compute the $h$-indices of neighbor vertex degrees to obtain the core numbers and proved that convergence is obtained after a finite number of iterations. This work generalizes the iterative $h$-index computation for truss decomposition as well as nucleus decomposition, which leverages higher-order structures to generalize core and truss decompositions. In addition, we prove convergence bounds on the number of iterations. We present a framework of local algorithms to obtain the core, truss, and nucleus decompositions. Our algorithms are local, parallel, offer high scalability, and enable approximations to explore time and quality trade-offs. Our shared-memory implementation verifies the efficiency, scalability, and effectiveness of our local algorithms on real-world networks.

Submitted 14 September, 2018; v1 submitted 2 April, 2017; originally announced April 2017.
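
The iterative $h$-index computation for core numbers that the abstract above attributes to Lu et al. can be illustrated with a minimal Python sketch. This is not the paper's implementation and does not cover the truss or nucleus generalizations; the function names `h_index` and `core_numbers_by_h_index`, the adjacency-dictionary input, and the synchronous update order are all assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, List


def h_index(values: Iterable[int]) -> int:
    """Return the largest h such that at least h of the values are >= h."""
    ranked = sorted(values, reverse=True)
    h = 0
    for rank, value in enumerate(ranked, start=1):
        if value >= rank:
            h = rank
        else:
            break
    return h


def core_numbers_by_h_index(
    adj: Dict[Hashable, List[Hashable]],
) -> Dict[Hashable, int]:
    """Iterate h-indices of neighbor values, starting from the degrees.

    Each round only reads the values of a vertex's neighbors, which is
    what makes the update local. The loop stops at a fixed point.
    (Hypothetical helper; assumes an undirected simple graph given as
    a symmetric adjacency dictionary.)
    """
    current = {v: len(nbrs) for v, nbrs in adj.items()}
    while True:
        updated = {v: h_index(current[u] for u in adj[v]) for v in adj}
        if updated == current:
            return current
        current = updated


# A triangle (0, 1, 2) with one pendant vertex 3 attached to vertex 0.
example_graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
print(core_numbers_by_h_index(example_graph))
```

On this small graph the triangle vertices settle at value 2 and the pendant vertex at value 1, which are their core numbers.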

arXiv:1703.05199 [pdf, ps, other]

Optimal Unateness Testers for Real-Valued Functions: Adaptivity Helps

Authors: Roksana Baleshzar, Deeparnab Chakrabarty, Ramesh Krishnan S. Pallavoor, Sofya Raskhodnikova, C. Seshadhri

Abstract: We study the problem of testing unateness of functions $f:\{0,1\}^d \to \mathbb{R}.$ We give a $O(\frac{d}ε \cdot \log\frac{d}ε)$-query nonadaptive tester and a $O(\frac{d}ε)$-query adaptive tester and show that both testers are optimal for a fixed distance parameter $ε$. Previously known unateness testers worked only for Boolean functions, and their query complexity had worse dependence on the dimension both for the adaptive and the nonadaptive case. Moreover, no lower bounds for testing unateness were known. We also generalize our results to obtain optimal unateness testers for functions $f:[n]^d \to \mathbb{R}$. Our results establish that adaptivity helps with testing unateness of real-valued functions on domains of the form $\{0,1\}^d$ and, more generally, $[n]^d$. This stands in contrast to the situation for monotonicity testing where there is no adaptivity gap for functions $f:[n]^d \to \mathbb{R}$.

Submitted 15 March, 2017; originally announced March 2017.

arXiv:1703.01054 [pdf, other]

doi 10.1145/3038912.3052633

When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors

Authors: Aneesh Sharma, C. Seshadhri, Ashish Goel

Abstract: Finding similar user pairs is a fundamental task in social networks, with numerous applications in ranking and personalization tasks such as link prediction and tie strength detection. A common manifestation of user similarity is based upon network structure: each user is represented by a vector that represents the user's network connections, where pairwise cosine similarity among these vectors defines user similarity. The predominant task for user similarity applications is to discover all similar pairs that have a pairwise cosine similarity value larger than a given threshold $τ$. In contrast to previous work where $τ$ is assumed to be quite close to 1, we focus on recommendation applications where $τ$ is small, but still meaningful. The all pairs cosine similarity problem is computationally challenging on networks with billions of edges, and especially so for settings with small $τ$. To the best of our knowledge, there is no practical solution for computing all user pairs with, say $τ= 0.2$ on large social networks, even using the power of distributed algorithms. Our work directly addresses this challenge by introducing a new algorithm, WHIMP, that solves this problem efficiently in the MapReduce model. The key insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for approximate matrix multiplication with the SimHash random projection techniques of Charikar.
We provide a theoretical analysis of WHIMP, proving that it has near optimal communication costs while maintaining computation cost comparable with the state of the art. We also empirically demonstrate WHIMP's scalability by computing all highly similar pairs on four massive data sets, and show that it accurately finds high similarity pairs. In particular, we note that WHIMP successfully processes the entire Twitter network, which has tens of billions of edges.

Submitted 3 March, 2017; originally announced March 2017.
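
One of the two ingredients named in the abstract above, the SimHash random projection technique of Charikar, can be sketched in a few lines of Python. This is only an illustration of SimHash on its own: it does not include wedge sampling or the MapReduce implementation, and it is not WHIMP. The function names, the default of 256 bits, and the seeded generator are assumptions chosen for the example.

```python
import math
import random
from typing import List, Sequence


def simhash_bits(
    vector: Sequence[float], hyperplanes: List[List[float]]
) -> List[bool]:
    """One bit per random hyperplane: which side the vector falls on."""
    return [
        sum(h_i * x_i for h_i, x_i in zip(h, vector)) >= 0.0
        for h in hyperplanes
    ]


def estimate_cosine(
    u: Sequence[float], v: Sequence[float], num_bits: int = 256, seed: int = 0
) -> float:
    """Estimate cos(angle(u, v)) from the fraction of differing bits.

    For a random Gaussian hyperplane, the two bits differ with
    probability angle / pi, so the disagreement rate estimates the angle.
    (Hypothetical helper; both vectors must have the same length.)
    """
    rng = random.Random(seed)
    hyperplanes = [[rng.gauss(0.0, 1.0) for _ in u] for _ in range(num_bits)]
    bits_u = simhash_bits(u, hyperplanes)
    bits_v = simhash_bits(v, hyperplanes)
    disagreements = sum(a != b for a, b in zip(bits_u, bits_v))
    angle = math.pi * disagreements / num_bits
    return math.cos(angle)


print(estimate_cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(estimate_cosine([1.0, 0.0], [0.0, 1.0]))
```

Identical vectors always agree on every bit, so the first estimate is exactly 1; the second is a noisy estimate of 0 whose accuracy improves as `num_bits` grows.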

arXiv:1611.05561 [pdf, other]

A Fast and Provable Method for Estimating Clique Counts Using Turán's Theorem

Authors: Shweta Jain, C. Seshadhri

Abstract: Clique counts reveal important properties about the structure of massive graphs, especially social networks. The simple setting of just 3-cliques (triangles) has received much attention from the research community. For larger cliques (even, say 6-cliques) the problem quickly becomes intractable because of combinatorial explosion. Most methods used for triangle counting do not scale for large cliques, and existing algorithms require massive parallelism to be feasible. We present a new randomized algorithm that provably approximates the number of k-cliques, for any constant k. The key insight is the use of (strengthenings of) the classic Turán's theorem: this claims that if the edge density of a graph is sufficiently high, the k-clique density must be non-trivial. We define a combinatorial structure called a Turán shadow, the construction of which leads to fast algorithms for clique counting. We design a practical heuristic, called TURÁN-SHADOW, based on this theoretical algorithm, and test it on a large class of test graphs. In all cases, TURÁN-SHADOW has less than 2% error, in a fraction of the time used by well-tuned exact algorithms. We do detailed comparisons with a range of other sampling algorithms, and find that TURÁN-SHADOW is generally much faster and more accurate. For example, TURÁN-SHADOW estimates all clique numbers up to size 10 in a social network with over a hundred million edges. This is done in less than three hours on a single commodity machine.

Submitted 28 August, 2018; v1 submitted 16 November, 2016; originally announced November 2016.

Comments: Added a link to the code

arXiv:1610.09411 [pdf, other]

ESCAPE: Efficiently Counting All 5-Vertex Subgraphs

Authors: Ali Pinar, C. Seshadhri, V. Vishal

Abstract: Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex or 5-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. We introduce an algorithmic framework that can be adopted to count any small pattern in a graph and apply this framework to compute exact counts for \emph{all} 5-vertex subgraphs. Our framework is built on cutting a pattern into smaller ones, and using counts of smaller patterns to get larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them have fewer than 5 vertices) to exactly count all 5-vertex patterns. We perform extensive empirical experiments on a variety of real-world graphs. We are able to compute counts of graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for $5$-vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all $4$-vertex patterns. This algorithm is typically ten times faster than the state of the art $4$-vertex counters.

Submitted 28 October, 2016; originally announced October 2016.

arXiv:1608.06980 [pdf, ps, other]

A $\widetilde{O}(n)$ Non-Adaptive Tester for Unateness

Authors: Deeparnab Chakrabarty, C. Seshadhri

Abstract: Khot and Shinkar (RANDOM, 2016) recently described an adaptive, $O(n \log(n)/\varepsilon)$-query tester for unateness of Boolean functions $f:\{0,1\}^n \to \{0,1\}$. In this note we describe a simple non-adaptive, $O(n \log(n/\varepsilon)/\varepsilon)$-query tester for unateness for functions over the hypercube with any ordered range.

Submitted 2 September, 2016; v1 submitted 24 August, 2016; originally announced August 2016.

Comments: We mention the relation of our algorithm to Levin's investment strategy, as pointed out by Oded Goldreich

arXiv:1604.03661 [pdf, ps, other]

Sublinear Time Estimation of Degree Distribution Moments: The Degeneracy Connection

Authors: Talya Eden, Dana Ron, C. Seshadhri

Abstract: We revisit the classic problem of estimating the degree distribution moments of an undirected graph. Consider an undirected graph $G=(V,E)$ with $n$ vertices, and define (for $s > 0$) $μ_s = \frac{1}{n}\cdot\sum_{v \in V} d^s_v$. Our aim is to estimate $μ_s$ within a multiplicative error of $(1+ε)$ (for a given approximation parameter $ε>0$) in sublinear time. We consider the sparse graph model that allows access to: uniform random vertices, queries for the degree of any vertex, and queries for a neighbor of any vertex. For the case of $s=1$ (the average degree), $\widetilde{O}(\sqrt{n})$ queries suffice for any constant $ε$ (Feige, SICOMP 06 and Goldreich-Ron, RSA 08). Gonen-Ron-Shavitt (SIDMA 11) extended this result to all integral $s > 0$, by designing an algorithm that performs $\widetilde{O}(n^{1-1/(s+1)})$ queries. We design a new, significantly simpler algorithm for this problem. In the worst-case, it exactly matches the bounds of Gonen-Ron-Shavitt, and has a much simpler proof. More importantly, the running time of this algorithm is connected to the degeneracy of $G$. This is (essentially) the maximum density of an induced subgraph. For the family of graphs with degeneracy at most $α$, it has a query complexity of $\widetilde{O}\left(\frac{n^{1-1/s}}{μ^{1/s}_s} \Big(α^{1/s} + \min\{α,μ^{1/s}_s\}\Big)\right) = \widetilde{O}(n^{1-1/s}α/μ^{1/s}_s)$.
Thus, for the class of bounded degeneracy graphs (which includes all minor closed families and preferential attachment graphs), we can estimate the average degree in $\widetilde{O}(1)$ queries, and can estimate the variance of the degree distribution in $\widetilde{O}(\sqrt{n})$ queries. This is a major improvement over the previous worst-case bounds. Our key insight is in designing an estimator for $μ_s$ that has low variance when $G$ does not have large dense subgraphs.

Submitted 16 February, 2017; v1 submitted 13 April, 2016; originally announced April 2016.

arXiv:1506.08258 [pdf, other]

Trigger detection for adaptive scientific workflows using percentile sampling

Authors: Janine C. Bennett, Ankit Bhagatwala, Jacqueline H. Chen, C. Seshadhri, Ali Pinar, Maher Salloum

Abstract: The increasing complexity of scientific simulations and HPC architectures is driving the need for adaptive workflows, where the composition and execution of computational and data manipulation steps dynamically depend on the evolutionary state of the simulation itself. Consider, for example, the frequency of data storage. Critical phases of the simulation should be captured with high frequency and with high fidelity for post-analysis; however, we cannot afford to retain the same frequency for the full simulation due to the high cost of data movement. We can instead look for triggers, indicators that the simulation will be entering a critical phase, and adapt the workflow accordingly. We present a method for detecting triggers and demonstrate its use in direct numerical simulations of turbulent combustion using S3D. We show that chemical explosive mode analysis (CEMA) can be used to devise a noise-tolerant indicator for rapid increase in heat release. However, exhaustive computation of CEMA values dominates the total simulation time and is thus prohibitively expensive. To overcome this bottleneck, we propose a quantile-sampling approach. Our algorithm comes with provable error/confidence bounds, as a function of the number of samples. Most importantly, the number of samples is independent of the problem size, thus our proposed algorithm offers perfect scalability.
Our experiments on homogeneous charge compression ignition (HCCI) and reactivity controlled compression ignition (RCCI) simulations show that the proposed method can detect rapid increases in heat release, and its computational overhead is negligible. Our results will be used for dynamic workflow decisions about data storage and mesh resolution in future combustion simulations. The proposed framework is generalizable and we detail how it could be applied to a broad class of scientific simulation workflows.

Submitted 27 June, 2015; originally announced June 2015.
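
The general idea behind a quantile-sampling approach, as mentioned in the abstract above, is that a percentile can be approximated from a uniform random sample whose size does not grow with the data. The Python sketch below illustrates only that generic idea; it is not the paper's algorithm, it carries none of its error/confidence bounds, and the function name `sampled_quantile`, its parameters, and the synthetic data are assumptions made for the example.

```python
import random
from typing import Sequence


def sampled_quantile(
    values: Sequence[float], q: float, num_samples: int, seed: int = 0
) -> float:
    """Estimate the q-quantile of `values` from a uniform random sample.

    Only `num_samples` entries are read (sampling with replacement),
    so the cost does not depend on len(values).
    (Hypothetical helper; requires 0 <= q <= 1 and num_samples >= 1.)
    """
    rng = random.Random(seed)
    sample = sorted(rng.choice(values) for _ in range(num_samples))
    index = min(int(q * num_samples), num_samples - 1)
    return sample[index]


# Synthetic stand-in for per-cell indicator values in a simulation.
data = list(range(100_000))
estimate = sampled_quantile(data, q=0.9, num_samples=2_000)
print(estimate)
```

With 2,000 samples the estimated rank of the 0.9-quantile typically lands within about one percentage point of the true rank, regardless of whether the data holds a hundred thousand values or a hundred million.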

arXiv:1506.03872 [pdf, other]

doi 10.1109/ICDM.2015.46

Diamond Sampling for Approximate Maximum All-pairs Dot-product (MAD) Search

Authors: Grey Ballard, Ali Pinar, Tamara G. Kolda, C. Seshadhri

Abstract: Given two sets of vectors, $A = \{{a_1}, \dots, {a_m}\}$ and $B=\{{b_1},\dots,{b_n}\}$, our problem is to find the top-$t$ dot products, i.e., the largest $|{a_i}\cdot{b_j}|$ among all possible pairs. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all $mn$ dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of $A$ and $B$. The probability of selecting a diamond corresponding to pair $(i,j)$ is proportional to $({a_i}\cdot{b_j})^2$, amplifying the focus on the largest-magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods.

Submitted 18 June, 2015; v1 submitted 11 June, 2015; originally announced June 2015.

Journal ref: ICDM 2015: Proceedings of the 2015 IEEE International Conference on Data Mining, pp. 11-20, November 2015

arXiv:1506.02574 [pdf, other]

Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution

Authors: Olivia Simpson, C. Seshadhri, Andrew McGregor

Abstract: The degree distribution is one of the most fundamental graph properties of interest for real-world graphs. It has been widely observed in numerous domains that graphs typically have a tailed or scale-free degree distribution. While the average degree is usually quite small, the variance is quite high and there are vertices with degrees at all scales. We focus on the problem of approximating the degree distribution of a large streaming graph, with small storage. We design an algorithm headtail, whose main novelty is a new estimator of infrequent degrees using truncated geometric random variables. We give a mathematical analysis of headtail and show that it has excellent behavior in practice. We can process streams with millions of edges with storage less than 1% and get extremely accurate approximations for all scales in the degree distribution. We also introduce a new notion of Relative Hausdorff distance between tailed histograms. Existing notions of distances between distributions are not suitable, since they ignore infrequent degrees in the tail. The Relative Hausdorff distance measures deviations at all scales, and is a more suitable distance for comparing degree distributions. By tracking this new measure, we are able to give strong empirical evidence of the convergence of headtail.

Submitted 25 November, 2015; v1 submitted 8 June, 2015; originally announced June 2015.

arXiv:1505.01927 [pdf, ps, other]

A simpler sublinear algorithm for approximating the triangle count

Authors: C. Seshadhri

Abstract: A recent result of Eden, Levi, and Ron (ECCC 2015) provides a sublinear time algorithm to estimate the number of triangles in a graph. Given an undirected graph $G$, one can query the degree of a vertex, the existence of an edge between vertices, and the $i$th neighbor of a vertex. Suppose the graph has $n$ vertices, $m$ edges, and $t$ triangles. In this model, Eden et al. provided a $O(\mathrm{poly}(\varepsilon^{-1}\log n)(n/t^{1/3} + m^{3/2}/t))$ time algorithm to get a $(1+\varepsilon)$-multiplicative approximation for $t$, the triangle count. This paper provides a simpler algorithm with the same running time (up to differences in the $\mathrm{poly}(\varepsilon^{-1}\log n)$ factor) that has a substantially simpler analysis.

Submitted 8 May, 2015; originally announced May 2015.

arXiv:1504.00954 [pdf, ps, other]

doi 10.1109/FOCS.2015.44

Approximately Counting Triangles in Sublinear Time

Authors: Talya Eden, Amit Levi, Dana Ron, C. Seshadhri

Abstract: We consider the problem of estimating the number of triangles in a graph. This problem has been extensively studied in both theory and practice, but all existing algorithms read the entire graph. In this work we design a {\em sublinear-time\/} algorithm for approximating the number of triangles in a graph, where the algorithm is given query access to the graph. The allowed queries are degree queries, vertex-pair queries and neighbor queries. We show that for any given approximation parameter $0<ε<1$, the algorithm provides an estimate $\widehat{t}$ such that with high constant probability, $(1-ε)\cdot t< \widehat{t}<(1+ε)\cdot t$, where $t$ is the number of triangles in the graph $G$. The expected query complexity of the algorithm is $\!\left(\frac{n}{t^{1/3}} + \min\left\{m, \frac{m^{3/2}}{t}\right\}\right)\cdot {\rm poly}(\log n, 1/ε)$, where $n$ is the number of vertices in the graph and $m$ is the number of edges, and the expected running time is $\!\left(\frac{n}{t^{1/3}} + \frac{m^{3/2}}{t}\right)\cdot {\rm poly}(\log n, 1/ε)$. We also prove that $Ω\!\left(\frac{n}{t^{1/3}} + \min\left\{m, \frac{m^{3/2}}{t}\right\}\right)$ queries are necessary, thus establishing that the query complexity of this algorithm is optimal up to polylogarithmic factors in $n$ (and the dependence on $1/ε$).

Submitted 22 September, 2015; v1 submitted 3 April, 2015; originally announced April 2015.

Comments: To appear in the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2015)

arXiv:1411.4942 [pdf, other]

Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts

Authors: Madhav Jha, C. Seshadhri, Ali Pinar

Abstract: Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs on truly massive graphs employs clusters and massive parallelization. We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4-vertex pattern subgraphs. Our algorithm is based on a novel technique of 3-path sampling and a special pruning scheme to decrease the variance in estimates. We provide theoretical proofs for the accuracy of our algorithm, and give formal bounds for the error and confidence of our estimates. We perform a detailed empirical study and show that our algorithm provides estimates within 1% relative error for all subpatterns (over a large class of test graphs), while being orders of magnitude faster than enumeration and other sampling based algorithms. Our algorithm takes less than a minute (on a single commodity machine) to process an Orkut social network with 300 million edges.

Submitted 18 November, 2014; originally announced November 2014.

arXiv:1411.3312 [pdf, other]

Finding the Hierarchy of Dense Subgraphs using Nucleus Decompositions

Authors: Ahmet Erdem Sariyuce, C. Seshadhri, Ali Pinar, Umit V. Catalyurek

Abstract: Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasiclique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which enables discovering overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-truss decompositions. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.

Submitted 9 March, 2015; v1 submitted 12 November, 2014; originally announced November 2014.

arXiv:1411.2689 [pdf, other]

Avoiding the Global Sort: A Faster Contour Tree Algorithm

Authors: Benjamin Raichel, C. Seshadhri

Abstract: We revisit the classical problem of computing the \emph{contour tree} of a scalar field $f:\mathbb{M} \to \mathbb{R}$, where $\mathbb{M}$ is a triangulated simplicial mesh in $\mathbb{R}^d$. The contour tree is a fundamental topological structure that tracks the evolution of level sets of $f$ and has numerous applications in data analysis and visualization. All existing algorithms begin with a global sort of at least all critical values of $f$, which can require (roughly) $\Omega(n\log n)$ time. Existing lower bounds show that there are pathological instances where this sort is required. We present the first algorithm whose time complexity depends on the contour tree structure, and avoids the global sort for non-pathological inputs. If $C$ denotes the set of critical points in $\mathbb{M}$, the running time is roughly $O(\sum_{v \in C} \log \ell_v)$, where $\ell_v$ is the depth of $v$ in the contour tree. This matches all existing upper bounds, but is a significant improvement when the contour tree is short and fat. Specifically, our approach ensures that any comparison made is between nodes in the same descending path in the contour tree, allowing us to argue strong optimality properties of our algorithm. Our algorithm requires several novel ideas: partitioning $\mathbb{M}$ into well-behaved portions, a local growing procedure to iteratively build contour trees, and the use of heavy path decompositions for the time complexity analysis.

Submitted 10 December, 2015; v1 submitted 10 November, 2014; originally announced November 2014.

arXiv:1409.4360 [pdf, other]

doi 10.1103/PhysRevE.94.012301

Characterizing short-term stability for Boolean networks over any distribution of transfer functions

Authors: C. Seshadhri, Andrew M. Smith, Yevgeniy Vorobeychik, Jackson Mayo, Robert C. Armstrong

Abstract: We present a characterization of short-term stability of random Boolean networks under \emph{arbitrary} distributions of transfer functions. Given any distribution of transfer functions for a random Boolean network, we present a formula that decides whether short-term chaos (damage spreading) will happen. We provide a formal proof for this formula, and empirically show that its predictions are accurate. Previous work only works for special cases of balanced families. It has been observed that these characterizations fail for unbalanced families, yet such families are widespread in real biological networks.

Submitted 15 September, 2014; originally announced September 2014.

Journal ref: Phys. Rev. E 94, 012301 (2016)

Showing 1–50 of 85 results for author: Seshadhri, C