Search | arXiv e-print repository

A Sublinear Algorithm for Approximate Shortest Paths in Large Networks

Authors: Sabyasachi Basu, Nadia Kōshima, Talya Eden, Omri Ben-Eliezer, C. Seshadhri

Abstract: Computing distances and finding shortest paths in massive real-world networks is a fundamental algorithmic task in network analysis. There are two main approaches to solving this task. On one hand are traversal-based algorithms like bidirectional breadth-first search (BiBFS) with no preprocessing step and slow individual distance inquiries. On the other hand are indexing-based approaches, which maintain a large index. This allows for answering individual inquiries very fast; however, index creation is prohibitively expensive. We seek to bridge these two extremes: quickly answer distance inquiries without the need for costly preprocessing. In this work, we propose a new algorithm and data structure, WormHole, for approximate shortest path computations. WormHole leverages structural properties of social networks to build a sublinearly sized index, drawing upon the explicit core-periphery decomposition of Ben-Eliezer et al. Empirically, the preprocessing time of WormHole improves upon index-based solutions by orders of magnitude, and individual inquiries are consistently much faster than in BiBFS. The acceleration comes at the cost of a minor accuracy trade-off. Nonetheless, our empirical evidence demonstrates that WormHole accurately answers essentially all inquiries within a maximum additive error of 2. We complement these empirical results with provable theoretical guarantees, showing that WormHole requires $n^{o(1)}$ node queries per distance inquiry in random power-law networks. In contrast, any approach without a preprocessing step requires $n^{Ω(1)}$ queries for the same task. WormHole does not require reading the whole graph. Unlike the vast majority of index-based algorithms, it returns paths, not just distances. For faster inquiry times, it can be combined effectively with other index-based solutions, by running them only on the sublinear core.
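For orientation, the traversal-based baseline named in the abstract above, bidirectional breadth-first search (BiBFS), can be sketched as follows. This is a minimal illustration and not the WormHole algorithm or the authors' implementation; it assumes an unweighted, undirected graph stored as a dictionary of adjacency lists, and the function name `bibfs_distance` is hypothetical.

```python
def bibfs_distance(adj, s, t):
    """Hop distance between s and t in an unweighted, undirected graph.

    adj maps each node to an iterable of its neighbors (an assumed format).
    Returns None when s and t are not connected.
    """
    if s == t:
        return 0
    dist_s, dist_t = {s: 0}, {t: 0}
    frontier_s, frontier_t = [s], [t]
    while frontier_s and frontier_t:
        # Always expand the smaller frontier, one full BFS level at a time.
        grow_s = len(frontier_s) <= len(frontier_t)
        frontier, dist, other = (
            (frontier_s, dist_s, dist_t) if grow_s else (frontier_t, dist_t, dist_s)
        )
        best = None
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v in other:
                    # The two searches meet on edge (u, v).
                    cand = dist[u] + 1 + other[v]
                    if best is None or cand < best:
                        best = cand
                elif v not in dist:
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        # Taking the minimum over the whole level keeps the answer exact.
        if best is not None:
            return best
        if grow_s:
            frontier_s = next_frontier
        else:
            frontier_t = next_frontier
    return None
```

Each inquiry starts from scratch with no index, which is the "no preprocessing, slow inquiry" trade-off the abstract contrasts with indexing-based approaches.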

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07024 [pdf, other]

Plant-and-Steal: Truthful Fair Allocations via Predictions

Authors: Ilan Reuven Cohen, Alon Eden, Talya Eden, Arsen Vasilyan

Abstract: We study truthful mechanisms for approximating the Maximin-Share (MMS) allocation of agents with additive valuations for indivisible goods. Algorithmically, constant factor approximations exist for the problem for any number of agents. When adding incentives to the mix, a jarring result by Amanatidis, Birmpas, Christodoulou, and Markakis [EC 2017] shows that the best possible approximation for two agents and $m$ items is $\lfloor \frac{m}{2} \rfloor$. We adopt a learning-augmented framework to investigate what is possible when some prediction on the input is given. For two agents, we give a truthful mechanism that takes agents' ordering over items as prediction. When the prediction is accurate, we give a $2$-approximation to the MMS (consistency), and when the prediction is off, we still get an $\lceil \frac{m}{2} \rceil$-approximation to the MMS (robustness). We further show that the mechanism's performance degrades gracefully in the number of "mistakes" in the prediction; i.e., we interpolate (up to constant factors) between the two extremes: when there are no mistakes, and when there is a maximum number of mistakes. We also show an impossibility result on the obtainable consistency for mechanisms with finite robustness. For the general case of $n\ge 2$ agents, we give a 2-approximation mechanism for accurate predictions, with relaxed fallback guarantees. Finally, we give experimental results which illustrate when different components of our framework, made to ensure consistency and robustness, come into play.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2404.18126 [pdf, other]

Testing $C_k$-freeness in bounded-arboricity graphs

Authors: Talya Eden, Reut Levi, Dana Ron

Abstract: We study the problem of testing $C_k$-freeness ($k$-cycle-freeness) for fixed constant $k > 3$ in graphs with bounded arboricity (but unbounded degrees). In particular, we are interested in one-sided error algorithms, so that they must detect a copy of $C_k$ with high constant probability when the graph is $ε$-far from $C_k$-free. We next state our results for constant arboricity and constant $ε$ with a focus on the dependence on the number of graph vertices, $n$. The query complexity of all our algorithms grows polynomially with $1/ε$. (1) As opposed to the case of $k=3$, where the complexity of testing $C_3$-freeness grows with the arboricity of the graph but not with the size of the graph (Levi, ICALP 2021) this is no longer the case already for $k=4$. We show that $Ω(n^{1/4})$ queries are necessary for testing $C_4$-freeness, and that $\widetilde{O}(n^{1/4})$ are sufficient. The same bounds hold for $C_5$. (2) For every fixed $k \geq 6$, any one-sided error algorithm for testing $C_k$-freeness must perform $Ω(n^{1/3})$ queries. (3) For $k=6$ we give a testing algorithm whose query complexity is $\widetilde{O}(n^{1/2})$. (4) For any fixed $k$, the query complexity of testing $C_k$-freeness is upper bounded by ${O}(n^{1-1/\lfloor k/2\rfloor})$. Our $Ω(n^{1/4})$ lower bound for testing $C_4$-freeness in constant arboricity graphs provides a negative answer to an open problem posed by (Goldreich, 2021).

Submitted 28 April, 2024; originally announced April 2024.

arXiv:2401.16497 [pdf, ps, other]

A Bayesian Gaussian Process-Based Latent Discriminative Generative Decoder (LDGD) Model for High-Dimensional Data

Authors: Navid Ziaei, Behzad Nazari, Uri T. Eden, Alik Widge, Ali Yousefi

Abstract: Extracting meaningful information from high-dimensional data poses a formidable modeling challenge, particularly when the data is obscured by noise or represented through different modalities. This research proposes a novel non-parametric modeling approach, leveraging the Gaussian process (GP), to characterize high-dimensional data by mapping it to a latent low-dimensional manifold. This model, named the latent discriminative generative decoder (LDGD), employs both the data and associated labels in the manifold discovery process. We derive a Bayesian solution to infer the latent variables, allowing LDGD to effectively capture inherent stochasticity in the data. We demonstrate applications of LDGD on both synthetic and benchmark datasets. Not only does LDGD infer the manifold accurately, but its accuracy in predicting data points' labels surpasses state-of-the-art approaches. In the development of LDGD, we have incorporated inducing points to reduce the computational complexity of Gaussian processes for large datasets, enabling batch training for more efficient processing and scalability. Additionally, we show that LDGD can robustly infer the manifold and precisely predict labels for scenarios in which data size is limited, demonstrating its capability to efficiently characterize high-dimensional data with limited samples. These collective attributes highlight the importance of developing non-parametric modeling approaches to analyze high-dimensional data.

Submitted 7 May, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: 40 pages, 6 figures

ACM Class: I.5.1; G.3

arXiv:2305.02263 [pdf, other]

Triangle Counting with Local Edge Differential Privacy

Authors: Talya Eden, Quanquan C. Liu, Sofya Raskhodnikova, Adam Smith

Abstract: Many deployments of differential privacy in industry are in the local model, where each party releases its private information via a differentially private randomizer. We study triangle counting in the noninteractive and interactive local model with edge differential privacy (that, intuitively, requires that the outputs of the algorithm on graphs that differ in one edge be indistinguishable). In this model, each party's local view consists of the adjacency list of one vertex. In the noninteractive model, we prove that additive $Ω(n^2)$ error is necessary, where $n$ is the number of nodes. This lower bound is our main technical contribution. It uses a reconstruction attack with a new class of linear queries and a novel mix-and-match strategy of running the local randomizers with different completions of their adjacency lists. It matches the additive error of the algorithm based on Randomized Response, proposed by Imola, Murakami and Chaudhuri (USENIX2021) and analyzed by Imola, Murakami and Chaudhuri (CCS2022) for constant $\varepsilon$. We use a different postprocessing of Randomized Response and provide tight bounds on the variance of the resulting algorithm. In the interactive setting, we prove a lower bound of $Ω(n^{3/2})$ on the additive error. Previously, no hardness results were known for interactive, edge-private algorithms in the local model, except for those that follow trivially from the results for the central model. Our work significantly improves on the state of the art in differentially private graph analysis in the local model.
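The Randomized Response primitive mentioned in the abstract above can be illustrated on a single adjacency list. The sketch below shows only the standard per-bit randomizer and its debiasing step, not the triangle-counting algorithms or the postprocessing studied in the paper; the function names and the list-of-bits representation are assumptions made for illustration.

```python
import math
import random


def flip_probability(epsilon):
    """Probability of flipping each bit under epsilon-DP randomized response."""
    return 1.0 / (math.exp(epsilon) + 1.0)


def randomize_adjacency(bits, epsilon, rng):
    """Release a noisy copy of one vertex's adjacency bits (0/1 values).

    Each bit is flipped independently with probability 1 / (e^epsilon + 1).
    """
    p = flip_probability(epsilon)
    return [b ^ (rng.random() < p) for b in bits]


def debias(noisy_bit, epsilon):
    """Unbiased estimate of the original bit from one noisy bit."""
    p = flip_probability(epsilon)
    return (noisy_bit - p) / (1.0 - 2.0 * p)
```

A usage example: `randomize_adjacency([0, 1, 1, 0], 1.0, random.Random(0))` returns a list of four bits, and averaging `debias` over many independent releases of the same bit converges to that bit.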

Submitted 26 September, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

Comments: ICALP 2023; update reference

arXiv:2304.10542 [pdf]

Exploring the visualisation of hierarchical cybersecurity data within the Metaverse

Authors: Terence Eden

Abstract: A prototype Metaverse experience was created in which users could explore hierarchical cybersecurity data. A small group of participants were surveyed on their attitudes to the Metaverse. They then completed a short series of tasks in the environment. Questions were asked to assess if they were suffering from Cybersickness. After completing further tasks, their attitudes were surveyed regarding future uses of the metaverse in the organisation. A second cohort of participants attended an online seminar. They completed a survey about their attitudes to the Metaverse. They then watched a short video of the Metaverse experience. Afterwards, they answered questions related to their attitudes towards future uses of the metaverse in the organisation. The results of these questionnaires were assessed to see whether participants were receptive to the idea of working with data inside the Metaverse in the future.

Submitted 9 April, 2023; originally announced April 2023.

Comments: MSc Dissertation

arXiv:2211.04981 [pdf, other]

Sampling an Edge in Sublinear Time Exactly and Optimally

Authors: Talya Eden, Shyam Narayanan, Jakub Tětek

Abstract: Sampling edges from a graph in sublinear time is a fundamental problem and a powerful subroutine for designing sublinear-time algorithms. Suppose we have access to the vertices of the graph and know a constant-factor approximation to the number of edges. An algorithm for pointwise $\varepsilon$-approximate edge sampling with complexity $O(n/\sqrt{\varepsilon m})$ has been given by Eden and Rosenbaum [SOSA 2018]. This has been later improved by Tětek and Thorup [STOC 2022] to $O(n \log(\varepsilon^{-1})/\sqrt{m})$. At the same time, $Ω(n/\sqrt{m})$ time is necessary. We close the problem, by giving an algorithm with complexity $O(n/\sqrt{m})$ for the task of sampling an edge exactly uniformly.

Submitted 14 November, 2022; v1 submitted 9 November, 2022; originally announced November 2022.

arXiv:2208.12025 [pdf, other]

Integrating Statistical and Machine Learning Approaches to Identify Receptive Field Structure in Neural Populations

Authors: Mehrad Sarmashghi, Shantanu P. Jadhav, Uri T. Eden

Abstract: Neurons can code for multiple variables simultaneously and neuroscientists are often interested in classifying neurons based on their receptive field properties. Statistical models provide powerful tools for determining the factors influencing neural spiking activity and classifying individual neurons. However, as neural recording technologies have advanced to produce simultaneous spiking data from massive populations, classical statistical methods often lack the computational efficiency required to handle such data. Machine learning (ML) approaches are known for enabling efficient large scale data analyses; however, they typically require massive training sets with balanced data, along with accurate labels to fit well. Additionally, model assessment and interpretation are often more challenging for ML than for classical statistical methods. To address these challenges, we develop an integrated framework, combining statistical modeling and machine learning approaches to identify the coding properties of neurons from large populations. In order to demonstrate this framework, we apply these methods to data from a population of neurons recorded from rat hippocampus to characterize the distribution of spatial receptive fields in this region.

Submitted 27 October, 2022; v1 submitted 25 July, 2022; originally announced August 2022.

arXiv:2208.01197 [pdf, other]

Bias Reduction for Sum Estimation

Authors: Talya Eden, Jakob Bæk Tejs Houen, Shyam Narayanan, Will Rosenbaum, Jakub Tětek

Abstract: In classical statistics and distribution testing, it is often assumed that elements can be sampled from some distribution $P$, and that when an element $x$ is sampled, the probability $P(x)$ of sampling $x$ is also known. Recent work in distribution testing has shown that many algorithms are robust in the sense that they still produce correct output if the elements are drawn from any distribution $Q$ that is sufficiently close to $P$. This phenomenon raises interesting questions: under what conditions is a "noisy" distribution $Q$ sufficient, and what is the algorithmic cost of coping with this noise? We investigate these questions for the problem of estimating the sum of a multiset of $N$ real values $x_1, \ldots, x_N$. This problem is well-studied in the statistical literature in the case $P = Q$, where the Hansen-Hurwitz estimator is frequently used. We assume that for some known distribution $P$, values are sampled from a distribution $Q$ that is pointwise close to $P$. For every positive integer $k$ we define an estimator $ζ_k$ for $μ= \sum_i x_i$ whose bias is proportional to $γ^k$ (where our $ζ_1$ reduces to the classical Hansen-Hurwitz estimator). As a special case, we show that if $Q$ is pointwise $γ$-close to uniform and all $x_i \in \{0, 1\}$, for any $ε> 0$, we can estimate $μ$ to within additive error $εN$ using $m = Θ({N^{1-\frac{1}{k}} / ε^{2/k}})$ samples, where $k = \left\lceil (\log ε)/(\log γ)\right\rceil$. We show that this sample complexity is essentially optimal. Our bounds show that the sample complexity need not vary uniformly with the desired error parameter $ε$: for some values of $ε$, perturbations in its value have no asymptotic effect on the sample complexity, while for other values, any decrease in its value results in an asymptotically larger sample complexity.
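The classical Hansen-Hurwitz estimator that the abstract above uses as its starting point can be sketched briefly. This covers only the noiseless case $P = Q$ and is not the paper's bias-reduced estimator $ζ_k$; the function name `hansen_hurwitz` and its argument layout are hypothetical. Indices are drawn with replacement according to $P$, and each sampled value is reweighted by the inverse of its sampling probability.

```python
import random


def hansen_hurwitz(values, probs, m, rng=None):
    """Estimate sum(values) from m draws with replacement.

    values: the numbers x_1, ..., x_N (only the sampled ones are read).
    probs:  sampling probabilities P(i), assumed positive and summing to 1.
    The estimator (1/m) * sum of x_i / P(i) over the draws is unbiased
    when the draws really follow probs.
    """
    rng = rng or random.Random()
    indices = rng.choices(range(len(values)), weights=probs, k=m)
    return sum(values[i] / probs[i] for i in indices) / m
```

When the probabilities are exactly proportional to the values, every draw contributes the same term and the estimate equals the true sum with zero variance; with other choices of `probs` the estimate fluctuates around the sum.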

Submitted 1 August, 2022; originally announced August 2022.

arXiv:2203.09572 [pdf, other]

Triangle and Four Cycle Counting with Predictions in Graph Streams

Authors: Justin Y. Chen, Talya Eden, Piotr Indyk, Honghao Lin, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner, David P. Woodruff, Michael Zhang

Abstract: We propose data-driven one-pass streaming algorithms for estimating the number of triangles and four cycles, two fundamental problems in graph analytics that are widely studied in the graph data stream literature. Recently, (Hsu 2018) and (Jiang 2020) applied machine learning techniques in other data stream problems, using a trained oracle that can predict certain properties of the stream elements to improve on prior "classical" algorithms that did not use oracles. In this paper, we explore the power of a "heavy edge" oracle in multiple graph edge streaming models. In the adjacency list model, we present a one-pass triangle counting algorithm improving upon the previous space upper bounds without such an oracle. In the arbitrary order model, we present algorithms for both triangle and four cycle estimation with fewer passes and the same space complexity as in previous algorithms, and we show several of these bounds are optimal. We analyze our algorithms under several noise models, showing that the algorithms perform well even when the oracle errs. Our methodology expands upon prior work on "classical" streaming algorithms, as previous multi-pass and random order streaming algorithms can be seen as special cases of our algorithms, where the first pass or random order was used to implement the heavy edge oracle. Lastly, our experiments demonstrate advantages of the proposed method compared to state-of-the-art streaming algorithms.

Submitted 17 March, 2022; originally announced March 2022.

Comments: To be presented at ICLR 2022

arXiv:2111.10041 [pdf, other]

Embeddings and labeling schemes for A*

Authors: Talya Eden, Piotr Indyk, Haike Xu

Abstract: A* is a classic and popular method for graph search and path finding. It assumes the existence of a heuristic function $h(u,t)$ that estimates the shortest distance from any input node $u$ to the destination $t$. Traditionally, heuristics have been handcrafted by domain experts. However, over the last few years, there has been a growing interest in learning heuristic functions. Such learned heuristics estimate the distance between given nodes based on "features" of those nodes. In this paper we formalize and initiate the study of such feature-based heuristics. In particular, we consider heuristics induced by norm embeddings and distance labeling schemes, and provide lower bounds for the tradeoffs between the number of dimensions or bits used to represent each graph node, and the running time of the A* algorithm. We also show that, under natural assumptions, our lower bounds are almost optimal.
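To make the setting of the abstract above concrete, here is a minimal sketch of A* driven by a heuristic induced by a norm embedding: $h(u,t)$ is taken to be the Euclidean distance between the embedded points of $u$ and $t$. The graph format, the function name `a_star_distance`, and the choice of the Euclidean norm are assumptions for illustration; the paper's constructions and lower bounds are not reproduced here. The result is the exact shortest distance only if the heuristic never overestimates the true distance.

```python
import heapq
import itertools
import math


def a_star_distance(adj, emb, s, t):
    """Shortest s-t distance via A* with an embedding-based heuristic.

    adj: dict mapping node -> list of (neighbor, nonnegative weight).
    emb: dict mapping node -> coordinate tuple (the assumed embedding).
    Returns None when t is unreachable from s.
    """
    def h(u):
        return math.dist(emb[u], emb[t])

    best = {s: 0.0}            # best known distance from s to each node
    tie = itertools.count()    # tie-breaker so nodes are never compared
    heap = [(h(s), next(tie), 0.0, s)]
    while heap:
        _, _, g_u, u = heapq.heappop(heap)
        if g_u > best[u]:
            continue           # stale entry, a shorter route was found later
        if u == t:
            return g_u
        for v, w in adj.get(u, ()):
            cand = g_u + w
            if cand < best.get(v, math.inf):
                best[v] = cand
                heapq.heappush(heap, (cand + h(v), next(tie), cand, v))
    return None
```

Nodes may be re-expanded when a shorter route to them appears, so the sketch does not rely on the heuristic being consistent, only admissible.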

Submitted 18 November, 2021; originally announced November 2021.

Comments: ITCS 2022

arXiv:2110.15260 [pdf, ps, other]

Approximating the Arboricity in Sublinear Time

Authors: Talya Eden, Saleet Mossel, Dana Ron

Abstract: We consider the problem of approximating the arboricity of a graph $G= (V,E)$, which we denote by $\mathsf{arb}(G)$, in sublinear time, where the arboricity of a graph is the minimal number of forests required to cover its edges. An algorithm for this problem may perform degree and neighbor queries, and is allowed a small error probability. We design an algorithm that outputs an estimate $\hat{α}$, such that with probability $1-1/\textrm{poly}(n)$, $\mathsf{arb}(G)/c\log^2 n \leq \hat{α} \leq \mathsf{arb}(G)$, where $n=|V|$ and $c$ is a constant. The expected query complexity and running time of the algorithm are $O(n/\mathsf{arb}(G))\cdot \textrm{poly}(\log n)$, and this upper bound also holds with high probability. This bound is optimal for such an approximation up to a $\textrm{poly}(\log n)$ factor.

Submitted 28 October, 2021; originally announced October 2021.

arXiv:2110.13324 [pdf, other]

Sampling Multiple Nodes in Large Networks: Beyond Random Walks

Authors: Omri Ben-Eliezer, Talya Eden, Joel Oren, Dimitris Fotakis

Abstract: Sampling random nodes is a fundamental algorithmic primitive in the analysis of massive networks, with many modern graph mining algorithms critically relying on it. We consider the task of generating a large collection of random nodes in the network assuming limited query access (where querying a node reveals its set of neighbors). In current approaches, based on long random walks, the number of queries per sample scales linearly with the mixing time of the network, which can be prohibitive for large real-world networks. We propose a new method for sampling multiple nodes that bypasses the dependence on the mixing time by explicitly searching for less accessible components in the network. We test our approach on a variety of real-world and synthetic networks with up to tens of millions of nodes, demonstrating a query complexity improvement of up to $\times 20$ compared to the state of the art.

Submitted 25 October, 2021; originally announced October 2021.

Comments: To appear in 15th ACM International Conference on Web Search and Data Mining (WSDM 2022). Code available soon at: https://github.com/omribene/sampling-nodes

arXiv:2109.03785 [pdf, ps, other]

Adversarially Robust Streaming via Dense--Sparse Trade-offs

Authors: Omri Ben-Eliezer, Talya Eden, Krzysztof Onak

Abstract: A streaming algorithm is adversarially robust if it is guaranteed to perform correctly even in the presence of an adaptive adversary. Recently, several sophisticated frameworks for robustification of classical streaming algorithms have been developed. One of the main open questions in this area is whether efficient adversarially robust algorithms exist for moment estimation problems under the turnstile streaming model, where both insertions and deletions are allowed. So far, the best known space complexity for streams of length $m$, achieved using differential privacy (DP) based techniques, is of order $\tilde{O}(m^{1/2})$ for computing a constant-factor approximation with high constant probability. In this work, we propose a new simple approach to tracking moments by alternating between two different regimes: a sparse regime, in which we can explicitly maintain the current frequency vector and use standard sparse recovery techniques, and a dense regime, in which we make use of existing DP-based robustification frameworks. The results obtained using our technique break the previous $m^{1/2}$ barrier for any fixed $p$. More specifically, our space complexity for $F_2$-estimation is $\tilde{O}(m^{2/5})$ and for $F_0$-estimation, i.e., counting the number of distinct elements, it is $\tilde O(m^{1/3})$. All existing robustness frameworks have their space complexity depend multiplicatively on a parameter $λ$ called the flip number of the streaming problem, where $λ= m$ in turnstile moment estimation. The best known dependence in these frameworks (for constant factor approximation) is of order $\tilde{O}(λ^{1/2})$, and it is known to be tight for certain problems. Again, our approach breaks this barrier, achieving a dependence of order $\tilde{O}(λ^{1/2 - c(p)})$ for $F_p$-estimation, where $c(p) > 0$ depends only on $p$.

Submitted 8 September, 2021; originally announced September 2021.

arXiv:2107.06582 [pdf, other]

Towards a Decomposition-Optimal Algorithm for Counting and Sampling Arbitrary Motifs in Sublinear Time

Authors: Amartya Shankha Biswas, Talya Eden, Ronitt Rubinfeld

Abstract: We consider the problem of sampling and approximately counting an arbitrary given motif $H$ in a graph $G$, where access to $G$ is given via queries: degree, neighbor, and pair, as well as uniform edge sample queries. Previous algorithms for these tasks were based on a decomposition of $H$ into a collection of odd cycles and stars, denoted $\mathcal{D}^*(H)=\{O_{k_1}, \ldots, O_{k_q}, S_{p_1}, \ldots, S_{p_\ell}\}$. These algorithms were shown to be optimal for the case where $H$ is a clique or an odd-length cycle, but no other lower bounds were known. We present a new algorithm for sampling and approximately counting arbitrary motifs which, up to $\textrm{poly}(\log n)$ factors, is always at least as good as previous results, and for most graphs $G$ is strictly better. The main ingredient leading to this improvement is an improved uniform algorithm for sampling stars, which might be of independent interest, as it allows sampling vertices according to the $p$-th moment of the degree distribution. Finally, we prove that this algorithm is decomposition-optimal for decompositions that contain at least one odd cycle. These are the first lower bounds for motifs $H$ with a nontrivial decomposition, i.e., motifs that have more than a single component in their decomposition.

Submitted 19 July, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

arXiv:2106.08396 [pdf, other]

Learning-based Support Estimation in Sublinear Time

Authors: Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner

Abstract: We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $ \pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to \[ \log (1/\varepsilon) \cdot n^{1-Θ(1/\log(1/\varepsilon))}. \] We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from Hsu et al. (ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state of the art algorithm.

Submitted 15 June, 2021; originally announced June 2021.

Comments: 17 pages. Published as a conference paper in ICLR 2021

arXiv:2012.04090 [pdf, ps, other]

Almost Optimal Bounds for Sublinear-Time Sampling of $k$-Cliques: Sampling Cliques is Harder Than Counting

Authors: Talya Eden, Dana Ron, Will Rosenbaum

Abstract: In this work, we consider the problem of sampling a $k$-clique in a graph from an almost uniform distribution in sublinear time in the general graph query model. Specifically, the algorithm should output each $k$-clique with probability $(1\pm ε)/n_k$, where $n_k$ denotes the number of $k$-cliques in the graph and $ε$ is a given approximation parameter. We prove that the query complexity of this problem is \[ Θ^*\left(\max\left\{ \left(\frac{(nα)^{k/2}}{ n_k}\right)^{\frac{1}{k-1}} ,\; \min\left\{nα,\frac{nα^{k-1}}{n_k} \right\}\right\}\right), \] where $n$ is the number of vertices in the graph, $α$ is its arboricity, and $Θ^*$ suppresses the dependence on $(\log n/ε)^{O(k)}$. Interestingly, this establishes a separation between approximate counting and approximate uniform sampling in the sublinear regime. For example, if $k=3$, $α= O(1)$, and $n_3$ (the number of triangles) is $Θ(n)$, then we get a lower bound of $Ω(n^{1/4})$ (for constant $ε$), while under these conditions, a $(1\pm ε)$-approximation of $n_3$ can be obtained by performing $\textrm{poly}(\log(n/ε))$ queries (Eden, Ron and Seshadhri, SODA20). Our lower bound follows from a construction of a family of graphs with arboricity $α$ such that in each graph there are $n_k$ cliques (of size $k$), where one of these cliques is "hidden" and hence hard to sample. Our upper bound is based on defining a special auxiliary graph $H_k$, such that sampling edges almost uniformly in $H_k$ translates to sampling $k$-cliques almost uniformly in the original graph $G$. We then build on a known edge-sampling algorithm (Eden, Ron and Rosenbaum, ICALP19) to sample edges in $H_k$, where the challenge is to simulate queries to $H_k$ while being given access only to $G$.

Submitted 7 December, 2020; originally announced December 2020.

arXiv:2008.08032 [pdf, ps, other]

Sampling Multiple Edges Efficiently

Authors: Talya Eden, Saleet Mossel, Ronitt Rubinfeld

Abstract: We present a sublinear time algorithm that allows one to sample multiple edges from a distribution that is pointwise $ε$-close to the uniform distribution, in an \emph{amortized-efficient} fashion. We consider the adjacency list query model, where access to a graph $G$ is given via degree and neighbor queries. The problem of sampling a single edge in this model was raised by Eden and Rosenbaum (SOSA 18). Let $n$ and $m$ denote the number of vertices and edges of $G$, respectively. Eden and Rosenbaum provided upper and lower bounds of $Θ^*(n/\sqrt m)$ for sampling a single edge in general graphs (where $O^*(\cdot)$ suppresses $\textrm{poly}(1/ε)$ and $\textrm{poly}(\log n)$ dependencies). We ask whether the query complexity lower bound for sampling a single edge can be circumvented when multiple samples are required. That is, can we get an improved amortized per-sample cost if we allow a preprocessing phase? We answer in the affirmative. We present an algorithm that, if one knows the number of required samples $q$ in advance, has an overall cost that is sublinear in $q$, namely, $O^*(\sqrt q \cdot(n/\sqrt m))$, which is strictly preferable to the $O^*(q\cdot (n/\sqrt m))$ cost resulting from $q$ invocations of the algorithm by Eden and Rosenbaum. Subsequent to a preliminary version of this work, Tětek and Thorup (arXiv, preprint) proved that this bound is essentially optimal.

Submitted 19 July, 2021; v1 submitted 18 August, 2020; originally announced August 2020.

ACM Class: F.2.2; G.2.2

arXiv:2002.08299 [pdf, other]

Massively Parallel Algorithms for Small Subgraph Counting

Authors: Amartya Shankha Biswas, Talya Eden, Quanquan C. Liu, Slobodan Mitrović, Ronitt Rubinfeld

Abstract: Over the last two decades, frameworks for distributed-memory parallel computation, such as MapReduce, Hadoop, Spark and Dryad, have gained significant popularity with the growing prevalence of large network datasets. The Massively Parallel Computation (MPC) model is the de facto standard for studying graph algorithms in these frameworks theoretically. Subgraph counting is one such fundamental problem in analyzing massive graphs, with the main algorithmic challenges centering on designing methods which are both scalable and accurate. Given a graph $G=(V, E)$ with $n$ vertices, $m$ edges and $T$ triangles, our first result is an algorithm that outputs a $(1+\varepsilon)$-approximation to $T$, with asymptotically \emph{optimal round and total space complexity} provided any $S \geq \max{(\sqrt m, n^2/m)}$ space per machine and assuming $T=Ω(\sqrt{m/n})$. Our result gives a quadratic improvement on the bound on $T$ over previous works. We also provide a simple extension of our result to counting \emph{any} subgraph of size $k$ for constant $k \geq 1$. Our second result is an $O_{\varepsilon}(\log \log n)$-round algorithm for exactly counting the number of triangles, whose total space usage is parametrized by the \emph{arboricity} $α$ of the input graph. We extend this result to exactly counting $k$-cliques for any constant $k$. Finally, we prove that a recent result of Bera, Pashanasangi and Seshadhri (ITCS 2020) for exactly counting all subgraphs of size at most $5$ can be implemented in the MPC model in total space.

Submitted 18 July, 2022; v1 submitted 19 February, 2020; originally announced February 2020.

Comments: Abstract truncated per arXiv requirements

arXiv:1902.08086 [pdf, ps, other]

The Arboricity Captures the Complexity of Sampling Edges

Authors: Talya Eden, Dana Ron, Will Rosenbaum

Abstract: In this paper, we revisit the problem of sampling edges in an unknown graph $G = (V, E)$ from a distribution that is (pointwise) almost uniform over $E$. We consider the case where there is some a priori upper bound on the arboricity of $G$. Given query access to a graph $G$ over $n$ vertices and of average degree $d$ and arboricity at most $α$, we design an algorithm that performs $O\!\left(\frac{α}{d} \cdot \frac{\log^3 n}{\varepsilon}\right)$ queries in expectation and returns an edge in the graph such that every edge $e \in E$ is sampled with probability $(1 \pm \varepsilon)/m$. The algorithm performs two types of queries: degree queries and neighbor queries. We show that the upper bound is tight (up to poly-logarithmic factors and the dependence on $\varepsilon$), as $Ω\!\left(\frac{α}{d} \right)$ queries are necessary for the easier task of sampling edges from any distribution over $E$ that is close to uniform in total variation distance. We also prove that even if $G$ is a tree (i.e., $α= 1$ so that $\frac{α}{d}=Θ(1)$), $Ω\left(\frac{\log n}{\log\log n}\right)$ queries are necessary to sample an edge from any distribution that is pointwise close to uniform, thus establishing that a $\mathrm{poly}(\log n)$ factor is necessary for constant $α$. Finally, we show how our algorithm can be applied to obtain a new result on approximately counting subgraphs, based on the recent work of Assadi, Kapralov, and Khanna (ITCS, 2019).

Submitted 21 February, 2019; originally announced February 2019.

arXiv:1811.04425 [pdf, ps, other]

Faster sublinear approximations of $k$-cliques for low arboricity graphs

Authors: Talya Eden, Dana Ron, C. Seshadhri

Abstract: Given query access to an undirected graph $G$, we consider the problem of computing a $(1\pmε)$-approximation of the number of $k$-cliques in $G$. The standard query model for general graphs allows for degree queries, neighbor queries, and pair queries. Let $n$ be the number of vertices, $m$ be the number of edges, and $n_k$ be the number of $k$-cliques. Previous work by Eden, Ron and Seshadhri (STOC 2018) gives an $O^*(\frac{n}{n^{1/k}_k} + \frac{m^{k/2}}{n_k})$-time algorithm for this problem (we use $O^*(\cdot)$ to suppress $\textrm{poly}(\log n, 1/ε, k^k)$ dependencies). Moreover, this bound is nearly optimal when the expression is sublinear in the size of the graph. Our motivation is to circumvent this lower bound, by parameterizing the complexity in terms of \emph{graph arboricity}. The arboricity of $G$ is a measure for the graph density "everywhere". We design an algorithm for the class of graphs with arboricity at most $α$, whose running time is $O^*(\min\{\frac{nα^{k-1}}{n_k},\, \frac{n}{n_k^{1/k}}+\frac{m α^{k-2}}{n_k} \})$. We also prove a nearly matching lower bound. For all graphs, the arboricity is $O(\sqrt m)$, so this bound subsumes all previous results on sublinear clique approximation. As a special case of interest, consider minor-closed families of graphs, which have constant arboricity. Our result implies that for any minor-closed family of graphs, there is a $(1\pmε)$-approximation algorithm for $n_k$ that has running time $O^*(\frac{n}{n_k})$. Such a bound was not known even for the special (classic) case of triangle counting in planar graphs.

Submitted 11 November, 2018; originally announced November 2018.

arXiv:1710.08607 [pdf, other]

Provable and practical approximations for the degree distribution using sublinear graph samples

Authors: Talya Eden, Shweta Jain, Ali Pinar, Dana Ron, C. Seshadhri

Abstract: The degree distribution is one of the most fundamental properties used in the analysis of massive graphs. There is a large literature on graph sampling, where the goal is to estimate properties (especially the degree distribution) of a large graph through a small, random sample. The degree distribution estimation poses a significant challenge, due to its heavy-tailed nature and the large variance in degrees. We design a new algorithm, SADDLES, for this problem, using recent mathematical techniques from the field of sublinear algorithms. The SADDLES algorithm gives provably accurate outputs for all values of the degree distribution. For the analysis, we define two fatness measures of the degree distribution, called the $h$-index and the $z$-index. We prove that SADDLES is sublinear in the graph size when these indices are large. A corollary of this result is a provably sublinear algorithm for any degree distribution bounded below by a power law. We deploy our new algorithm on a variety of real datasets and demonstrate its excellent empirical behavior. In all instances, we get extremely accurate approximations for all values in the degree distribution by observing at most $1\%$ of the vertices. This is a major improvement over the state-of-the-art sampling algorithms, which typically sample more than $10\%$ of the vertices to give comparable results. We also observe that the $h$ and $z$-indices of real graphs are large, validating our theoretical analysis.

Submitted 28 August, 2018; v1 submitted 24 October, 2017; originally announced October 2017.

Comments: Longer version of the WWW 2018 submission

arXiv:1709.04262 [pdf, ps, other]

Lower Bounds for Approximating Graph Parameters via Communication Complexity

Authors: Talya Eden, Will Rosenbaum

Abstract: In a celebrated work, Blais, Brody, and Matulef developed a technique for proving property testing lower bounds via reductions from communication complexity. Their work focused on testing properties of functions, and yielded new lower bounds as well as simplified analyses of known lower bounds. Here, we take a further step in generalizing the methodology of Blais et al. to analyze the query complexity of graph parameter estimation problems. In particular, our technique decouples the lower bound arguments from the representation of the graph, allowing it to work with any query type. We illustrate our technique by providing new simpler proofs of previously known tight lower bounds for the query complexity of several graph problems: estimating the number of edges in a graph, sampling edges from an almost-uniform distribution, estimating the number of triangles (and more generally, $r$-cliques) in a graph, and estimating the moments of the degree distribution of a graph. We also prove new lower bounds for estimating the edge connectivity of a graph and estimating the number of instances of any fixed subgraph in a graph. We show that the lower bounds for estimating the number of triangles and edge connectivity also hold in a strictly stronger computational model that allows access to uniformly random edge samples.

Submitted 25 January, 2018; v1 submitted 13 September, 2017; originally announced September 2017.

Comments: Current version includes new section on graph connectivity, as well as various improvements throughout

arXiv:1707.04864 [pdf, ps, other]

Testing bounded arboricity

Authors: Talya Eden, Reut Levi, Dana Ron

Abstract: In this paper we consider the problem of testing whether a graph has bounded arboricity. The family of graphs with bounded arboricity includes, among others, bounded-degree graphs, all minor-closed graph classes (e.g. planar graphs, graphs with bounded treewidth) and randomly generated preferential attachment graphs. Graphs with bounded arboricity have been studied extensively in the past, in particular since for many problems they allow for much more efficient algorithms and/or better approximation ratios. We present a tolerant tester in the sparse-graphs model. The sparse-graphs model allows access to degree queries and neighbor queries, and the distance is defined with respect to the actual number of edges. More specifically, our algorithm distinguishes between graphs that are $ε$-close to having arboricity $α$ and graphs that are $c \cdot ε$-far from having arboricity $3α$, where $c$ is an absolute small constant. The query complexity and running time of the algorithm are $\tilde{O}\left(\frac{n}{\sqrt{m}}\cdot \frac{\log(1/ε)}{ε} + \frac{n\cdot α}{m} \cdot \left(\frac{1}{ε}\right)^{O(\log(1/ε))}\right)$ where $n$ denotes the number of vertices and $m$ denotes the number of edges. In terms of the dependence on $n$ and $m$ this bound is optimal up to poly-logarithmic factors since $Ω(n/\sqrt{m})$ queries are necessary (and $α= O(\sqrt{m})$). We leave it as an open question whether the dependence on $1/ε$ can be improved from quasi-polynomial to polynomial. Our techniques include an efficient local simulation for approximating the outcome of a global (almost) forest-decomposition algorithm as well as a tailored procedure of edge sampling.

Submitted 27 April, 2021; v1 submitted 16 July, 2017; originally announced July 2017.

arXiv:1707.04858 [pdf, ps, other]

On Approximating the Number of $k$-cliques in Sublinear Time

Authors: Talya Eden, Dana Ron, C. Seshadhri

Abstract: We study the problem of approximating the number of $k$-cliques in a graph when given query access to the graph. We consider the standard query model for general graphs via (1) degree queries, (2) neighbor queries and (3) pair queries. Let $n$ denote the number of vertices in the graph, $m$ the number of edges, and $C_k$ the number of $k$-cliques. We design an algorithm that outputs a $(1+\varepsilon)$-approximation (with high probability) for $C_k$, whose expected query complexity and running time are $O\left(\frac{n}{C_k^{1/k}}+\frac{m^{k/2}}{C_k}\right)\cdot\textrm{poly}(\log n,1/\varepsilon,k)$. Hence, the complexity of the algorithm is sublinear in the size of the graph for $C_k = ω(m^{k/2-1})$. Furthermore, we prove a lower bound showing that the query complexity of our algorithm is essentially optimal (up to the dependence on $\log n$, $1/\varepsilon$ and $k$). The previous results in this vein are by Feige (SICOMP 06) and by Goldreich and Ron (RSA 08) for edge counting ($k=2$) and by Eden et al. (FOCS 2015) for triangle counting ($k=3$). Our result matches the complexities of these results. The previous result by Eden et al. hinges on a certain amortization technique that works only for triangle counting, and does not generalize to larger cliques. We obtain a general algorithm that works for any $k\geq 3$ by designing a procedure that samples each $k$-clique incident to a given set $S$ of vertices with approximately equal probability. The primary difficulty is in finding cliques incident to purely high-degree vertices, since random sampling within neighbors has a low success probability. This is achieved by an algorithm that samples uniform random high-degree vertices and a careful tradeoff between estimating cliques incident purely to high-degree vertices and those that include a low-degree vertex.

Submitted 12 March, 2018; v1 submitted 16 July, 2017; originally announced July 2017.

arXiv:1706.09748 [pdf, ps, other]

On Sampling Edges Almost Uniformly

Authors: Talya Eden, Will Rosenbaum

Abstract: We consider the problem of sampling an edge almost uniformly from an unknown graph, $G = (V, E)$. Access to the graph is provided via queries of the following types: (1) uniform vertex queries, (2) degree queries, and (3) neighbor queries. We describe an algorithm that returns a random edge $e \in E$ using $\tilde{O}(n / \sqrt{\varepsilon m})$ queries in expectation, where $n = |V|$ is the number of vertices, and $m = |E|$ is the number of edges, such that each edge $e$ is sampled with probability $(1 \pm \varepsilon)/m$. We prove that our algorithm is optimal in the sense that any algorithm that samples an edge from an almost-uniform distribution must perform $Ω(n / \sqrt{m})$ queries.
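
To make the three query types concrete, here is a minimal Python sketch of a simple rejection-sampling baseline; it is not the algorithm of the paper. It assumes an upper bound `d_max` on the maximum degree is known in advance, which the abstract does not assume, and the names `sample_edge`, `adj`, and `max_trials` are illustrative. Each trial outputs a fixed ordered pair $(v, u)$ with probability $\frac{1}{n}\cdot\frac{\deg(v)}{d_{\max}}\cdot\frac{1}{\deg(v)} = \frac{1}{n\, d_{\max}}$, so accepted edges are exactly uniform, at an expected cost of about $n\, d_{\max}/(2m)$ trials.

```python
import random

def sample_edge(adj, d_max, rng, max_trials=100_000):
    """Rejection-sampling baseline (illustrative, not the paper's algorithm).

    adj:   dict mapping each vertex to a list of its neighbors
    d_max: assumed known upper bound on the maximum degree
    Returns an ordered pair (v, u) with u a neighbor of v, or None
    if no trial is accepted within max_trials.
    """
    vertices = list(adj)  # O(n) here; a real query model hides this behind vertex queries
    for _ in range(max_trials):
        v = rng.choice(vertices)            # uniform vertex query
        deg = len(adj[v])                   # degree query
        if rng.random() < deg / d_max:      # accept with probability deg / d_max
            return v, adj[v][rng.randrange(deg)]  # neighbor query
    return None

# Toy usage: a star on vertices 0..2 plus an isolated vertex 3.
toy = {0: [1, 2], 1: [0], 2: [0], 3: []}
edge = sample_edge(toy, d_max=2, rng=random.Random(0))
```

The baseline degrades when `d_max` is much larger than the average degree, which is the regime where more careful sublinear algorithms matter.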

Submitted 29 June, 2017; originally announced June 2017.

arXiv:1607.03938 [pdf, ps, other]

Tolerant Junta Testing and the Connection to Submodular Optimization and Function Isomorphism

Authors: Eric Blais, Clément L. Canonne, Talya Eden, Amit Levi, Dana Ron

Abstract: A function $f\colon \{-1,1\}^n \to \{-1,1\}$ is a $k$-junta if it depends on at most $k$ of its variables. We consider the problem of tolerant testing of $k$-juntas, where the testing algorithm must accept any function that is $ε$-close to some $k$-junta and reject any function that is $ε'$-far from every $k'$-junta for some $ε'= O(ε)$ and $k' = O(k)$. Our first result is an algorithm that solves this problem with query complexity polynomial in $k$ and $1/ε$. This result is obtained via a new polynomial-time approximation algorithm for submodular function minimization (SFM) under large cardinality constraints, which holds even when only given approximate oracle access to the function. Our second result considers the case where $k'=k$. We show how to obtain a smooth tradeoff between the amount of tolerance and the query complexity in this setting. Specifically, we design an algorithm that, given $ρ\in(0,1/2)$, accepts any function that is $\frac{ερ}{16}$-close to some $k$-junta and rejects any function that is $ε$-far from every $k$-junta. The query complexity of the algorithm is $O\big( \frac{k\log k}{ερ(1-ρ)^k} \big)$. Finally, we show how to apply the second result to the problem of tolerant isomorphism testing between two unknown Boolean functions $f$ and $g$. We give an algorithm for this problem whose query complexity only depends on the (unknown) smallest $k$ such that either $f$ or $g$ is close to being a $k$-junta.

Submitted 3 November, 2016; v1 submitted 13 July, 2016; originally announced July 2016.

Comments: Polished the writing, corrected typos, and fixed an issue in the proof of Theorem 1.2

arXiv:1604.03661 [pdf, ps, other]

Sublinear Time Estimation of Degree Distribution Moments: The Degeneracy Connection

Authors: Talya Eden, Dana Ron, C. Seshadhri

Abstract: We revisit the classic problem of estimating the degree distribution moments of an undirected graph. Consider an undirected graph $G=(V,E)$ with $n$ vertices, and define (for $s > 0$) $μ_s = \frac{1}{n}\cdot\sum_{v \in V} d^s_v$. Our aim is to estimate $μ_s$ within a multiplicative error of $(1+ε)$ (for a given approximation parameter $ε>0$) in sublinear time. We consider the sparse graph model that allows access to: uniform random vertices, queries for the degree of any vertex, and queries for a neighbor of any vertex. For the case of $s=1$ (the average degree), $\widetilde{O}(\sqrt{n})$ queries suffice for any constant $ε$ (Feige, SICOMP 06 and Goldreich-Ron, RSA 08). Gonen-Ron-Shavitt (SIDMA 11) extended this result to all integral $s > 0$, by designing an algorithm that performs $\widetilde{O}(n^{1-1/(s+1)})$ queries. We design a new, significantly simpler algorithm for this problem. In the worst case, it exactly matches the bounds of Gonen-Ron-Shavitt, and has a much simpler proof. More importantly, the running time of this algorithm is connected to the degeneracy of $G$. This is (essentially) the maximum density of an induced subgraph. For the family of graphs with degeneracy at most $α$, it has a query complexity of $\widetilde{O}\left(\frac{n^{1-1/s}}{μ^{1/s}_s} \Big(α^{1/s} + \min\{α,μ^{1/s}_s\}\Big)\right) = \widetilde{O}(n^{1-1/s}α/μ^{1/s}_s)$. Thus, for the class of bounded degeneracy graphs (which includes all minor-closed families and preferential attachment graphs), we can estimate the average degree in $\widetilde{O}(1)$ queries, and can estimate the variance of the degree distribution in $\widetilde{O}(\sqrt{n})$ queries. This is a major improvement over the previous worst-case bounds. Our key insight is in designing an estimator for $μ_s$ that has low variance when $G$ does not have large dense subgraphs.
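
As a point of reference for the definition $μ_s = \frac{1}{n}\sum_{v \in V} d^s_v$, the following is a minimal Python sketch, not the estimator of the paper. It computes $μ_s$ exactly from a full degree list and also shows the naive estimator that averages $d_v^s$ over uniformly sampled vertices. The function names and the toy star graph are illustrative assumptions.

```python
import random

def exact_moment(degrees, s):
    """Exact mu_s = (1/n) * sum over all vertices of d_v ** s."""
    return sum(d ** s for d in degrees) / len(degrees)

def naive_moment_estimate(degrees, s, samples, rng):
    """Average of d_v ** s over uniformly sampled vertices (with replacement).

    Unbiased, but its variance can be large on heavy-tailed degree sequences.
    """
    n = len(degrees)
    return sum(degrees[rng.randrange(n)] ** s for _ in range(samples)) / samples

# Toy star graph: one center of degree 9 and nine leaves of degree 1.
star = [9] + [1] * 9
mu_1 = exact_moment(star, 1)  # (9 + 9 * 1) / 10
mu_2 = exact_moment(star, 2)  # (81 + 9 * 1) / 10
estimate = naive_moment_estimate(star, 2, samples=5, rng=random.Random(0))
```

On the star, a handful of uniform samples usually miss the center, so the naive estimate of $μ_2$ is typically far below the true value. This is the kind of variance problem that motivates more careful estimators.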

Submitted 16 February, 2017; v1 submitted 13 April, 2016; originally announced April 2016.

arXiv:1504.00954 [pdf, ps, other]

doi 10.1109/FOCS.2015.44

Approximately Counting Triangles in Sublinear Time

Authors: Talya Eden, Amit Levi, Dana Ron, C. Seshadhri

Abstract: We consider the problem of estimating the number of triangles in a graph. This problem has been extensively studied in both theory and practice, but all existing algorithms read the entire graph. In this work we design a \emph{sublinear-time} algorithm for approximating the number of triangles in a graph, where the algorithm is given query access to the graph. The allowed queries are degree queries, vertex-pair queries and neighbor queries. We show that for any given approximation parameter $0<ε<1$, the algorithm provides an estimate $\widehat{t}$ such that with high constant probability, $(1-ε)\cdot t< \widehat{t}<(1+ε)\cdot t$, where $t$ is the number of triangles in the graph $G$. The expected query complexity of the algorithm is $\left(\frac{n}{t^{1/3}} + \min\left\{m, \frac{m^{3/2}}{t}\right\}\right)\cdot \textrm{poly}(\log n, 1/ε)$, where $n$ is the number of vertices in the graph and $m$ is the number of edges, and the expected running time is $\left(\frac{n}{t^{1/3}} + \frac{m^{3/2}}{t}\right)\cdot \textrm{poly}(\log n, 1/ε)$. We also prove that $Ω\!\left(\frac{n}{t^{1/3}} + \min\left\{m, \frac{m^{3/2}}{t}\right\}\right)$ queries are necessary, thus establishing that the query complexity of this algorithm is optimal up to polylogarithmic factors in $n$ (and the dependence on $1/ε$).
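
For contrast with the sublinear-time setting, the following minimal Python sketch shows the kind of exact baseline that reads the entire graph; it is not the algorithm of the paper. It counts each triangle once by enumerating vertex triples $u < v < w$ over adjacency sets. The name `count_triangles` and the toy graphs are illustrative assumptions.

```python
def count_triangles(adj):
    """Exact triangle count t for an undirected simple graph.

    adj: dict mapping each vertex to the set of its neighbors
         (vertices must be mutually comparable, e.g. integers).
    Each triangle {u, v, w} is counted once, at its ordering u < v < w.
    """
    t = 0
    for u in adj:
        for v in adj[u]:
            if v <= u:
                continue
            for w in adj[v]:
                # w in adj[u] plays the role of a vertex-pair query
                if w > v and w in adj[u]:
                    t += 1
    return t

# Toy graphs: the complete graph on 4 vertices and a path on 3 vertices.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
path = {0: {1}, 1: {0, 2}, 2: {1}}
t_k4 = count_triangles(k4)
t_path = count_triangles(path)
```

This baseline touches every edge at least once, so its cost is at least linear in $m$, whereas the bounds quoted in the abstract can be far smaller when $t$ is large.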

Submitted 22 September, 2015; v1 submitted 3 April, 2015; originally announced April 2015.

Comments: To appear in the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2015)

Showing 1–29 of 29 results for author: Eden, T