Search | arXiv e-print repository

Average-Case Local Computation Algorithms

Authors: Amartya Shankha Biswas, Ruidi Cao, Edward Pyne, Ronitt Rubinfeld

Abstract: We initiate the study of Local Computation Algorithms on average-case inputs. In the Local Computation Algorithm (LCA) model, we are given probe access to a huge graph, and asked to answer membership queries about some combinatorial structure on the graph, answering each query with sublinear work. For instance, an LCA for the $k$-spanner problem gives access to a sparse subgraph $H\subseteq G$ that preserves distances up to a factor of $k$. We build simple LCAs for this problem assuming the input graph is drawn from the well-studied Erdős-Rényi and Preferential Attachment graph models. In both cases, our spanners achieve size and stretch tradeoffs that are impossible to achieve for general graphs, while having dramatically lower query complexity than worst-case LCAs. Our second result investigates the intersection of LCAs with Local Access Generators (LAGs). Local Access Generators provide efficient query access to a random object, for instance an Erdős-Rényi random graph. We explore the natural problem of generating a random graph together with a combinatorial structure on it. We show that this combination can be easier to solve than focusing on each problem by itself, by building a fast, simple algorithm that provides access to an Erdős-Rényi random graph together with a maximal independent set.

Submitted 29 February, 2024; originally announced March 2024.

Comments: 27 pages

arXiv:2211.03232

Exponentially Improving the Complexity of Simulating the Weisfeiler-Lehman Test with Graph Neural Networks

Authors: Anders Aamand, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Nicholas Schiefer, Sandeep Silwal, Tal Wagner

Abstract: Recent work shows that the expressive power of Graph Neural Networks (GNNs) in distinguishing non-isomorphic graphs is exactly the same as that of the Weisfeiler-Lehman (WL) graph test. In particular, they show that the WL test can be simulated by GNNs. However, those simulations involve neural networks for the 'combine' function of size polynomial or even exponential in the number of graph nodes $n$, as well as feature vectors of length linear in $n$. We present an improved simulation of the WL test on GNNs with \emph{exponentially} lower complexity. In particular, the neural network implementing the combine function in each node has only a polylogarithmic number of parameters in $n$, and the feature vectors exchanged by the nodes of the GNN consist of only $O(\log n)$ bits. We also give logarithmic lower bounds for the feature vector length and the size of the neural networks, showing the (near)-optimality of our construction.

Submitted 21 December, 2022; v1 submitted 6 November, 2022; originally announced November 2022.

Comments: 22 pages, 5 figures, published at NeurIPS 2022. Updated funding statements
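
As background for the entry above, the classical 1-dimensional WL test (color refinement) can be sketched as follows. This is the textbook procedure, not the GNN simulation from the paper; the function names `wl_histogram` and `wl_distinguishes` and the adjacency-dict representation are assumptions made for illustration.

```python
from collections import Counter


def wl_histogram(adj, rounds):
    """Run `rounds` steps of color refinement on a graph.

    `adj` maps each vertex to a list of its neighbors (a hypothetical
    representation chosen for this sketch). Colors are kept as nested
    tuples, so histograms from different graphs are directly comparable.
    """
    colors = {v: () for v in adj}  # every vertex starts with the same color
    for _ in range(rounds):
        # New color = (own color, sorted multiset of neighbor colors).
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())


def wl_distinguishes(adj_a, adj_b, rounds):
    """True if the color histograms differ, i.e. the test separates the graphs."""
    return wl_histogram(adj_a, rounds) != wl_histogram(adj_b, rounds)
```

The nested-tuple colors grow quickly with the number of rounds, so this sketch is only suitable for small graphs; practical implementations compress colors to short identifiers after every round.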

arXiv:2204.11894

Properly learning monotone functions via local reconstruction

Authors: Jane Lange, Ronitt Rubinfeld, Arsen Vasilyan

Abstract: We give a $2^{\tilde{O}(\sqrt{n}/ε)}$-time algorithm for properly learning monotone Boolean functions under the uniform distribution over $\{0,1\}^n$. Our algorithm is robust to adversarial label noise and has a running time nearly matching that of the state-of-the-art improper learning algorithm of Bshouty and Tamon (JACM '96) and an information-theoretic lower bound of Blais et al (RANDOM '15). Prior to this work, no proper learning algorithm with running time smaller than $2^{Ω(n)}$ was known to exist. The core of our proper learner is a \emph{local computation algorithm} for sorting binary labels on a poset. Our algorithm is built on a body of work on distributed greedy graph algorithms; specifically we rely on a recent work of Ghaffari (FOCS'22), which gives an efficient algorithm for computing maximal matchings in a graph in the LCA model of Rubinfeld et al and Alon et al (ICS'11, SODA'12). The applications of our local sorting algorithm extend beyond learning on the Boolean cube: we also give a tolerant tester for Boolean functions over general posets that distinguishes functions that are $ε/3$-close to monotone from those that are $ε$-far. Previous tolerant testers for the Boolean cube only distinguished between $ε/Ω(\sqrt{n})$-close and $ε$-far.

Submitted 27 March, 2023; v1 submitted 25 April, 2022; originally announced April 2022.

Comments: FOCS 2022

arXiv:2204.07196

Testing distributional assumptions of learning algorithms

Authors: Ronitt Rubinfeld, Arsen Vasilyan

Abstract: There are many high-dimensional function classes that have fast agnostic learning algorithms when assumptions on the distribution of examples can be made, such as Gaussianity or uniformity over the domain. But how can one be confident that data indeed satisfies such an assumption, so that one can trust the output quality of the agnostic learning algorithm? We propose a model by which to systematically study the design of tester-learner pairs $(\mathcal{A},\mathcal{T})$, such that if the distribution on examples in the data passes the tester $\mathcal{T}$ then one can safely trust the output of the agnostic learner $\mathcal{A}$ on the data. To demonstrate the power of the model, we apply it to the classical problem of agnostically learning halfspaces under the standard Gaussian distribution and present a tester-learner pair with combined run-time of $n^{\tilde{O}(1/ε^4)}$. This qualitatively matches that of the best known ordinary agnostic learning algorithms for this task. In contrast, finite sample Gaussianity testers do not exist for the $L_1$ and EMD distance measures. A key step is to show that halfspaces are well-approximated with low-degree polynomials relative to distributions with low-degree moments close to those of a Gaussian. We also go beyond spherically-symmetric distributions, and give a tester-learner pair for halfspaces under the uniform distribution on $\{0,1\}^n$ with combined run-time of $n^{\tilde{O}(1/ε^4)}$. This is achieved using polynomial approximation theory and critical index machinery.
We also show there exist some well-studied settings where $2^{\tilde{O}(\sqrt{n})}$ run-time agnostic learning algorithms are available, yet the combined run-times of tester-learner pairs must be as high as $2^{Ω(n)}$. On that account, the design of tester-learner pairs is a research direction in its own right independent of standard agnostic learning.

Submitted 19 November, 2022; v1 submitted 14 April, 2022; originally announced April 2022.

ACM Class: F.3

arXiv:2203.09572

Triangle and Four Cycle Counting with Predictions in Graph Streams

Authors: Justin Y. Chen, Talya Eden, Piotr Indyk, Honghao Lin, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner, David P. Woodruff, Michael Zhang

Abstract: We propose data-driven one-pass streaming algorithms for estimating the number of triangles and four cycles, two fundamental problems in graph analytics that are widely studied in the graph data stream literature. Recently, (Hsu 2018) and (Jiang 2020) applied machine learning techniques in other data stream problems, using a trained oracle that can predict certain properties of the stream elements to improve on prior "classical" algorithms that did not use oracles. In this paper, we explore the power of a "heavy edge" oracle in multiple graph edge streaming models. In the adjacency list model, we present a one-pass triangle counting algorithm improving upon the previous space upper bounds without such an oracle. In the arbitrary order model, we present algorithms for both triangle and four cycle estimation with fewer passes and the same space complexity as in previous algorithms, and we show several of these bounds are optimal. We analyze our algorithms under several noise models, showing that the algorithms perform well even when the oracle errs. Our methodology expands upon prior work on "classical" streaming algorithms, as previous multi-pass and random order streaming algorithms can be seen as special cases of our algorithms, where the first pass or random order was used to implement the heavy edge oracle. Lastly, our experiments demonstrate advantages of the proposed method compared to state-of-the-art streaming algorithms.

Submitted 17 March, 2022; originally announced March 2022.

Comments: To be presented at ICLR 2022

arXiv:2107.06582

Towards a Decomposition-Optimal Algorithm for Counting and Sampling Arbitrary Motifs in Sublinear Time

Authors: Amartya Shankha Biswas, Talya Eden, Ronitt Rubinfeld

Abstract: We consider the problem of sampling and approximately counting an arbitrary given motif $H$ in a graph $G$, where access to $G$ is given via queries: degree, neighbor, and pair, as well as uniform edge sample queries. Previous algorithms for these tasks were based on a decomposition of $H$ into a collection of odd cycles and stars, denoted $\mathcal{D}^*(H)=\{O_{k_1}, \ldots, O_{k_q}, S_{p_1}, \ldots, S_{p_\ell}\}$. These algorithms were shown to be optimal for the case where $H$ is a clique or an odd-length cycle, but no other lower bounds were known. We present a new algorithm for sampling and approximately counting arbitrary motifs which, up to $\textrm{poly}(\log n)$ factors, is always at least as good as previous results, and for most graphs $G$ is strictly better. The main ingredient leading to this improvement is an improved uniform algorithm for sampling stars, which might be of independent interest, as it allows one to sample vertices according to the $p$-th moment of the degree distribution. Finally, we prove that this algorithm is \emph{decomposition-optimal} for decompositions that contain at least one odd cycle. These are the first lower bounds for motifs $H$ with a nontrivial decomposition, i.e., motifs that have more than a single component in their decomposition.

Submitted 19 July, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

arXiv:2106.08396

Learning-based Support Estimation in Sublinear Time

Authors: Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner

Abstract: We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $\pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to $\log (1/\varepsilon) \cdot n^{1-Θ(1/\log(1/\varepsilon))}$. We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from Hsu et al (ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state-of-the-art algorithm.

Submitted 15 June, 2021; originally announced June 2021.

Comments: 17 pages. Published as a conference paper in ICLR 2021

arXiv:2102.07740

Local Access to Random Walks

Authors: Amartya Shankha Biswas, Edward Pyne, Ronitt Rubinfeld

Abstract: For a graph $G$ on $n$ vertices, naively sampling the position of a random walk at time $t$ requires work $Ω(t)$. We desire local access algorithms supporting $\text{position}(G,s,t)$ queries, which return the position of a random walk from some start vertex $s$ at time $t$, where the joint distribution of returned positions is $1/\text{poly}(n)$ close to the uniform distribution over such walks in $\ell_1$ distance. We first give an algorithm for local access to walks on undirected regular graphs with $\widetilde{O}(\frac{1}{1-λ}\sqrt{n})$ runtime per query, where $λ$ is the second-largest eigenvalue in absolute value. Since random $d$-regular graphs are expanders with high probability, this gives an $\widetilde{O}(\sqrt{n})$ algorithm for $G(n,d)$, which improves on the naive method for small numbers of queries. We then prove that no algorithm with sub-constant error given probe access to random $d$-regular graphs can have runtime better than $Ω(\sqrt{n}/\log(n))$ per query in expectation, obtaining a nearly matching lower bound. We further show an $Ω(n^{1/4})$ runtime per query lower bound even with an oblivious adversary (i.e. when the query sequence is fixed in advance). We then show that for families of graphs with additional group theoretic structure, dramatically better results can be achieved. We give local access to walks on small-degree abelian Cayley graphs, including cycles and hypercubes, with runtime $\text{polylog}(n)$ per query. This also allows for efficient local access to walks on $\text{polylog}$ degree expanders.
We extend our results to graphs constructed using the tensor product (giving local access to walks on degree $n^ε$ graphs for any $ε\in (0,1]$) and Cartesian product.

Submitted 15 February, 2021; originally announced February 2021.
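
The naive baseline mentioned in the abstract above, which spends work proportional to $t$, can be sketched as a direct step-by-step simulation. This is only that baseline, not the paper's local access algorithm, and it does not keep answers consistent across several queries about the same walk. The name `naive_position` and the adjacency-dict representation are assumptions made for illustration.

```python
import random


def naive_position(adj, s, t, rng):
    """Simulate t steps of a simple random walk that starts at vertex s.

    `adj` maps each vertex to a non-empty list of neighbors (a hypothetical
    representation for this sketch). The loop runs t times, so the work
    is proportional to t.
    """
    v = s
    for _ in range(t):
        v = rng.choice(adj[v])  # move to a uniformly random neighbor
    return v


# Example: a walk of length 7 on a 10-cycle, started at vertex 0.
cycle = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
end = naive_position(cycle, 0, 7, random.Random(0))
```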

arXiv:2012.15002

New Partitioning Techniques and Faster Algorithms for Approximate Interval Scheduling

Authors: Spencer Compton, Slobodan Mitrović, Ronitt Rubinfeld

Abstract: Interval scheduling is a basic problem in the theory of algorithms and a classical task in combinatorial optimization. We develop a set of techniques for partitioning and grouping jobs based on their starting and ending times, that enable us to view an instance of interval scheduling on many jobs as a union of multiple interval scheduling instances, each containing only a few jobs. Instantiating these techniques in dynamic and local settings of computation leads to several new results. For $(1+\varepsilon)$-approximation of job scheduling of $n$ jobs on a single machine, we develop a fully dynamic algorithm with $O(\frac{\log{n}}{\varepsilon})$ update and $O(\log{n})$ query worst-case time. Further, we design a local computation algorithm that uses only $O(\frac{\log{N}}{\varepsilon})$ queries when all jobs have length at least $1$ and have starting/ending times within $[0,N]$. Our techniques are also applicable in a setting where jobs have rewards/weights. For this case we design a fully dynamic deterministic algorithm whose worst-case update and query time are $\operatorname{poly}(\log n,\frac{1}{\varepsilon})$. Equivalently, this is the first algorithm that maintains a $(1+\varepsilon)$-approximation of the maximum independent set of a collection of weighted intervals in $\operatorname{poly}(\log n,\frac{1}{\varepsilon})$ time updates/queries.
This is an exponential improvement in $1/\varepsilon$ over the running time of a randomized algorithm of Henzinger, Neumann, and Wiese [SoCG, 2020], while also removing all dependence on the values of the jobs' starting/ending times and rewards, as well as removing the need for any randomness. We also extend our approaches for interval scheduling on a single machine to examine the setting with $M$ machines.

Submitted 23 February, 2023; v1 submitted 29 December, 2020; originally announced December 2020.

Comments: Main result (Theorem 2) has stronger guarantees, updates/queries now in $\operatorname{poly}(\log(n),\frac{1}{\varepsilon})$ time

arXiv:2010.02888

Testing Tail Weight of a Distribution Via Hazard Rate

Authors: Maryam Aliakbarpour, Amartya Shankha Biswas, Kavya Ravichandran, Ronitt Rubinfeld

Abstract: Understanding the shape of a distribution of data is of interest to people in a great variety of fields, as it may affect the types of algorithms used for that data. We study one such problem in the framework of distribution property testing, characterizing the number of samples required to distinguish whether a distribution has a certain property or is far from having that property. In particular, given samples from a distribution, we seek to characterize the tail of the distribution, that is, understand how many elements appear infrequently. We develop an algorithm based on a careful bucketing scheme that distinguishes light-tailed distributions from non-light-tailed ones with respect to a definition based on the hazard rate, under natural smoothness and ordering assumptions. We bound the number of samples required for this test to succeed with high probability in terms of the parameters of the problem, showing that it is polynomial in these parameters. Further, we prove a hardness result that implies that this problem cannot be solved without any assumptions.

Submitted 4 December, 2022; v1 submitted 6 October, 2020; originally announced October 2020.

arXiv:2008.08032

Sampling Multiple Edges Efficiently

Authors: Talya Eden, Saleet Mossel, Ronitt Rubinfeld

Abstract: We present a sublinear time algorithm that allows one to sample multiple edges from a distribution that is pointwise $ε$-close to the uniform distribution, in an \emph{amortized-efficient} fashion. We consider the adjacency list query model, where access to a graph $G$ is given via degree and neighbor queries. The problem of sampling a single edge in this model has been raised by Eden and Rosenbaum (SOSA 18). Let $n$ and $m$ denote the number of vertices and edges of $G$, respectively. Eden and Rosenbaum provided upper and lower bounds of $Θ^*(n/\sqrt m)$ for sampling a single edge in general graphs (where $O^*(\cdot)$ suppresses $\textrm{poly}(1/ε)$ and $\textrm{poly}(\log n)$ dependencies). We ask whether the query complexity lower bound for sampling a single edge can be circumvented when multiple samples are required. That is, can we get an improved amortized per-sample cost if we allow a preprocessing phase? We answer in the affirmative. We present an algorithm that, if one knows the number of required samples $q$ in advance, has an overall cost that is sublinear in $q$, namely, $O^*(\sqrt q \cdot(n/\sqrt m))$, which is strictly preferable to $O^*(q\cdot (n/\sqrt m))$ cost resulting from $q$ invocations of the algorithm by Eden and Rosenbaum. Subsequent to a preliminary version of this work, Tětek and Thorup (arXiv, preprint) proved that this bound is essentially optimal.

Submitted 19 July, 2021; v1 submitted 18 August, 2020; originally announced August 2020.

ACM Class: F.2.2; G.2.2
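
To make the adjacency list query model in the entry above concrete, here is a minimal sketch of a simple rejection sampler that uses only degree and neighbor queries. It is a textbook baseline, not the algorithm of this paper or of Eden and Rosenbaum, and it assumes a known upper bound `d_max` on the maximum degree and a graph with at least one edge; all names are hypothetical.

```python
import random


def sample_edge(degree, neighbor, n, d_max, rng):
    """Return a uniformly random directed edge (v, u) by rejection sampling.

    `degree(v)` and `neighbor(v, j)` stand in for the degree and neighbor
    queries of the model (hypothetical callables). Each (vertex, slot) pair
    is proposed with probability 1/(n * d_max), so every accepted edge is
    equally likely. Assumes d_max >= max degree and at least one edge;
    otherwise the loop would not terminate or edges would be missed.
    """
    while True:
        v = rng.randrange(n)       # uniform vertex
        j = rng.randrange(d_max)   # uniform slot in [0, d_max)
        if j < degree(v):          # accept only if the slot holds a neighbor
            return v, neighbor(v, j)


# Example usage on a small path graph 1 - 0 - 2 - 3.
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
edge = sample_edge(lambda v: len(adj[v]), lambda v, j: adj[v][j], 4, 2, random.Random(0))
```

The expected number of proposals per sample is $n \cdot d_{\max} / (2m)$, which is why this baseline is slow on graphs with a few high-degree vertices.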

arXiv:2008.03891

Rapid Approximate Aggregation with Distribution-Sensitive Interval Guarantees

Authors: Stephen Macke, Maryam Aliakbarpour, Ilias Diakonikolas, Aditya Parameswaran, Ronitt Rubinfeld

Abstract: Aggregating data is fundamental to data analytics, data exploration, and OLAP. Approximate query processing (AQP) techniques are often used to accelerate computation of aggregates using samples, for which confidence intervals (CIs) are widely used to quantify the associated error. CIs used in practice fall into two categories: techniques that are tight but not correct, i.e., they yield tight intervals but only offer asymptotic guarantees, making them unreliable, or techniques that are correct but not tight, i.e., they offer rigorous guarantees, but are overly conservative, leading to confidence intervals that are too loose to be useful. In this paper, we develop a CI technique that is both correct and tighter than traditional approaches. Starting from conservative CIs, we identify two issues they often face: pessimistic mass allocation (PMA) and phantom outlier sensitivity (PHOS). By developing a novel range-trimming technique for eliminating PHOS and pairing it with known CI techniques without PMA, we develop a technique for computing CIs with strong guarantees that requires fewer samples for the same width. We implement our techniques underneath a sampling-optimized in-memory column store and show how to accelerate queries involving aggregates on a real dataset with speedups of up to 124x over traditional AQP-with-guarantees and more than 1000x over exact methods.

Submitted 10 August, 2020; originally announced August 2020.

arXiv:2006.05028

Online Page Migration with ML Advice

Authors: Piotr Indyk, Frederik Mallmann-Trenn, Slobodan Mitrović, Ronitt Rubinfeld

Abstract: We consider online algorithms for the {\em page migration problem} that use predictions, potentially imperfect, to improve their performance. The best known online algorithms for this problem, due to Westbrook'94 and Bienkowski et al'17, have competitive ratios strictly bounded away from 1. In contrast, we show that if the algorithm is given a prediction of the input sequence, then it can achieve a competitive ratio that tends to $1$ as the prediction error rate tends to $0$. Specifically, the competitive ratio is equal to $1+O(q)$, where $q$ is the prediction error rate. We also design a ``fallback option'' that ensures that the competitive ratio of the algorithm for {\em any} input sequence is at most $O(1/q)$. Our result adds to the recent body of work that uses machine learning to improve the performance of ``classic'' algorithms.

Submitted 8 June, 2020; originally announced June 2020.

arXiv:2002.08299

Massively Parallel Algorithms for Small Subgraph Counting

Authors: Amartya Shankha Biswas, Talya Eden, Quanquan C. Liu, Slobodan Mitrović, Ronitt Rubinfeld

Abstract: Over the last two decades, frameworks for distributed-memory parallel computation, such as MapReduce, Hadoop, Spark and Dryad, have gained significant popularity with the growing prevalence of large network datasets. The Massively Parallel Computation (MPC) model is the de-facto standard for studying graph algorithms in these frameworks theoretically. Subgraph counting is one such fundamental problem in analyzing massive graphs, with the main algorithmic challenges centering on designing methods which are both scalable and accurate. Given a graph $G=(V, E)$ with $n$ vertices, $m$ edges and $T$ triangles, our first result is an algorithm that outputs a $(1+\varepsilon)$-approximation to $T$, with asymptotically \emph{optimal round and total space complexity} provided any $S \geq \max{(\sqrt m, n^2/m)}$ space per machine and assuming $T=Ω(\sqrt{m/n})$. Our result gives a quadratic improvement on the bound on $T$ over previous works. We also provide a simple extension of our result to counting \emph{any} subgraph of $k$ size for constant $k \geq 1$. Our second result is an $O_{\varepsilon}(\log \log n)$-round algorithm for exactly counting the number of triangles, whose total space usage is parametrized by the \emph{arboricity} $α$ of the input graph. We extend this result to exactly counting $k$-cliques for any constant $k$. Finally, we prove that a recent result of Bera, Pashanasangi and Seshadhri (ITCS 2020) for exactly counting all subgraphs of size at most $5$ can be implemented in the MPC model in total space.

Submitted 18 July, 2022; v1 submitted 19 February, 2020; originally announced February 2020.

Comments: Abstract truncated per arXiv requirements

arXiv:2002.03415

doi 10.4230/LIPIcs.ITCS.2020.28

Monotone probability distributions over the Boolean cube can be learned with sublinear samples

Authors: Ronitt Rubinfeld, Arsen Vasilyan

Abstract: A probability distribution over the Boolean cube is monotone if flipping the value of a coordinate from zero to one can only increase the probability of an element. Given samples of an unknown monotone distribution over the Boolean cube, we give (to our knowledge) the first algorithm that learns an approximation of the distribution in statistical distance using a number of samples that is sublinear in the domain. To do this, we develop a structural lemma describing monotone probability distributions. The structural lemma has further implications to the sample complexity of basic testing tasks for analyzing monotone probability distributions over the Boolean cube: We use it to give nontrivial upper bounds on the tasks of estimating the distance of a monotone distribution to uniform and of estimating the support size of a monotone distribution. In the setting of monotone probability distributions over the Boolean cube, our algorithms are the first to have sample complexity lower than known lower bounds for the same testing tasks on arbitrary (not necessarily monotone) probability distributions. One further consequence of our learning algorithm is an improved sample complexity for the task of testing whether a distribution on the Boolean cube is monotone.

Submitted 9 February, 2020; originally announced February 2020.

arXiv:1910.14154

Improved Local Computation Algorithm for Set Cover via Sparsification

Authors: Christoph Grunau, Slobodan Mitrović, Ronitt Rubinfeld, Ali Vakilian

Abstract: We design a Local Computation Algorithm (LCA) for the set cover problem. Given a set system where each set has size at most $s$ and each element is contained in at most $t$ sets, the algorithm reports whether a given set is in some fixed set cover whose expected size is $O(\log{s})$ times the minimum fractional set cover value. Our algorithm requires $s^{O(\log{s})} t^{O(\log{s} \cdot (\log \log{s} + \log \log{t}))}$ queries. This result improves upon the application of the reduction of [Parnas and Ron, TCS'07] on the result of [Kuhn et al., SODA'06], which leads to a query complexity of $(st)^{O(\log{s} \cdot \log{t})}$. To obtain this result, we design a parallel set cover algorithm that admits an efficient simulation in the LCA model by using a sparsification technique introduced in [Ghaffari and Uitto, SODA'19] for the maximal independent set problem. The parallel algorithm adds a random subset of the sets to the solution in a style similar to the PRAM algorithm of [Berger et al., FOCS'89]. However, our algorithm differs in that it never revokes its decisions, which results in fewer adaptive rounds. This requires a novel approximation analysis which might be of independent interest.

Submitted 5 November, 2019; v1 submitted 30 October, 2019; originally announced October 2019.

Comments: To appear in ACM-SIAM Symposium on Discrete Algorithms (SODA 2020)

arXiv:1907.03190 [pdf, ps, other]

Testing Mixtures of Discrete Distributions

Authors: Maryam Aliakbarpour, Ravi Kumar, Ronitt Rubinfeld

Abstract: There has been significant study on the sample complexity of testing properties of distributions over large domains. For many properties, it is known that the sample complexity can be substantially smaller than the domain size. For example, over a domain of size $n$, distinguishing the uniform distribution from distributions that are far from uniform in $\ell_1$-distance uses only $O(\sqrt{n})$ samples. However, the picture is very different in the presence of arbitrary noise, even when the amount of noise is quite small. In this case, one must distinguish if samples are coming from a distribution that is $ε$-close to uniform from the case where the distribution is $(1-ε)$-far from uniform. The latter task requires nearly linear in $n$ samples [Valiant 2008, Valiant and Valiant 2011]. In this work, we present a noise model that on one hand is more tractable for the testing problem, and on the other hand represents a rich class of noise families. In our model, the noisy distribution is a mixture of the original distribution and noise, where the latter is known to the tester either explicitly or via sample access; the form of the noise is also known a priori. Focusing on the identity and closeness testing problems leads to the following mixture testing question: Given samples of distributions $p, q_1,q_2$, can we test if $p$ is a mixture of $q_1$ and $q_2$? We consider this general question in various scenarios that differ in terms of how the tester can access the distributions, and show that indeed this problem is more tractable. Our results show that the sample complexity of our testers is exactly the same as for the classical non-mixture case.

Submitted 6 July, 2019; originally announced July 2019.

Comments: Appeared in COLT 2019

arXiv:1907.03182 [pdf, ps, other]

Towards Testing Monotonicity of Distributions Over General Posets

Authors: Maryam Aliakbarpour, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, Anak Yodpinyanee

Abstract: In this work, we consider the sample complexity required for testing the monotonicity of distributions over partial orders. A distribution $p$ over a poset is monotone if, for any pair of domain elements $x$ and $y$ such that $x \preceq y$, $p(x) \leq p(y)$. To understand the sample complexity of this problem, we introduce a new property called bigness over a finite domain, where the distribution is $T$-big if the minimum probability for any domain element is at least $T$. We establish a lower bound of $Ω(n/\log n)$ for testing bigness of distributions on domains of size $n$. We then build on these lower bounds to give $Ω(n/\log{n})$ lower bounds for testing monotonicity over a matching poset of size $n$ and significantly improved lower bounds over the hypercube poset. We give sublinear sample complexity bounds for testing bigness and for testing monotonicity over the matching poset. We then give a number of tools for analyzing upper bounds on the sample complexity of the monotonicity testing problem.

Submitted 6 July, 2019; originally announced July 2019.

Comments: Appeared in COLT 2019

arXiv:1904.06745 [pdf, ps, other]

Approximating the noise sensitivity of a monotone Boolean function

Authors: Ronitt Rubinfeld, Arsen Vasilyan

Abstract: The noise sensitivity of a Boolean function $f: \{0,1\}^n \rightarrow \{0,1\}$ is one of its fundamental properties. A function of a positive noise parameter $δ$, it is denoted as $NS_δ[f]$. Here we study the algorithmic problem of approximating it for monotone $f$, such that $NS_δ[f] \geq 1/n^{C}$ for constant $C$, and where $δ$ satisfies $1/n \leq δ \leq 1/2$. For such $f$ and $δ$, we give a randomized algorithm performing $O\left(\frac{\min(1,\sqrt{n} δ \log^{1.5} n)}{NS_δ[f]} \text{poly}\left(\frac{1}{ε}\right)\right)$ queries and approximating $NS_δ[f]$ to within a multiplicative factor of $(1\pm ε)$. Given the same constraints on $f$ and $δ$, we also prove a lower bound of $Ω\left(\frac{\min(1,\sqrt{n} δ)}{NS_δ[f] \cdot n^ξ}\right)$ on the query complexity of any algorithm that approximates $NS_δ[f]$ to within any constant factor, where $ξ$ can be any positive constant. Thus, our algorithm's query complexity is close to optimal in terms of its dependence on $n$. We introduce a novel descending-ascending view of noise sensitivity, and use it as a central tool for the analysis of our algorithm. To prove lower bounds on query complexity, we develop a technique that reduces computational questions about query complexity to combinatorial questions about the existence of "thin" functions with certain properties. The existence of such "thin" functions is proved using the probabilistic method. These techniques also yield previously unknown lower bounds on the query complexity of approximating other fundamental properties of Boolean functions: the total influence and the bias.

Submitted 14 April, 2019; originally announced April 2019.

arXiv:1902.08266 [pdf, other]

Local Computation Algorithms for Spanners

Authors: Merav Parter, Ronitt Rubinfeld, Ali Vakilian, Anak Yodpinyanee

Abstract: A graph spanner is a fundamental graph structure that faithfully preserves the pairwise distances in the input graph up to a small multiplicative stretch. The common objective in the computation of spanners is to achieve the best-known existential size-stretch trade-off efficiently. Classical models and algorithmic analysis of graph spanners essentially assume that the algorithm can read the input graph, construct the desired spanner, and write the answer to the output tape. However, when considering massive graphs containing millions or even billions of nodes, not only the input graph but also the output spanner might be too large for a single processor to store. To tackle this challenge, we initiate the study of local computation algorithms (LCAs) for graph spanners in general graphs, where the algorithm should locally decide whether a given edge $(u,v) \in E$ belongs to the output spanner. Such LCAs give the user the "illusion" that a specific sparse spanner for the graph is maintained, without ever fully computing it. We present the following results:
- For general $n$-vertex graphs and $r \in \{2,3\}$, there exists an LCA for $(2r-1)$-spanners with $\widetilde{O}(n^{1+1/r})$ edges and sublinear probe complexity of $\widetilde{O}(n^{1-1/2r})$. These size/stretch tradeoffs are best possible (up to polylogarithmic factors).
- For every $k \geq 1$ and $n$-vertex graph with maximum degree $Δ$, there exists an LCA for $O(k^2)$ spanners with $\widetilde{O}(n^{1+1/k})$ edges, probe complexity of $\widetilde{O}(Δ^4 n^{2/3})$, and random seed of size $\mathrm{polylog}(n)$. This improves upon and extends the work of [Lenzen-Levi, 2018].
We also complement our results by providing a polynomial lower bound on the probe complexity of LCAs for graph spanners that holds even for the simpler task of computing a sparse connected subgraph with $o(m)$ edges.

Submitted 21 February, 2019; originally announced February 2019.

Comments: An extended abstract appeared in the proceedings of ITCS 2019

arXiv:1902.03534 [pdf, ps, other]

Set Cover in Sub-linear Time

Authors: Piotr Indyk, Sepideh Mahabadi, Ronitt Rubinfeld, Ali Vakilian, Anak Yodpinyanee

Abstract: We study the classic set cover problem from the perspective of sub-linear algorithms. Given access to a collection of $m$ sets over $n$ elements in the query model, we show that sub-linear algorithms derived from existing techniques have almost tight query complexities. On one hand, first we show an adaptation of the streaming algorithm presented in Har-Peled et al. [2016] to the sub-linear query model, that returns an $α$-approximate cover using $\tilde{O}(m(n/k)^{1/(α-1)} + nk)$ queries to the input, where $k$ denotes the value of a minimum set cover. We then complement this upper bound by proving that for lower values of $k$, the required number of queries is $\tilde{Ω}(m(n/k)^{1/(2α)})$, even for estimating the optimal cover size. Moreover, we prove that even checking whether a given collection of sets covers all the elements would require $Ω(nk)$ queries. These two lower bounds provide strong evidence that the upper bound is almost tight for certain values of the parameter $k$. On the other hand, we show that this bound is not optimal for larger values of the parameter $k$, as there exists a $(1+\varepsilon)$-approximation algorithm with $\tilde{O}(mn/k\varepsilon^2)$ queries. We show that this bound is essentially tight for sufficiently small constant $\varepsilon$, by establishing a lower bound of $\tilde{Ω}(mn/k)$ query complexity.

Submitted 9 February, 2019; originally announced February 2019.

arXiv:1802.08237 [pdf, ps, other]

Improved Massively Parallel Computation Algorithms for MIS, Matching, and Vertex Cover

Authors: Mohsen Ghaffari, Themis Gouleakis, Christian Konrad, Slobodan Mitrović, Ronitt Rubinfeld

Abstract: We present $O(\log\log n)$-round algorithms in the Massively Parallel Computation (MPC) model, with $\tilde{O}(n)$ memory per machine, that compute a maximal independent set, a $1+ε$ approximation of maximum matching, and a $2+ε$ approximation of minimum vertex cover, for any $n$-vertex graph and any constant $ε>0$. These improve the state of the art as follows:
- Our MIS algorithm leads to a simple $O(\log\log Δ)$-round MIS algorithm in the Congested Clique model of distributed computing, which improves on the $\tilde{O}(\sqrt{\log Δ})$-round algorithm of Ghaffari [PODC'17].
- Our $O(\log\log n)$-round $(1+ε)$-approximate maximum matching algorithm simplifies or improves on the following prior work: $O(\log^2\log n)$-round $(1+ε)$-approximation algorithm of Czumaj et al. [STOC'18] and $O(\log\log n)$-round $(1+ε)$-approximation algorithm of Assadi et al. [SODA'19].
- Our $O(\log\log n)$-round $(2+ε)$-approximate minimum vertex cover algorithm improves on an $O(\log\log n)$-round $O(1)$-approximation of Assadi et al. [arXiv'17].

Submitted 17 March, 2022; v1 submitted 22 February, 2018; originally announced February 2018.

arXiv:1711.10692 [pdf, other]

Local Access to Huge Random Objects through Partial Sampling

Authors: Amartya Shankha Biswas, Ronitt Rubinfeld, Anak Yodpinyanee

Abstract: Consider an algorithm performing a computation on a huge random object. Is it necessary to generate the entire object up front, or is it possible to provide query access to the object and sample it incrementally "on-the-fly"? Such an implementation should emulate the object by answering queries in a manner consistent with a random instance sampled from the true distribution. Our first set of results focuses on undirected graphs with independent edge probabilities, under certain assumptions. Then, we use this to obtain the first efficient implementations for the Erdos-Renyi model and the Stochastic Block model. As in previous local-access implementations for random graphs, we support Vertex-Pair and Next-Neighbor queries. We also introduce a new Random-Neighbor query. Next, we show how to implement random Catalan objects, specifically focusing on Dyck paths (always positive random walks on the integer line). Here, we support Height queries to find the position of the walk, and First-Return queries to find the time when the walk returns to a specified height. This in turn can be used to implement Next-Neighbor queries on random rooted/binary trees, and Matching-Bracket queries on random well bracketed expressions. Finally, we define a new model that: (1) allows multiple independent simultaneous instantiations of the same implementation to be consistent with each other without communication; (2) allows us to generate a richer class of random objects that do not have a succinct description. Specifically, we study uniformly random valid $q$-colorings of an input graph $G$ with max degree $Δ$. The distribution over valid colorings is specified via a "huge" underlying graph $G$, that is far too large to be read in sub-linear time. Instead, we access $G$ through local neighborhood probes. We are able to answer queries to the color of any vertex in sub-linear time for $q > 9Δ$.

Submitted 5 December, 2020; v1 submitted 29 November, 2017; originally announced November 2017.

arXiv:1707.05497 [pdf, other]

Differentially Private Identity and Closeness Testing of Discrete Distributions

Authors: Maryam Aliakbarpour, Ilias Diakonikolas, Ronitt Rubinfeld

Abstract: We investigate the problems of identity and closeness testing over a discrete population from random samples. Our goal is to develop efficient testers while guaranteeing Differential Privacy to the individuals of the population. We describe an approach that yields sample-efficient differentially private testers for these problems. Our theoretical results show that there exist private identity and closeness testers that are nearly as sample-efficient as their non-private counterparts. We perform an experimental evaluation of our algorithms on synthetic data. Our experiments illustrate that our private testers achieve small type I and type II errors with sample size sublinear in the domain size of the underlying distributions.

Submitted 18 July, 2017; originally announced July 2017.

Comments: Submitted, May 2017

arXiv:1604.07038 [pdf, ps, other]

A Local Algorithm for Constructing Spanners in Minor-Free Graphs

Authors: Reut Levi, Dana Ron, Ronitt Rubinfeld

Abstract: Constructing a spanning tree of a graph is one of the most basic tasks in graph theory. We consider this problem in the setting of local algorithms: one wants to quickly determine whether a given edge $e$ is in a specific spanning tree, without computing the whole spanning tree, but rather by inspecting the local neighborhood of $e$. The challenge is to maintain consistency, that is, to answer queries about different edges according to the same spanning tree. Since it is known that this problem cannot be solved without essentially viewing all the graph, we consider the relaxed version of finding a spanning subgraph with $(1+ε)n$ edges (where $n$ is the number of vertices and $ε$ is a given sparsity parameter). It is known that this relaxed problem requires inspecting $Ω(\sqrt{n})$ edges in general graphs, which motivates the study of natural restricted families of graphs. One such family is the family of graphs with an excluded minor. For this family there is an algorithm that achieves constant success probability, and inspects $(d/ε)^{poly(h)\log(1/ε)}$ edges (for each edge it is queried on), where $d$ is the maximum degree in the graph and $h$ is the size of the excluded minor. The distances between pairs of vertices in the spanning subgraph $G'$ are at most a factor of $poly(d, 1/ε, h)$ larger than in $G$. In this work, we show that for an input graph that is $H$-minor free for any $H$ of size $h$, this task can be performed by inspecting only $poly(d, 1/ε, h)$ edges. The distances between pairs of vertices in the spanning subgraph $G'$ are at most a factor of $\tilde{O}(h\log(d)/ε)$ larger than in $G$. Furthermore, the error probability of the new algorithm is significantly improved to $Θ(1/n)$. This algorithm can also be easily adapted to yield an efficient algorithm for the distributed setting.

Submitted 24 April, 2016; originally announced April 2016.

Comments: arXiv admin note: substantial text overlap with arXiv:1402.3609

arXiv:1601.04233 [pdf, other]

Sublinear-Time Algorithms for Counting Star Subgraphs with Applications to Join Selectivity Estimation

Authors: Maryam Aliakbarpour, Amartya Shankha Biswas, Themistoklis Gouleakis, John Peebles, Ronitt Rubinfeld, Anak Yodpinyanee

Abstract: We study the problem of estimating the value of sums of the form $S_p \triangleq \sum \binom{x_i}{p}$ when one has the ability to sample $x_i \geq 0$ with probability proportional to its magnitude. When $p=2$, this problem is equivalent to estimating the selectivity of a self-join query in database systems when one can sample rows randomly. We also study the special case when $\{x_i\}$ is the degree sequence of a graph, which corresponds to counting the number of $p$-stars in a graph when one has the ability to sample edges randomly. Our algorithm for a $(1 \pm \varepsilon)$-multiplicative approximation of $S_p$ has query and time complexities $O(\frac{m \log \log n}{ε^2 S_p^{1/p}})$. Here, $m=\sum x_i/2$ is the number of edges in the graph, or equivalently, half the number of records in the database table. Similarly, $n$ is the number of vertices in the graph and the number of unique values in the database table. We also provide tight lower bounds (up to polylogarithmic factors) in almost all cases, even when $\{x_i\}$ is a degree sequence and one is allowed to use the structure of the graph to try to get a better estimate. We are not aware of any prior lower bounds on the problem of join selectivity estimation. For the graph problem, prior work which assumed the ability to sample only vertices uniformly gave algorithms with matching lower bounds [Gonen, Ron, and Shavitt. SIAM J. Comput., 25 (2011), pp. 1365-1411]. With the ability to sample edges randomly, we show that one can achieve faster algorithms for approximating the number of star subgraphs, bypassing the lower bounds in this prior work. For example, in the regime where $S_p\leq n$, and $p=2$, our upper bound is $\tilde{O}(n/S_p^{1/2})$, in contrast to their $Ω(n/S_p^{1/3})$ lower bound when no random edge queries are available.
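As a small illustration of the quantity defined in this abstract, the Python sketch below computes $S_p = \sum \binom{x_i}{p}$ exactly from a degree sequence, which counts $p$-stars when the $x_i$ are vertex degrees. It is a brute-force reference that reads every entry, not the paper's sublinear estimator; the function name `count_p_stars` and the example graph are assumptions made for illustration only.

```python
from math import comb


def count_p_stars(degrees, p):
    """Exact S_p = sum_i C(x_i, p) for a degree sequence (reads every entry)."""
    return sum(comb(x, p) for x in degrees)


# Hypothetical example: a star graph with one center of degree 3 and three leaves.
degrees = [3, 1, 1, 1]
m = sum(degrees) // 2            # number of edges, m = (sum of x_i) / 2
s2 = count_p_stars(degrees, 2)   # number of 2-stars (paths with two edges)
```

Only the center contributes here, since `comb(1, 2)` is zero for each leaf.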

Submitted 16 January, 2016; originally announced January 2016.

Comments: 21 pages

arXiv:1507.03558 [pdf, ps, other]

Testing Shape Restrictions of Discrete Distributions

Authors: Clément L. Canonne, Ilias Diakonikolas, Themis Gouleakis, Ronitt Rubinfeld

Abstract: We study the question of testing structured properties (classes) of discrete distributions. Specifically, given sample access to an arbitrary distribution $D$ over $[n]$ and a property $\mathcal{P}$, the goal is to distinguish between $D\in\mathcal{P}$ and $\ell_1(D,\mathcal{P})>\varepsilon$. We develop a general algorithm for this question, which applies to a large range of "shape-constrained" properties, including monotone, log-concave, $t$-modal, piecewise-polynomial, and Poisson Binomial distributions. Moreover, for all cases considered, our algorithm has near-optimal sample complexity with regard to the domain size and is computationally efficient. For most of these classes, we provide the first non-trivial tester in the literature. In addition, we also describe a generic method to prove lower bounds for this problem, and use it to show our upper bounds are nearly tight. Finally, we extend some of our techniques to tolerant testing, deriving nearly-tight upper and lower bounds for the corresponding questions.

Submitted 21 January, 2016; v1 submitted 13 July, 2015; originally announced July 2015.

arXiv:1504.06544 [pdf, ps, other]

Sampling Correctors

Authors: Clément Canonne, Themis Gouleakis, Ronitt Rubinfeld

Abstract: In many situations, sample data is obtained from a noisy or imperfect source. In order to address such corruptions, this paper introduces the concept of a sampling corrector. Such algorithms use structure that the distribution is purported to have, in order to allow one to make "on-the-fly" corrections to samples drawn from probability distributions. These algorithms then act as filters between the noisy data and the end user. We show connections between sampling correctors, distribution learning algorithms, and distribution property testing algorithms. We show that these connections can be utilized to expand the applicability of known distribution learning and property testing algorithms as well as to achieve improved algorithms for those tasks. As a first step, we show how to design sampling correctors using proper learning algorithms. We then focus on the question of whether algorithms for sampling correctors can be more efficient in terms of sample complexity than learning algorithms for the analogous families of distributions. When correcting monotonicity, we show that this is indeed the case when also granted query access to the cumulative distribution function. We also obtain sampling correctors for monotonicity without this stronger type of access, provided that the distribution be originally very close to monotone (namely, at a distance $O(1/\log^2 n)$). In addition to that, we consider a restricted error model that aims at capturing "missing data" corruptions. In this model, we show that distributions that are close to monotone have sampling correctors that are significantly more efficient than achievable by the learning approach. We also consider the question of whether an additional source of independent random bits is required by sampling correctors to implement the correction process.

Submitted 31 March, 2018; v1 submitted 24 April, 2015; originally announced April 2015.

arXiv:1502.04022 [pdf, ps, other]

Local Computation Algorithms for Graphs of Non-Constant Degrees

Authors: Reut Levi, Ronitt Rubinfeld, Anak Yodpinyanee

Abstract: In the model of local computation algorithms (LCAs), we aim to compute the queried part of the output by examining only a small (sublinear) portion of the input. Many recently developed LCAs on graph problems achieve time and space complexities with very low dependence on $n$, the number of vertices. Nonetheless, these complexities are generally at least exponential in $d$, the upper bound on the degree of the input graph. Instead, we consider the case where parameter $d$ can be moderately dependent on $n$, and aim for complexities with subexponential dependence on $d$, while maintaining polylogarithmic dependence on $n$. We present: a randomized LCA for computing maximal independent sets whose time and space complexities are quasi-polynomial in $d$ and polylogarithmic in $n$; for constant $ε > 0$, a randomized LCA that provides a $(1-ε)$-approximation to maximum matching whose time and space complexities are polynomial in $d$ and polylogarithmic in $n$.

Submitted 13 February, 2015; originally announced February 2015.

arXiv:1502.00413 [pdf, ps, other]

Constructing Near Spanning Trees with Few Local Inspections

Authors: Reut Levi, Guy Moshkovitz, Dana Ron, Ronitt Rubinfeld, Asaf Shapira

Abstract: Constructing a spanning tree of a graph is one of the most basic tasks in graph theory. Motivated by several recent studies of local graph algorithms, we consider the following variant of this problem. Let $G$ be a connected bounded-degree graph. Given an edge $e$ in $G$ we would like to decide whether $e$ belongs to a connected subgraph $G'$ consisting of $(1+ε)n$ edges (for a prespecified constant $ε>0$), where the decision for different edges should be consistent with the same subgraph $G'$. Can this task be performed by inspecting only a constant number of edges in $G$? Our main results are: (1) We show that if every $t$-vertex subgraph of $G$ has expansion $1/(\log t)^{1+o(1)}$ then one can (deterministically) construct a sparse spanning subgraph $G'$ of $G$ using few inspections. To this end we analyze a "local" version of a famous minimum-weight spanning tree algorithm. (2) We show that the above expansion requirement is sharp even when allowing randomization. To this end we construct a family of $3$-regular graphs of high girth, in which every $t$-vertex subgraph has expansion $1/(\log t)^{1-o(1)}$.

Submitted 3 February, 2015; v1 submitted 2 February, 2015; originally announced February 2015.

Comments: References fixed

arXiv:1412.5484 [pdf, ps, other]

doi 10.1007/s00224-015-9639-z

A Self-Tester for Linear Functions over the Integers with an Elementary Proof of Correctness

Authors: Sheela Devadas, Ronitt Rubinfeld

Abstract: We present simple, self-contained proofs of correctness for algorithms for linearity testing and program checking of linear functions on finite subsets of integers represented as n-bit numbers. In addition we explore a generalization of self-testing to homomorphisms on a multidimensional vector space. We show that our self-testing algorithm for the univariate case can be directly generalized to vector space domains. The number of queries made by our algorithms is independent of domain size.

Submitted 22 June, 2015; v1 submitted 17 December, 2014; originally announced December 2014.

arXiv:1412.3040 [pdf, other]

Rapid Sampling for Visualizations with Ordering Guarantees

Authors: Albert Kim, Eric Blais, Aditya Parameswaran, Piotr Indyk, Sam Madden, Ronitt Rubinfeld

Abstract: Visualizations are frequently used as a means to understand trends and gather insights from datasets, but often take a long time to generate. In this paper, we focus on the problem of rapidly generating approximate visualizations while preserving crucial visual properties of interest to analysts. Our primary focus will be on sampling algorithms that preserve the visual property of ordering; our techniques will also apply to some other visual properties. For instance, our algorithms can be used to generate an approximate visualization of a bar chart very rapidly, where the comparisons between any two bars are correct. We formally show that our sampling algorithms are generally applicable and provably optimal in theory, in that they do not take more samples than necessary to generate the visualizations with ordering guarantees. They also work well in practice, correctly ordering output groups while taking orders of magnitude fewer samples and much less time than conventional sampling schemes.

Submitted 9 December, 2014; originally announced December 2014.

Comments: Tech Report. 17 pages. Condensed version to appear in VLDB Vol. 8 No. 5

arXiv:1402.3835 [pdf, ps, other]

Testing probability distributions underlying aggregated data

Authors: Clément Canonne, Ronitt Rubinfeld

Abstract: In this paper, we analyze and study a hybrid model for testing and learning probability distributions. Here, in addition to samples, the testing algorithm is provided with one of two different types of oracles to the unknown distribution $D$ over $[n]$. More precisely, we define both the dual and cumulative dual access models, in which the algorithm $A$ can both sample from $D$ and, respectively, for any $i\in[n]$:
- query the probability mass $D(i)$ (query access); or
- get the total mass of $\{1,\dots,i\}$, i.e. $\sum_{j=1}^i D(j)$ (cumulative access).
These two models, by generalizing the previously studied sampling and query oracle models, allow us to bypass the strong lower bounds established for a number of problems in these settings, while capturing several interesting aspects of these problems -- and providing new insight on the limitations of the models. Finally, we show that while the testing algorithms can be in most cases strictly more efficient, some tasks remain hard even with this additional power.

Submitted 16 February, 2014; originally announced February 2014.
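
To make the two access models concrete, here is a minimal sketch of an oracle object exposing the three operations named in the abstract: sampling from $D$, querying $D(i)$, and querying the cumulative mass $\sum_{j=1}^i D(j)$. The class name `DualOracle` and its method names are hypothetical and chosen only for illustration. The sketch stores the whole distribution explicitly, which a real tester would not be able to see; it only shows the interface a testing algorithm would interact with.

```python
import bisect
import itertools
import random


class DualOracle:
    """Illustrative oracle for a distribution D over [n] = {1, ..., n}."""

    def __init__(self, probs, seed=0):
        self.probs = list(probs)                      # probs[i-1] = D(i)
        self.cdf = list(itertools.accumulate(self.probs))
        self.rng = random.Random(seed)

    def sample(self):
        """Draw an element of [n] according to D (inverse-CDF sampling)."""
        u = self.rng.random() * self.cdf[-1]
        idx = bisect.bisect_right(self.cdf, u)
        return min(idx, len(self.probs) - 1) + 1

    def query(self, i):
        """Query access: return the probability mass D(i)."""
        return self.probs[i - 1]

    def cumulative(self, i):
        """Cumulative access: return D(1) + ... + D(i)."""
        return self.cdf[i - 1]


if __name__ == "__main__":
    oracle = DualOracle([0.5, 0.25, 0.25])
    print(oracle.sample(), oracle.query(1), oracle.cumulative(2))
```

In the dual model an algorithm would use only `sample` and `query`; in the cumulative dual model it would additionally use `cumulative`.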

arXiv:1402.3609 [pdf, ps, other]

Local Algorithms for Sparse Spanning Graphs

Authors: Reut Levi, Dana Ron, Ronitt Rubinfeld

Abstract: Constructing a spanning tree of a graph is one of the most basic tasks in graph theory. We consider a relaxed version of this problem in the setting of local algorithms. The relaxation is that the constructed subgraph is a sparse spanning subgraph containing at most $(1+ε)n$ edges (where $n$ is the number of vertices and $ε$ is a given approximation/sparsity parameter). In the local setting, the goal is to quickly determine whether a given edge $e$ belongs to such a subgraph, without constructing the whole subgraph, but rather by inspecting (querying) the local neighborhood of $e$. The challenge is to maintain consistency, that is, to provide answers concerning different edges according to the same spanning subgraph. We first show that for general bounded-degree graphs, the query complexity of any such algorithm must be $Ω(\sqrt{n})$. This lower bound holds for constant-degree graphs that have high expansion. Next we design an algorithm for (bounded-degree) graphs with high expansion, obtaining a result that roughly matches the lower bound. We then turn to study graphs that exclude a fixed minor (and are hence non-expanding). We design an algorithm for such graphs, which may have an unbounded maximum degree. The query complexity of this algorithm is $poly(1/ε, h)$ (independent of $n$ and the maximum degree), where $h$ is the number of vertices in the excluded minor. Though our two algorithms are designed for very different types of graphs (and have very different complexities), at a high level there are several similarities, and we highlight both the similarities and the differences.

Submitted 27 April, 2021; v1 submitted 14 February, 2014; originally announced February 2014.

Comments: Upper bounds for expanding graphs and minor free graphs

arXiv:1301.2495 [pdf, ps, other]

A simple online competitive adaptation of Lempel-Ziv compression with efficient random access support

Authors: Akashnil Dutta, Reut Levi, Dana Ron, Ronitt Rubinfeld

Abstract: We present a simple adaptation of the Lempel-Ziv '78 (LZ78) compression scheme (IEEE Transactions on Information Theory, 1978) that supports efficient random access to the input string. Namely, given query access to the compressed string, it is possible to efficiently recover any symbol of the input string. The compression algorithm is given as input a parameter $\epsilon >0$, and with very high probability increases the length of the compressed string by at most a factor of $(1+\epsilon)$. The access time is $O(\log n + 1/\epsilon^2)$ in expectation, and $O(\log n/\epsilon^2)$ with high probability. The scheme relies on sparse transitive-closure spanners. Any (consecutive) substring of the input string can be retrieved at an additional additive cost in the running time of the length of the substring. We also formally establish the necessity of modifying LZ78 so as to allow efficient random access. Specifically, we construct a family of strings for which $Ω(n/\log n)$ queries to the LZ78-compressed string are required in order to recover a single symbol in the input string. The main benefit of the proposed scheme is that it preserves the online nature and simplicity of LZ78, and that for every input string, the length of the compressed string is only a small factor larger than that obtained by running LZ78.

Submitted 11 January, 2013; originally announced January 2013.

arXiv:1208.2956 [pdf, ps, other]

Local reconstructors and tolerant testers for connectivity and diameter

Authors: Andrea Campagna, Alan Guo, Ronitt Rubinfeld

Abstract: A local property reconstructor for a graph property is an algorithm which, given oracle access to the adjacency list of a graph that is "close" to having the property, provides oracle access to the adjacency matrix of a "correction" of the graph, i.e. a graph which has the property and is close to the given graph. For this model, we achieve local property reconstructors for the properties of connectivity and $k$-connectivity in undirected graphs, and the property of strong connectivity in directed graphs. Along the way, we present a method of transforming a local reconstructor (which acts as an "adjacency matrix oracle" for the corrected graph) into an "adjacency list oracle". This allows us to recursively use our local reconstructor for $(k-1)$-connectivity to obtain a local reconstructor for $k$-connectivity. We also extend this notion of local property reconstruction to parametrized graph properties (for instance, having diameter at most $D$ for some parameter $D$) and require that the corrected graph has the property with parameter close to the original. We obtain a local reconstructor for the low diameter property, where if the original graph is close to having diameter $D$, then the corrected graph has diameter roughly $2D$. We also exploit a connection between local property reconstruction and property testing, observed by Brakerski, to obtain new tolerant property testers for all of the aforementioned properties. Except for the one for connectivity, these are the first tolerant property testers for these properties.

Submitted 21 June, 2013; v1 submitted 14 August, 2012; originally announced August 2012.

Comments: 21 pages, updated abstract, improved exposition

arXiv:1110.1079 [pdf, ps, other]

A Near-Optimal Sublinear-Time Algorithm for Approximating the Minimum Vertex Cover Size

Authors: Krzysztof Onak, Dana Ron, Michal Rosen, Ronitt Rubinfeld

Abstract: We give a nearly optimal sublinear-time algorithm for approximating the size of a minimum vertex cover in a graph G. The algorithm may query the degree deg(v) of any vertex v of its choice, and for each 1 <= i <= deg(v), it may ask for the i-th neighbor of v. Letting VC_opt(G) denote the minimum size of vertex cover in G, the algorithm outputs, with high constant success probability, an estimate VC_estimate(G) such that VC_opt(G) <= VC_estimate(G) <= 2 * VC_opt(G) + epsilon*n, where epsilon is a given additive approximation parameter. We refer to such an estimate as a (2,epsilon)-estimate. The query complexity and running time of the algorithm are ~O(avg_deg * poly(1/epsilon)), where avg_deg denotes the average vertex degree in the graph. The best previously known sublinear algorithm, of Yoshida et al. (STOC 2009), has query complexity and running time O(d^4/epsilon^2), where d is the maximum degree in the graph. Given the lower bound of Omega(avg_deg) (for constant epsilon) for obtaining such an estimate (with any constant multiplicative factor) due to Parnas and Ron (TCS 2007), our result is nearly optimal. In the case that the graph is dense, that is, the number of edges is Theta(n^2), we consider another model, in which the algorithm may ask, for any pair of vertices u and v, whether there is an edge between u and v. We show how to adapt the algorithm that uses neighbor queries to this model and obtain an algorithm that outputs a (2,epsilon)-estimate of the size of a minimum vertex cover whose query complexity and running time are ~O(n) * poly(1/epsilon).

Submitted 5 October, 2011; originally announced October 2011.

arXiv:1109.6178 [pdf, ps, other]

Space-efficient Local Computation Algorithms

Authors: Noga Alon, Ronitt Rubinfeld, Shai Vardi, Ning Xie

Abstract: Recently Rubinfeld et al. (ICS 2011, pp. 223--238) proposed a new model of sublinear algorithms called local computation algorithms. In this model, a computation problem $F$ may have more than one legal solution and each of them consists of many bits. The local computation algorithm for $F$ should answer in an online fashion, for any index $i$, the $i^{\mathrm{th}}$ bit of some legal solution of $F$. Further, all the answers given by the algorithm should be consistent with at least one solution of $F$. In this work, we continue the study of local computation algorithms. In particular, we develop a technique which under certain conditions can be applied to construct local computation algorithms that run not only in polylogarithmic time but also in polylogarithmic space. Moreover, these local computation algorithms are easily parallelizable and can answer all parallel queries consistently. Our main technical tools are pseudorandom numbers with bounded independence and the theory of branching processes.

Submitted 29 November, 2011; v1 submitted 28 September, 2011; originally announced September 2011.

arXiv:1104.1377 [pdf, ps, other]

Fast Local Computation Algorithms

Authors: Ronitt Rubinfeld, Gil Tamir, Shai Vardi, Ning Xie

Abstract: For input $x$, let $F(x)$ denote the set of outputs that are the "legal" answers for a computational problem $F$. Suppose $x$ and members of $F(x)$ are so large that there is not time to read them in their entirety. We propose a model of local computation algorithms which for a given input $x$, support queries by a user to values of specified locations $y_i$ in a legal output $y \in F(x)$. When more than one legal output $y$ exists for a given $x$, the local computation algorithm should output in a way that is consistent with at least one such $y$. Local computation algorithms are intended to distill the common features of several concepts that have appeared in various algorithmic subfields, including local distributed computation, local algorithms, locally decodable codes, and local reconstruction. We develop a technique, based on known constructions of small sample spaces of $k$-wise independent random variables and Beck's analysis in his algorithmic approach to the Lovász Local Lemma, which under certain conditions can be applied to construct local computation algorithms that run in polylogarithmic time and space. We apply this technique to maximal independent set computations, scheduling radio network broadcasts, hypergraph coloring and satisfying $k$-SAT formulas.

Submitted 7 April, 2011; originally announced April 2011.

Comments: A preliminary version of this paper appeared in ICS 2011, pp. 223-238

arXiv:1101.5345 [pdf, ps, other]

Approximating the Influence of a monotone Boolean function in O(\sqrt{n}) query complexity

Authors: Dana Ron, Ronitt Rubinfeld, Muli Safra, Omri Weinstein

Abstract: The Total Influence (Average Sensitivity) of a discrete function is one of its fundamental measures. We study the problem of approximating the total influence of a monotone Boolean function $f: \{0,1\}^n \to \{0,1\}$, which we denote by $I[f]$. We present a randomized algorithm that approximates the influence of such functions to within a multiplicative factor of $(1\pm \epsilon)$ by performing $O(\frac{\sqrt{n}\log n}{I[f]} \mathrm{poly}(1/\epsilon))$ queries. We also prove a lower bound of $Ω(\frac{\sqrt{n}}{\log n \cdot I[f]})$ on the query complexity of any constant-factor approximation algorithm for this problem (which holds for $I[f] = Ω(1)$), hence showing that our algorithm is almost optimal in terms of its dependence on $n$. For general functions we give a lower bound of $Ω(\frac{n}{I[f]})$, which matches the complexity of a simple sampling algorithm.

Submitted 27 January, 2011; originally announced January 2011.

arXiv:1009.5397 [pdf, ps, other]

Testing Closeness of Discrete Distributions

Authors: Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, Patrick White

Abstract: Given samples from two distributions over an $n$-element set, we wish to test whether these distributions are statistically close. We present an algorithm which uses sublinear in $n$, specifically, $O(n^{2/3}ε^{-8/3}\log n)$, independent samples from each distribution, runs in time linear in the sample size, makes no assumptions about the structure of the distributions, and distinguishes the cases when the distance between the distributions is small (less than $\max\{ε^{4/3}n^{-1/3}/32, εn^{-1/2}/4\}$) or large (more than $ε$) in $\ell_1$ distance. This result can be compared to the lower bound of $Ω(n^{2/3}ε^{-2/3})$ for this problem given by Valiant. Our algorithm has applications to the problem of testing whether a given Markov process is rapidly mixing. We present sublinear algorithms for several variants of this problem as well.

Submitted 4 November, 2010; v1 submitted 27 September, 2010; originally announced September 2010.

Comments: 26 pages, A preliminary version of this paper appeared in the 41st Symposium on Foundations of Computer Science, 2000, Redondo Beach, CA, A comment from W.D. Smith has been added on the title page

ACM Class: F.2.2; G.3

arXiv:0904.0292 [pdf, ps, other]

Sublinear Time Algorithms for Earth Mover's Distance

Authors: Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, Ronitt Rubinfeld

Abstract: We study the problem of estimating the Earth Mover's Distance (EMD) between probability distributions when given access only to samples. We give closeness testers and additive-error estimators over domains in $[0, Δ]^d$, with sample complexities independent of domain size - permitting the testability even of continuous distributions over infinite domains. Instead, our algorithms depend on other parameters, such as the diameter of the domain space, which may be significantly smaller. We also prove lower bounds showing the dependencies on these parameters to be essentially optimal. Additionally, we consider whether natural classes of distributions exist for which there are algorithms with better dependence on the dimension, and show that for highly clusterable data, this is indeed the case. Lastly, we consider a variant of the EMD, defined over tree metrics instead of the usual L1 metric, and give optimal algorithms.

Submitted 1 April, 2009; originally announced April 2009.

Comments: 12 pages

arXiv:0706.1084 [pdf, ps, other]

Sublinear Algorithms for Approximating String Compressibility

Authors: Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, Adam Smith

Abstract: We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and Lempel-Ziv (LZ), and present sublinear algorithms for approximating compressibility with respect to both schemes. We also give several lower bounds that show that our algorithms for both schemes cannot be improved significantly. Our investigation of LZ yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate the compressibility of a string with respect to Lempel-Ziv to the number of distinct short substrings contained in it. In addition, we show that approximating the compressibility with respect to LZ is related to approximating the support size of a distribution.

Submitted 7 June, 2007; originally announced June 2007.

Comments: To appear in the proceedings of RANDOM 2007
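
As a rough illustration of what approximating compressibility by sampling can look like for run-length encoding, the sketch below uses the number of runs in a string as a simplified proxy for its RLE cost. This proxy, the function names `run_count` and `estimate_run_count`, and the uniform position-sampling estimator are assumptions made for this example; they are not taken from the paper, whose cost measure and algorithms may differ. The idea is that the number of runs equals 1 plus the number of positions whose symbol differs from its predecessor, so sampling positions gives an estimate without reading the whole string.

```python
import random


def run_count(s):
    """Exact number of maximal runs of equal symbols in s (reads all of s)."""
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


def estimate_run_count(s, samples, seed=0):
    """Estimate the number of runs by probing `samples` random positions.

    Each probe reads two adjacent symbols, so the work depends on `samples`
    rather than on len(s).
    """
    n = len(s)
    if n < 2:
        return float(n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        i = rng.randrange(1, n)        # position with a predecessor
        if s[i] != s[i - 1]:
            hits += 1                  # position i starts a new run
    return 1 + (n - 1) * hits / samples


if __name__ == "__main__":
    text = "a" * 400 + "b" * 300 + "ab" * 150
    print(run_count(text), estimate_run_count(text, samples=500))
```

The estimate carries additive error that shrinks as `samples` grows, so it is most informative when the string has many runs relative to its length.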

Showing 1–43 of 43 results for author: Rubinfeld, R