Scalable Scheduling Policies for Quantum Satellite Networks

Authors: Albert Williams, Nitish K. Panigrahy, Andrew McGregor, Don Towsley

Abstract: As Low Earth Orbit (LEO) satellite mega constellations continue to be deployed for satellite internet and recent successful experiments in satellite-based quantum entanglement distribution emerge, a natural question arises: How should we coordinate transmissions and design scalable scheduling policies for a quantum satellite internet? In this work, we consider the problem of transmission scheduling in quantum satellite networks subject to resource constraints at the satellites and ground stations. We show that the most general problem of assigning satellites to ground station pairs for entanglement distribution is NP-hard. We then propose four heuristic algorithms and evaluate their performance for the Starlink mega constellation under various amounts of resources and placements of the ground stations. We find that the maximum number of receivers necessary per ground station grows very slowly with the total number of deployed ground stations. Our proposed algorithms, leveraging optimal weighted b-matching and the global greedy heuristic, outperform others in entanglement distribution rate, entanglement fidelity, and handover cost metrics. While developing these scheduling algorithms, we also designed a software system to simulate, visualize, and evaluate satellite mega-constellations for entanglement distribution.

Submitted 15 May, 2024; originally announced May 2024.

arXiv:2405.04261 [pdf, ps, other]

Graph Reconstruction from Noisy Random Subgraphs

Authors: Andrew McGregor, Rik Sengupta

Abstract: We consider the problem of reconstructing an undirected graph $G$ on $n$ vertices given multiple random noisy subgraphs or "traces". Specifically, a trace is generated by sampling each vertex with probability $p_v$, then taking the resulting induced subgraph on the sampled vertices, and then adding noise in the form of either (a) deleting each edge in the subgraph with probability $1-p_e$, or (b) deleting each edge with probability $f_e$ and transforming a non-edge into an edge with probability $f_e$. We show that, under mild assumptions on $p_v$, $p_e$ and $f_e$, if $G$ is selected uniformly at random, then $O(p_e^{-1} p_v^{-2} \log n)$ or $O((f_e-1/2)^{-2} p_v^{-2} \log n)$ traces suffice to reconstruct $G$ with high probability. In contrast, if $G$ is arbitrary, then $\exp(Ω(n))$ traces are necessary even when $p_v=1, p_e=1/2$.

Submitted 7 May, 2024; originally announced May 2024.

Comments: 6 pages, to appear in ISIT 2024
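
The trace model described in the abstract above can be sketched in a few lines. This is only an illustration of noise model (a), not the authors' code: the function name `sample_trace`, the edge-list representation, and the choice to keep the original vertex labels in the trace are all assumptions made here for readability.

```python
import random

def sample_trace(n, edges, p_v, p_e, rng):
    """Return (kept_vertices, kept_edges) for one trace under noise model (a).

    Hypothetical helper: vertices are 0..n-1 and keep their labels in the trace.
    """
    # Keep each vertex independently with probability p_v.
    kept = {v for v in range(n) if rng.random() < p_v}
    trace_edges = set()
    for u, v in edges:
        # An edge appears in the trace only if both endpoints were sampled
        # (induced subgraph) and the edge survives the noise step, i.e. it is
        # NOT deleted; deletion happens with probability 1 - p_e.
        if u in kept and v in kept and rng.random() < p_e:
            trace_edges.add((min(u, v), max(u, v)))
    return kept, trace_edges

if __name__ == "__main__":
    rng = random.Random(0)
    path = [(0, 1), (1, 2), (2, 3)]
    print(sample_trace(4, path, p_v=0.8, p_e=0.9, rng=rng))
```

With `p_v = 1` and `p_e = 1` the trace is the graph itself, which is a quick sanity check on the sampling logic.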

arXiv:2403.14087 [pdf, ps, other]

Improved Algorithms for Maximum Coverage in Dynamic and Random Order Streams

Authors: Amit Chakrabarti, Andrew McGregor, Anthony Wirth

Abstract: The maximum coverage problem is to select $k$ sets from a collection of sets such that the cardinality of the union of the selected sets is maximized. We consider $(1-1/e-ε)$-approximation algorithms for this NP-hard problem in three standard data stream models. 1. {\em Dynamic Model.} The stream consists of a sequence of sets being inserted and deleted. Our multi-pass algorithm uses $ε^{-2} k \cdot \text{polylog}(n,m)$ space. The best previous result (Assadi and Khanna, SODA 2018) used $(n + ε^{-4} k) \text{polylog}(n,m)$ space. While both algorithms use $O(ε^{-1} \log n)$ passes, our analysis shows that when $ε$ is a constant, it is possible to reduce the number of passes by a $1/\log \log n$ factor without incurring additional space. 2. {\em Random Order Model.} In this model, there are no deletions and the sets forming the instance are uniformly randomly permuted to form the input stream. We show that a single pass and $k \text{polylog}(n,m)$ space suffice for arbitrarily small constant $ε$. The best previous result, by Warneke et al.~(ESA 2023), used $k^2 \text{polylog}(n,m)$ space. 3. {\em Insert-Only Model.} Lastly, our results, along with numerous previous results, use a sub-sampling technique introduced by McGregor and Vu (ICDT 2017) to sparsify the input instance. We explain how this technique and others used in the paper can be implemented such that the amortized update time of our algorithm is polylogarithmic. This also implies an improvement of the state-of-the-art insert-only algorithms in terms of the update time: $\text{polylog}(m,n)$ update time suffices whereas the best previous result by Jaud et al.~(SEA 2023) required update time that was linear in $k$.

Submitted 20 March, 2024; originally announced March 2024.

ACM Class: F.2.2
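
For readers unfamiliar with the maximum coverage objective, the classical offline greedy heuristic is a useful reference point; it is well known to give a $(1-1/e)$-approximation. The sketch below is that textbook algorithm, not the streaming algorithms of the paper, and it assumes the whole collection fits in memory. The name `greedy_max_coverage` is illustrative.

```python
def greedy_max_coverage(sets, k):
    """Pick up to k sets greedily by marginal gain; return (indices, covered)."""
    covered = set()
    chosen = []
    for _ in range(min(k, len(sets))):
        # Choose the set that adds the most not-yet-covered elements.
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no remaining set adds anything new
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

if __name__ == "__main__":
    collection = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
    print(greedy_max_coverage(collection, k=2))
```

The streaming setting is harder precisely because this loop needs repeated access to every set, which a small-space single-pass algorithm cannot afford.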

arXiv:2307.12482 [pdf, ps, other]

Tight Approximations for Graphical House Allocation

Authors: Hadi Hosseini, Andrew McGregor, Rik Sengupta, Rohit Vaish, Vignesh Viswanathan

Abstract: The Graphical House Allocation problem asks: how can $n$ houses (each with a fixed non-negative value) be assigned to the vertices of an undirected graph $G$, so as to minimize the "aggregate local envy", i.e., the sum of absolute differences along the edges of $G$? This problem generalizes the classical Minimum Linear Arrangement problem, as well as the well-known House Allocation Problem from Economics, the latter of which has notable practical applications in organ exchanges. Recent work has studied the computational aspects of Graphical House Allocation and observed that the problem is NP-hard and inapproximable even on particularly simple classes of graphs, such as vertex disjoint unions of paths. However, the dependence of any approximations on the structural properties of the underlying graph had not been studied. In this work, we give a complete characterization of the approximability of the Graphical House Allocation problem. We present algorithms to approximate the optimal envy on general graphs, trees, planar graphs, bounded-degree graphs, bounded-degree planar graphs, and bounded-degree trees. For each of these graph classes, we then prove matching lower bounds, showing that in each case, no significant improvement can be attained unless P = NP. We also present general approximation ratios as a function of structural parameters of the underlying graph, such as treewidth; these match the aforementioned tight upper bounds in general, and are significantly better approximations for many natural subclasses of graphs. Finally, we present constant factor approximation schemes for the special classes of complete binary trees and random graphs.

Submitted 12 October, 2023; v1 submitted 23 July, 2023; originally announced July 2023.
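
The "aggregate local envy" objective in the abstract above is easy to state in code. The following sketch evaluates it and finds the optimum by exhaustive search on tiny instances; the brute-force search is purely illustrative (the problem is NP-hard in general) and the function names are hypothetical.

```python
from itertools import permutations

def aggregate_envy(edges, assignment):
    """Sum of |value(u) - value(v)| over edges; assignment[v] is the house value at vertex v."""
    return sum(abs(assignment[u] - assignment[v]) for u, v in edges)

def brute_force_min_envy(edges, values):
    """Try every assignment of house values to vertices 0..n-1 (exponential time)."""
    best = None
    for perm in permutations(values):
        envy = aggregate_envy(edges, perm)
        if best is None or envy < best[0]:
            best = (envy, perm)
    return best

if __name__ == "__main__":
    path = [(0, 1), (1, 2)]
    print(brute_force_min_envy(path, [3, 1, 2]))
```

On a path, placing the values in sorted order is optimal, which the three-vertex example confirms.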

arXiv:2211.06536 [pdf, other]

Improving the Efficiency of the PC Algorithm by Using Model-Based Conditional Independence Tests

Authors: Erica Cai, Andrew McGregor, David Jensen

Abstract: Learning causal structure is useful in many areas of artificial intelligence, including planning, robotics, and explanation. Constraint-based structure learning algorithms such as PC use conditional independence (CI) tests to infer causal structure. Traditionally, constraint-based algorithms perform CI tests with a preference for smaller-sized conditioning sets, partially because the statistical power of conventional CI tests declines rapidly as the size of the conditioning set increases. However, many modern conditional independence tests are model-based, and these tests use well-regularized models that maintain statistical power even with very large conditioning sets. This suggests an intriguing new strategy for constraint-based algorithms which may result in a reduction of the total number of CI tests performed: Test variable pairs with large conditioning sets first, as a pre-processing step that finds some conditional independencies quickly, before moving on to the more conventional strategy that favors small conditioning sets. We propose such a pre-processing step for the PC algorithm which relies on performing CI tests on a few randomly selected large conditioning sets. We perform an empirical analysis on directed acyclic graphs (DAGs) that correspond to real-world systems and both empirical and theoretical analyses for Erdős-Rényi DAGs. Our results show that Pre-Processing Plus PC (P3PC) performs far fewer CI tests than the original PC algorithm, between 0.5% and 36%, and often less than 10%, of the CI tests that the PC algorithm alone performs. The efficiency gains are particularly significant for the DAGs corresponding to real-world systems.

Submitted 11 November, 2022; originally announced November 2022.

Comments: Accepted at NeurIPS 2022 Workshop on Causality for Real-world Impact; 8 pages of main text including references

arXiv:2207.02817 [pdf, ps, other]

Non-Adaptive Edge Counting and Sampling via Bipartite Independent Set Queries

Authors: Raghavendra Addanki, Andrew McGregor, Cameron Musco

Abstract: We study the problem of estimating the number of edges in an $n$-vertex graph, accessed via the Bipartite Independent Set query model introduced by Beame et al. (ITCS '18). In this model, each query returns a Boolean, indicating the existence of at least one edge between two specified sets of nodes. We present a non-adaptive algorithm that returns a $(1\pm ε)$ relative error approximation to the number of edges, with query complexity $\tilde O(ε^{-5}\log^{5} n )$, where $\tilde O(\cdot)$ hides $\textrm{poly}(\log \log n)$ dependencies. This is the first non-adaptive algorithm in this setting achieving $\textrm{poly}(1/ε,\log n)$ query complexity. Prior work requires $Ω(\log^2 n)$ rounds of adaptivity. We avoid this by taking a fundamentally different approach, inspired by work on single-pass streaming algorithms. Moreover, for constant $ε$, our query complexity significantly improves on the best known adaptive algorithm due to Bhattacharya et al. (STACS '22), which requires $O(ε^{-2} \log^{11} n)$ queries. Building on our edge estimation result, we give the first non-adaptive algorithm for outputting a nearly uniformly sampled edge with query complexity $\tilde O(ε^{-6} \log^{6} n)$, improving on the works of Dell et al. (SODA '20) and Bhattacharya et al. (STACS '22), which require $Ω(\log^3 n)$ rounds of adaptivity. Finally, as a consequence of our edge sampling algorithm, we obtain a $\tilde O(n\log^8 n)$ query algorithm for connectivity, using two rounds of adaptivity. This improves on a three-round algorithm of Assadi et al. (ESA '21) and is tight; there is no non-adaptive algorithm for connectivity making $o(n^2)$ queries.

Submitted 6 July, 2022; originally announced July 2022.

Comments: European Symposium on Algorithms (ESA) 2022
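
To make the query model in the abstract above concrete, here is a toy oracle that answers a Bipartite Independent Set query against an explicit edge list. It is a simulation aid only: a real algorithm sees nothing but the Boolean answers. The name `bis_query` and the explicit disjointness check are assumptions of this sketch.

```python
def bis_query(edges, left, right):
    """Return True iff some edge has one endpoint in `left` and the other in `right`."""
    left, right = set(left), set(right)
    if left & right:
        # Assumption of this sketch: the two query sets must be disjoint.
        raise ValueError("query sets must be disjoint")
    return any(
        (u in left and v in right) or (u in right and v in left)
        for u, v in edges
    )

if __name__ == "__main__":
    graph = [(0, 1), (2, 3)]
    print(bis_query(graph, {0}, {1}))        # an edge crosses the two sets
    print(bis_query(graph, {0, 1}, {2, 3}))  # no edge crosses the two sets
```

A non-adaptive algorithm in this model must fix all of its `(left, right)` pairs before seeing any answer, which is what makes the query bounds in the abstract notable.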

arXiv:2205.09804 [pdf, ps, other]

Estimation of Entropy in Constant Space with Improved Sample Complexity

Authors: Maryam Aliakbarpour, Andrew McGregor, Jelani Nelson, Erik Waingarten

Abstract: Recent work of Acharya et al. (NeurIPS 2019) showed how to estimate the entropy of a distribution $\mathcal D$ over an alphabet of size $k$ up to $\pmε$ additive error by streaming over $(k/ε^3) \cdot \text{polylog}(1/ε)$ i.i.d. samples and using only $O(1)$ words of memory. In this work, we give a new constant memory scheme that reduces the sample complexity to $(k/ε^2)\cdot \text{polylog}(1/ε)$. We conjecture that this is optimal up to $\text{polylog}(1/ε)$ factors.

Submitted 19 May, 2022; originally announced May 2022.

arXiv:2201.06678 [pdf, other]

Improved Approximation and Scalability for Fair Max-Min Diversification

Authors: Raghavendra Addanki, Andrew McGregor, Alexandra Meliou, Zafeiria Moumoulidou

Abstract: Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i\in [m]$, such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance over diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results: 1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a factor $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-ε$ for any constant $ε$. We also present a linear time algorithm returning an $m+1$ approximation with exact fairness. The best previous result was a $3m-1$ approximation. 2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories and any constant $ε>0$, we present a $1+ε$ approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time where $k=k_1+\ldots+k_m$. We can improve the running time to $O(nk) + \text{poly}(k)$ at the expense of only picking $(1-ε) k_i$ points from category $i\in [m]$. Finally, we present algorithms suitable to processing massive data sets including single-pass data stream algorithms and composable coresets for the distributed processing.

Submitted 17 January, 2022; originally announced January 2022.

Comments: To appear in ICDT 2022
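
The fair Max-Min objective defined in the abstract above can be checked on small inputs with an exhaustive search. The sketch below assumes Euclidean points, a total quota of at least two, and a hypothetical `(point, category)` item format; it is a reference for the problem definition, not one of the paper's approximation algorithms.

```python
import math
from collections import Counter
from itertools import combinations

def brute_force_fair_max_min(items, quotas):
    """items: list of (point, category); quotas: {category: k_i}.

    Returns (diversity, subset) maximizing the minimum pairwise distance
    among subsets with exactly k_i points per category, or None if no
    subset is feasible. Exponential time; assumes sum(quotas.values()) >= 2.
    """
    k = sum(quotas.values())
    best = None
    for subset in combinations(items, k):
        counts = Counter(cat for _, cat in subset)
        # Fairness: exactly k_i selected points from each category i.
        if any(counts[c] != q for c, q in quotas.items()):
            continue
        # Diversity: minimum pairwise distance within the subset.
        d = min(math.dist(p, q) for (p, _), (q, _) in combinations(subset, 2))
        if best is None or d > best[0]:
            best = (d, subset)
    return best

if __name__ == "__main__":
    pts = [((0.0,), "a"), ((1.0,), "a"), ((4.0,), "b"), ((10.0,), "b")]
    print(brute_force_fair_max_min(pts, {"a": 1, "b": 1}))
```

Tightening a quota can only lower the achievable diversity, as the second test case below illustrates: forcing two points from category "a" caps the minimum distance at 1.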

arXiv:2105.08215 [pdf, ps, other]

Vertex Ordering Problems in Directed Graph Streams

Authors: Amit Chakrabarti, Prantar Ghosh, Andrew McGregor, Sofya Vorotnikova

Abstract: We consider directed graph algorithms in a streaming setting, focusing on problems concerning orderings of the vertices. This includes such fundamental problems as topological sorting and acyclicity testing. We also study the related problems of finding a minimum feedback arc set (edges whose removal yields an acyclic graph), and finding a sink vertex. We are interested in both adversarially-ordered and randomly-ordered streams. For arbitrary input graphs with edges ordered adversarially, we show that most of these problems have high space complexity, precluding sublinear-space solutions. Some lower bounds also apply when the stream is randomly ordered: e.g., in our most technical result we show that testing acyclicity in the $p$-pass random-order model requires roughly $n^{1+1/p}$ space. For other problems, random ordering can make a dramatic difference: e.g., it is possible to find a sink in an acyclic tournament in the one-pass random-order model using polylog$(n)$ space whereas under adversarial ordering roughly $n^{1/p}$ space is necessary and sufficient given $Θ(p)$ passes. We also design sublinear algorithms for the feedback arc set problem in tournament graphs; for random graphs; and for randomly ordered streams. In some cases, we give lower bounds establishing that our algorithms are essentially space-optimal. Together, our results complement the much maturer body of work on algorithms for undirected graph streams.

Submitted 17 May, 2021; originally announced May 2021.

Comments: Appeared in SODA 2020

arXiv:2102.08476 [pdf, other]

Maximum Coverage in the Data Stream Model: Parameterized and Generalized

Authors: Andrew McGregor, David Tench, Hoa T. Vu

Abstract: We present algorithms for the Max-Cover and Max-Unique-Cover problems in the data stream model. The inputs to both problems are $m$ subsets of a universe of size $n$ and a value $k\in [m]$. In Max-Cover, the problem is to find a collection of at most $k$ sets such that the number of elements covered by at least one set is maximized. In Max-Unique-Cover, the problem is to find a collection of at most $k$ sets such that the number of elements covered by exactly one set is maximized. Our goal is to design single-pass algorithms that use space that is sublinear in the input size. Our main algorithmic results are: If the sets have size at most $d$, there exist single-pass algorithms using $\tilde{O}(d^{d+1} k^d)$ space that solve both problems exactly. This is optimal up to polylogarithmic factors for constant $d$. If each element appears in at most $r$ sets, we present single-pass algorithms using $\tilde{O}(k^2 r/ε^3)$ space that return a $1+ε$ approximation in the case of Max-Cover. We also present a single-pass algorithm using slightly more memory, i.e., $\tilde{O}(k^3 r/ε^{4})$ space, that $1+ε$ approximates Max-Unique-Cover. In contrast to the above results, when $d$ and $r$ are arbitrary, any constant-pass $1+ε$ approximation algorithm for either problem requires $Ω(ε^{-2}m)$ space but a single-pass $O(ε^{-2}mk)$ space algorithm exists. In fact any constant-pass algorithm with an approximation better than $e/(e-1)$ and $e^{1-1/k}$ for Max-Cover and Max-Unique-Cover respectively requires $Ω(m/k^2)$ space when $d$ and $r$ are unrestricted. En route, we also obtain an algorithm for a parameterized version of the streaming Set-Cover problem.

Submitted 16 February, 2021; originally announced February 2021.

Comments: Conference version to appear at ICDT 2021

arXiv:2012.13976 [pdf, ps, other]

Intervention Efficient Algorithms for Approximate Learning of Causal Graphs

Authors: Raghavendra Addanki, Andrew McGregor, Cameron Musco

Abstract: We study the problem of learning the causal relationships between a set of observed variables in the presence of latents, while minimizing the cost of interventions on the observed variables. We assume access to an undirected graph $G$ on the observed variables whose edges represent either all direct causal relationships or, less restrictively, a superset of causal relationships (identified, e.g., via conditional independence tests or a domain expert). Our goal is to recover the directions of all causal or ancestral relations in $G$, via a minimum cost set of interventions. It is known that constructing an exact minimum cost intervention set for an arbitrary graph $G$ is NP-hard. We further argue that, conditioned on the hardness of approximate graph coloring, no polynomial time algorithm can achieve an approximation factor better than $Θ(\log n)$, where $n$ is the number of observed variables in $G$. To overcome this limitation, we introduce a bi-criteria approximation goal that lets us recover the directions of all but $εn^2$ edges in $G$, for some specified error parameter $ε> 0$. Under this relaxed goal, we give polynomial time algorithms that achieve intervention cost within a small constant factor of the optimal. Our algorithms combine work on efficient intervention design and the design of low-cost separating set systems, with ideas from the literature on graph property testing.

Submitted 27 December, 2020; originally announced December 2020.

Comments: To appear, International Conference on Algorithmic Learning Theory (ALT) 2021

arXiv:2010.09141 [pdf, other]

Diverse Data Selection under Fairness Constraints

Authors: Zafeiria Moumoulidou, Andrew McGregor, Alexandra Meliou

Abstract: Diversity is an important principle in data selection and summarization, facility location, and recommendation systems. Our work focuses on maximizing diversity in data selection, while offering fairness guarantees. In particular, we offer the first study that augments the Max-Min diversification objective with fairness constraints. More specifically, given a universe $U$ of $n$ elements that can be partitioned into $m$ disjoint groups, we aim to retrieve a $k$-sized subset that maximizes the pairwise minimum distance within the set (diversity) and contains a pre-specified $k_i$ number of elements from each group $i$ (fairness). We show that this problem is NP-complete even in metric spaces, and we propose three novel algorithms, linear in $n$, that provide strong theoretical approximation guarantees for different values of $m$ and $k$. Finally, we extend our algorithms and analysis to the case where groups can be overlapping.

Submitted 18 October, 2020; originally announced October 2020.

arXiv:2005.11736 [pdf, other]

Efficient Intervention Design for Causal Discovery with Latents

Authors: Raghavendra Addanki, Shiva Prasad Kasiviswanathan, Andrew McGregor, Cameron Musco

Abstract: We consider recovering a causal graph in the presence of latent variables, where we seek to minimize the cost of interventions used in the recovery process. We consider two intervention cost models: (1) a linear cost model where the cost of an intervention on a subset of variables has a linear form, and (2) an identity cost model where the cost of an intervention is the same, regardless of what variables it is on, i.e., the goal is just to minimize the number of interventions. Under the linear cost model, we give an algorithm to identify the ancestral relations of the underlying causal graph, achieving within a $2$-factor of the optimal intervention cost. This approximation factor can be improved to $1+ε$ for any $ε> 0$ under some mild restrictions. Under the identity cost model, we bound the number of interventions needed to recover the entire causal graph, including the latent variables, using a parameterization of the causal graph through a special type of colliders. In particular, we introduce the notion of $p$-colliders, which are colliders between pairs of nodes arising from a specific type of conditioning in the causal graph, and provide an upper bound on the number of interventions as a function of the maximum number of $p$-colliders between any two nodes in the causal graph.

Submitted 12 July, 2020; v1 submitted 24 May, 2020; originally announced May 2020.

Comments: International Conference on Machine Learning 2020

arXiv:2002.11661 [pdf, other]

Data Structures & Algorithms for Exact Inference in Hierarchical Clustering

Authors: Craig S. Greenberg, Sebastian Macaluso, Nicholas Monath, Ji-Ah Lee, Patrick Flaherty, Kyle Cranmer, Andrew McGregor, Andrew McCallum

Abstract: Hierarchical clustering is a fundamental task often used to discover meaningful structures in data, such as phylogenetic trees, taxonomies of concepts, subtypes of cancer, and cascades of particle decays in particle physics. Typically, approximate algorithms are used for inference due to the combinatorial number of possible hierarchical clusterings. In contrast to existing methods, we present novel dynamic-programming algorithms for \emph{exact} inference in hierarchical clustering based on a novel trellis data structure, and we prove that we can exactly compute the partition function, maximum likelihood hierarchy, and marginal probabilities of sub-hierarchies and clusters. Our algorithms scale in time and space proportional to the powerset of $N$ elements which is super-exponentially more efficient than explicitly considering each of the $(2N-3)!!$ possible hierarchies. Also, for larger datasets where our exact algorithms become infeasible, we introduce an approximate algorithm based on a sparse trellis that compares well to other benchmarks. Exact methods are relevant to data analyses in particle physics and for finding correlations among gene expression in cancer genomics, and we give examples in both areas, where our algorithms outperform greedy and beam search baselines. In addition, we consider Dasgupta's cost with synthetic data.

Submitted 22 October, 2020; v1 submitted 26 February, 2020; originally announced February 2020.

Comments: 27 pages, 12 figures
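
The gap between the $(2N-3)!!$ hierarchies and the $2^N$ subsets mentioned in the abstract above is easy to see numerically. The helper below simply evaluates the double factorial; taking the powerset size to be $2^N$ is the only assumption, and the function name is illustrative.

```python
def num_hierarchies(n):
    """(2n-3)!! for n >= 2: the count of binary hierarchies on n labelled elements."""
    result = 1
    # Multiply the odd numbers 2n-3, 2n-5, ..., 3, 1.
    for odd in range(2 * n - 3, 0, -2):
        result *= odd
    return result

if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, num_hierarchies(n), 2 ** n)
```

Already at $N = 10$ there are 34,459,425 hierarchies against 1,024 subsets, which is why organizing the computation around subsets rather than whole trees pays off.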

arXiv:2001.06776 [pdf, ps, other]

Algebraic and Analytic Approaches for Parameter Learning in Mixture Models

Authors: Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, Soumyabrata Pal

Abstract: We present two different approaches for parameter learning in several mixture models in one dimension. Our first approach uses complex-analytic methods and applies to Gaussian mixtures with shared variance, binomial mixtures with shared success probability, and Poisson mixtures, among others. An example result is that $\exp(O(N^{1/3}))$ samples suffice to exactly learn a mixture of $k<N$ Poisson distributions, each with integral rate parameters bounded by $N$. Our second approach uses algebraic and combinatorial tools and applies to binomial mixtures with shared trial parameter $N$ and differing success parameters, as well as to mixtures of geometric distributions. Again, as an example, for binomial mixtures with $k$ components and success parameters discretized to resolution $ε$, $O(k^2(N/ε)^{8/\sqrtε})$ samples suffice to exactly recover the parameters. For some of these distributions, our results represent the first guarantees for parameter estimation.

Submitted 19 January, 2020; originally announced January 2020.

Comments: 22 pages, Accepted at Algorithmic Learning Theory (ALT) 2020

arXiv:1910.14106 [pdf, ps, other]

Sample Complexity of Learning Mixtures of Sparse Linear Regressions

Authors: Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, Soumyabrata Pal

Abstract: In the problem of learning mixtures of linear regressions, the goal is to learn a collection of signal vectors from a sequence of (possibly noisy) linear measurements, where each measurement is evaluated on an unknown signal drawn uniformly from this collection. This setting is quite expressive and has been studied both in terms of practical applications and for the sake of establishing theoretical guarantees. In this paper, we consider the case where the signal vectors are sparse; this generalizes the popular compressed sensing paradigm. We improve upon the state-of-the-art results as follows: In the noisy case, we resolve an open question of Yin et al. (IEEE Transactions on Information Theory, 2019) by showing how to handle collections of more than two vectors and present the first robust reconstruction algorithm, i.e., if the signals are not perfectly sparse, we still learn a good sparse approximation of the signals. In the noiseless case, as well as in the noisy case, we show how to circumvent the need for a restrictive assumption required in the previous work. Our techniques are quite different from those in the previous work: for the noiseless case, we rely on a property of sparse polynomials and for the noisy case, we provide new connections to learning Gaussian mixtures and use ideas from the theory of error-correcting codes.

Submitted 30 October, 2019; originally announced October 2019.

Comments: NeurIPS 2019

arXiv:1904.09618 [pdf, ps, other]

Trace Reconstruction: Generalized and Parameterized

Authors: Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, Soumyabrata Pal

Abstract: In the beautifully simple-to-state problem of trace reconstruction, the goal is to reconstruct an unknown binary string $x$ given random "traces" of $x$ where each trace is generated by deleting each coordinate of $x$ independently with probability $p<1$. The problem is well studied both when the unknown string is arbitrary and when it is chosen uniformly at random. For both settings, there is still an exponential gap between upper and lower sample complexity bounds and our understanding of the problem is still surprisingly limited. In this paper, we consider natural parameterizations and generalizations of this problem in an effort to attain a deeper and more comprehensive understanding. We prove that $\exp(O(n^{1/4} \sqrt{\log n}))$ traces suffice for reconstructing arbitrary matrices. In the matrix version of the problem, each row and column of an unknown $\sqrt{n}\times \sqrt{n}$ matrix is deleted independently with probability $p$. Our result contrasts with the best known results for sequence reconstruction where the best known upper bound is $\exp(O(n^{1/3}))$. An optimal result for random matrix reconstruction: we show that $Θ(\log n)$ traces are necessary and sufficient. This is in contrast to the problem for random sequences where there is a super-logarithmic lower bound and the best known upper bound is $\exp({O}(\log^{1/3} n))$. We show that $\exp(O(k^{1/3}\log^{2/3} n))$ traces suffice to reconstruct $k$-sparse strings, providing an improvement over the best known sequence reconstruction results when $k = o(n/\log^2 n)$.
We show that $\textrm{poly}(n)$ traces suffice if $x$ is $k$-sparse and we additionally have a "separation" promise, specifically that the indices of 1's in $x$ all differ by $Ω(k \log n)$.

Submitted 13 March, 2021; v1 submitted 21 April, 2019; originally announced April 2019.
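
Illustrative note on the entry above: the trace model in the abstract (each coordinate of $x$ deleted independently with probability $p$) can be simulated in a few lines. This is only a sketch of the input model, not of any reconstruction algorithm from the paper; the function name `make_trace` and the example string are hypothetical.

```python
import random

def make_trace(x, p, rng):
    """Return one trace of the binary string x.

    Each coordinate is deleted independently with probability p,
    and the surviving coordinates keep their original order.
    """
    return "".join(c for c in x if rng.random() >= p)

rng = random.Random(0)          # fixed seed so runs are repeatable
x = "1011001110"                # hypothetical unknown string
traces = [make_trace(x, 0.3, rng) for _ in range(5)]
for t in traces:
    print(t)                    # each printed trace is a subsequence of x
```

A reconstruction algorithm would receive only `traces` and try to recover `x`; the bounds quoted in the abstract concern how many such traces are needed.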

arXiv:1902.04738 [pdf, other]

doi 10.1145/3314221.3314582

Mesh: Compacting Memory Management for C/C++ Applications

Authors: Bobby Powers, David Tench, Emery D. Berger, Andrew McGregor

Abstract: Programs written in C/C++ can suffer from serious memory fragmentation, leading to low utilization of memory, degraded performance, and application failure due to memory exhaustion. This paper introduces Mesh, a plug-in replacement for malloc that, for the first time, eliminates fragmentation in unmodified C/C++ applications. Mesh combines novel randomized algorithms with widely-supported virtual memory operations to provably reduce fragmentation, breaking the classical Robson bounds with high probability. Mesh generally matches the runtime performance of state-of-the-art memory allocators while reducing memory consumption; in particular, it reduces the memory consumption of Firefox by 16% and Redis by 39%.

Submitted 16 February, 2019; v1 submitted 12 February, 2019; originally announced February 2019.

Comments: Draft version, accepted at PLDI 2019

arXiv:1812.02023 [pdf, ps, other]

Correlation Clustering in Data Streams

Authors: Kook Jin Ahn, Graham Cormode, Sudipto Guha, Andrew McGregor, Anthony Wirth

Abstract: Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as $k$-center, $k$-median, and $k$-means. Such algorithms need to be both time and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on $n$ nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, $O(n\cdot \mbox{polylog}~n)$-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the "quality" of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. Unfortunately, the standard LP and SDP formulations are not obviously solvable in $O(n\cdot \mbox{polylog}~n)$-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.

Submitted 5 December, 2018; originally announced December 2018.

arXiv:1706.09197 [pdf, other]

Storage Capacity as an Information-Theoretic Vertex Cover and the Index Coding Rate

Authors: Arya Mazumdar, Andrew McGregor, Sofya Vorotnikova

Abstract: Motivated by applications in distributed storage, the storage capacity of a graph was recently defined to be the maximum amount of information that can be stored across the vertices of a graph such that the information at any vertex can be recovered from the information stored at the neighboring vertices. Computing the storage capacity is a fundamental problem in network coding and is related, or equivalent, to some well-studied problems such as index coding with side information and generalized guessing games. In this paper, we consider storage capacity as a natural information-theoretic analogue of the minimum vertex cover of a graph. Indeed, while it was known that storage capacity is upper bounded by minimum vertex cover, we show that by treating it as such we can get a 3/2 approximation for planar graphs, and a 4/3 approximation for triangle-free planar graphs. Since the storage capacity is intimately related to the index coding rate, we get a 2 approximation of index coding rate for planar graphs and 3/2 approximation for triangle-free planar graphs. We also show a polynomial time approximation scheme for the index coding rate when the alphabet size is constant. We then develop a general method of "gadget covering" to upper bound the storage capacity in terms of the average of a set of vertex covers. This method is intuitive and leads to the exact characterization of storage capacity for various families of graphs.
As an illustrative example, we use this approach to derive the exact storage capacity of cycles-with-chords, a family of graphs related to outerplanar graphs. Finally, we generalize the storage capacity notion to include recovery from partial node failures in distributed storage. We show tight upper and lower bounds on this partial recovery capacity that scale nicely with the fraction of failures in a vertex.

Submitted 17 December, 2018; v1 submitted 28 June, 2017; originally announced June 2017.

Comments: A shorter version of this paper in the proceedings of the IEEE International Symposium on Information Theory, 2017 contains an error. The approximation factor for index coding rate for planar graphs was wrongly claimed to be 1.923. The correct approximation factor of our method is 2, and we have corrected Theorem 3 in this version

arXiv:1612.02531 [pdf, ps, other]

A Note on Logarithmic Space Stream Algorithms for Matchings in Low Arboricity Graphs

Authors: Andrew McGregor, Sofya Vorotnikova

Abstract: We present a data stream algorithm for estimating the size of the maximum matching of a low arboricity graph. Recall that a graph has arboricity $α$ if its edges can be partitioned into at most $α$ forests and that a planar graph has arboricity $α=3$. Estimating the size of the maximum matching in such graphs has been a focus of recent data stream research. A surprising result on this problem was recently proved by Cormode et al. They designed an ingenious algorithm that returned a $(22.5α+6)(1+ε)$ approximation using a single pass over the edges of the graph (ordered arbitrarily) and $O(ε^{-2}α\cdot \log n \cdot \log_{1+ε} n)$ space. In this note, we improve the approximation factor to $(α+2)(1+ε)$ via a tighter analysis and show that, with a modification of their algorithm, the space required can be reduced to $O(ε^{-2} \log n)$.

Submitted 14 August, 2017; v1 submitted 8 December, 2016; originally announced December 2016.

Comments: An update to the proof of Theorem 3. See paper for details

arXiv:1610.06199 [pdf, other]

Better Streaming Algorithms for the Maximum Coverage Problem

Authors: Andrew McGregor, Hoa T. Vu

Abstract: We study the classic NP-Hard problem of finding the maximum $k$-set coverage in the data stream model: given a set system of $m$ sets that are subsets of a universe $\{1,\ldots,n \}$, find the $k$ sets that cover the largest number of distinct elements. The problem can be approximated up to a factor $1-1/e$ in polynomial time. In the streaming-set model, the sets and their elements are revealed online. The main goal of our work is to design algorithms, with approximation guarantees as close as possible to $1-1/e$, that use sublinear space $o(mn)$. Our main results are: Two $(1-1/e-ε)$ approximation algorithms: One uses $O(ε^{-1})$ passes and $\tilde{O}(ε^{-2} k)$ space whereas the other uses only a single pass but $\tilde{O}(ε^{-2} m)$ space. We show that any approximation factor better than $(1-(1-1/k)^k)$ in constant passes requires $Ω(m)$ space for constant $k$ even if the algorithm is allowed unbounded processing time. We also demonstrate a single-pass, $(1-ε)$ approximation algorithm using $\tilde{O}(ε^{-2} m \cdot \min(k,ε^{-1}))$ space. We also study the maximum $k$-vertex coverage problem in the dynamic graph stream model. In this model, the stream consists of edge insertions and deletions of a graph on $N$ vertices. The goal is to find $k$ vertices that cover the largest number of distinct edges. We show that any constant approximation in constant passes requires $Ω(N)$ space for constant $k$ whereas $\tilde{O}(ε^{-2}N)$ space is sufficient for a $(1-ε)$ approximation and arbitrary $k$ in a single pass.
For regular graphs, we show that $\tilde{O}(ε^{-3}k)$ space is sufficient for a $(1-ε)$ approximation in a single pass. We generalize this to a $(κ-ε)$ approximation when the ratio between the minimum and maximum degree is bounded below by $κ$.

Submitted 9 May, 2018; v1 submitted 19 October, 2016; originally announced October 2016.

Comments: A preliminary version appeared in ICDT 2017; typos fixed
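
Illustrative note on the entry above: the $1-1/e$ polynomial-time guarantee mentioned in the abstract is classically achieved by the offline greedy algorithm, sketched below. This is the textbook baseline that keeps all sets in memory, not one of the paper's sublinear-space streaming algorithms; the function name `greedy_max_coverage` and the example instance are hypothetical.

```python
def greedy_max_coverage(sets, k):
    """Pick up to k sets greedily, each time taking the set that
    covers the most elements not covered so far.

    Returns (indices of chosen sets, set of covered elements).
    """
    covered = set()
    chosen = []
    for _ in range(min(k, len(sets))):
        # Index of the set with the largest marginal gain (ties: lowest index).
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # nothing new can be covered
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

# Hypothetical instance: four subsets of the universe {1, ..., 7}.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
chosen, covered = greedy_max_coverage(sets, k=2)
print(chosen, sorted(covered))
```

The greedy rule needs random access to every set at every step, which is exactly what the streaming setting forbids; that gap is what the space bounds in the abstract address.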

arXiv:1604.01097 [pdf, other]

A Dispersion Minimized Mimetic Method for Cold Plasma

Authors: V. A. Bokil, V. Gyrya, D. A. McGregor

Abstract: In this paper we consider the lowest edge-based mimetic finite difference (MFD) discretization in space for Maxwell's equations in cold plasma on rectangular meshes. The method uses a generalized form of mass lumping that, on one hand, eliminates a need for linear solves at every iteration while, on the other hand, retains a set of free parameters of the MFD discretization. We perform an optimization procedure, called m-adaptation, that identified a set of free parameters that lead to the smallest numerical dispersion. The choice of the time stepping proved to be critical for successful optimization. Using exponential time differencing we were able to reduce the numerical dispersion error from second to fourth order of accuracy in mesh size. It was not possible to achieve this order of magnitude reduction in numerical dispersion error using the standard leapfrog time stepping. Numerical simulations independently verify our theoretical findings.

Submitted 4 April, 2016; originally announced April 2016.

Comments: 19 pages, 5 figures, 2 tables

MSC Class: 65M06; 65M60; 78M20; 78M10

arXiv:1506.04417 [pdf, ps, other]

Densest Subgraph in Dynamic Graph Streams

Authors: Andrew McGregor, David Tench, Sofya Vorotnikova, Hoa T. Vu

Abstract: In this paper, we consider the problem of approximating the densest subgraph in the dynamic graph stream model. In this model of computation, the input graph is defined by an arbitrary sequence of edge insertions and deletions and the goal is to analyze properties of the resulting graph given memory that is sub-linear in the size of the stream. We present a single-pass algorithm that returns a $(1+ε)$ approximation of the maximum density with high probability; the algorithm uses $O(ε^{-2} n \operatorname{polylog} n)$ space, processes each stream update in $\operatorname{polylog}(n)$ time, and uses $\operatorname{poly}(n)$ post-processing time where $n$ is the number of nodes. The space used by our algorithm matches the lower bound of Bahmani et al. (PVLDB 2012) up to a poly-logarithmic factor for constant $ε$. The best existing results for this problem were established recently by Bhattacharya et al. (STOC 2015). They presented a $(2+ε)$ approximation algorithm using similar space and another algorithm that both processed each update and maintained a $(4+ε)$ approximation of the current maximum density in $\operatorname{polylog}(n)$ time per-update.

Submitted 14 June, 2015; originally announced June 2015.

Comments: To appear in MFCS 2015

arXiv:1506.02574 [pdf, other]

Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution

Authors: Olivia Simpson, C. Seshadhri, Andrew McGregor

Abstract: The degree distribution is one of the most fundamental graph properties of interest for real-world graphs. It has been widely observed in numerous domains that graphs typically have a tailed or scale-free degree distribution. While the average degree is usually quite small, the variance is quite high and there are vertices with degrees at all scales. We focus on the problem of approximating the degree distribution of a large streaming graph, with small storage. We design an algorithm headtail, whose main novelty is a new estimator of infrequent degrees using truncated geometric random variables. We give a mathematical analysis of headtail and show that it has excellent behavior in practice. We can process streams with millions of edges with storage less than 1% and get extremely accurate approximations for all scales in the degree distribution. We also introduce a new notion of Relative Hausdorff distance between tailed histograms. Existing notions of distances between distributions are not suitable, since they ignore infrequent degrees in the tail. The Relative Hausdorff distance measures deviations at all scales, and is a more suitable distance for comparing degree distributions. By tracking this new measure, we are able to give strong empirical evidence of the convergence of headtail.

Submitted 25 November, 2015; v1 submitted 8 June, 2015; originally announced June 2015.

arXiv:1505.01731 [pdf, other]

Kernelization via Sampling with Applications to Dynamic Graph Streams

Authors: Rajesh Chitnis, Graham Cormode, Hossein Esfandiari, MohammadTaghi Hajiaghayi, Andrew McGregor, Morteza Monemizadeh, Sofya Vorotnikova

Abstract: In this paper we present a simple but powerful subgraph sampling primitive that is applicable in a variety of computational models including dynamic graph streams (where the input graph is defined by a sequence of edge/hyperedge insertions and deletions) and distributed systems such as MapReduce. In the case of dynamic graph streams, we use this primitive to prove the following results: -- Matching: First, there exists an $\tilde{O}(k^2)$ space algorithm that returns an exact maximum matching on the assumption the cardinality is at most $k$. The best previous algorithm used $\tilde{O}(kn)$ space where $n$ is the number of vertices in the graph and we prove our result is optimal up to logarithmic factors. Our algorithm has $\tilde{O}(1)$ update time. Second, there exists an $\tilde{O}(n^2/α^3)$ space algorithm that returns an $α$-approximation for matchings of arbitrary size. (Assadi et al. (2015) showed that this was optimal and independently and concurrently established the same upper bound.) We generalize both results for weighted matching. Third, there exists an $\tilde{O}(n^{4/5})$ space algorithm that returns a constant approximation in graphs with bounded arboricity. -- Vertex Cover and Hitting Set: There exists an $\tilde{O}(k^d)$ space algorithm that solves the minimum hitting set problem where $d$ is the cardinality of the input sets and $k$ is an upper bound on the size of the minimum hitting set. We prove this is optimal up to logarithmic factors. Our algorithm has $\tilde{O}(1)$ update time. The case $d=2$ corresponds to minimum vertex cover.
Finally, we consider a larger family of parameterized problems (including $b$-matching, disjoint paths, vertex coloring among others) for which our subgraph sampling primitive yields fast, small-space dynamic graph stream algorithms. We then show lower bounds for natural problems outside this family.

Submitted 7 May, 2015; originally announced May 2015.

arXiv:1504.06501 [pdf, other]

Run Generation Revisited: What Goes Up May or May Not Come Down

Authors: Michael A. Bender, Samuel McCauley, Andrew McGregor, Shikha Singh, Hoa T. Vu

Abstract: In this paper, we revisit the classic problem of run generation. Run generation is the first phase of external-memory sorting, where the objective is to scan through the data, reorder elements using a small buffer of size M, and output runs (contiguously sorted chunks of elements) that are as long as possible. We develop algorithms for minimizing the total number of runs (or equivalently, maximizing the average run length) when the runs are allowed to be sorted or reverse sorted. We study the problem in the online setting, both with and without resource augmentation, and in the offline setting. (1) We analyze alternating-up-down replacement selection (runs alternate between sorted and reverse sorted), which was studied by Knuth as far back as 1963. We show that this simple policy is asymptotically optimal. Specifically, we show that alternating-up-down replacement selection is 2-competitive and no deterministic online algorithm can perform better. (2) We give online algorithms having smaller competitive ratios with resource augmentation. Specifically, we exhibit a deterministic algorithm that, when given a buffer of size 4M, is able to match or beat any optimal algorithm having a buffer of size M. Furthermore, we present a randomized online algorithm which is 7/4-competitive when given a buffer twice that of the optimal. (3) We demonstrate that performance can also be improved with a small amount of foresight. We give an algorithm, which is 3/2-competitive, with foreknowledge of the next 3M elements of the input stream.
For the extreme case where all future elements are known, we design a PTAS for computing the optimal strategy a run generation algorithm must follow. (4) Finally, we present algorithms tailored for nearly sorted inputs which are guaranteed to have optimal solutions with sufficiently long runs.

Submitted 24 April, 2015; originally announced April 2015.
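
Illustrative note on the entry above: run generation with a buffer of size M can be sketched with classic replacement selection, which emits ascending runs only. This is the standard textbook policy, shown as a simplified stand-in; it is not the alternating-up-down variant analyzed in the paper. The function name `replacement_selection` and the sample stream are hypothetical.

```python
import heapq
from itertools import islice

def replacement_selection(stream, M):
    """Split `stream` into ascending runs using a buffer of at most M items.

    Items that can still extend the current run live in a min-heap;
    items too small for the current run wait in `pending` for the next one.
    """
    it = iter(stream)
    current = list(islice(it, M))   # fill the buffer
    heapq.heapify(current)
    pending = []                    # items reserved for the next run
    runs, run = [], []
    done = object()                 # sentinel marking end of input
    while current:
        smallest = heapq.heappop(current)
        run.append(smallest)
        nxt = next(it, done)
        if nxt is not done:
            if nxt >= smallest:
                heapq.heappush(current, nxt)   # can still join this run
            else:
                pending.append(nxt)            # must wait for the next run
        if not current:             # current run is exhausted
            runs.append(run)
            run = []
            current, pending = pending, []
            heapq.heapify(current)
    return runs

runs = replacement_selection([5, 1, 9, 2, 8, 3, 7], M=3)
print(runs)
```

On this small input the first run is longer than the buffer, which is the point of the technique: elements that arrive in time can extend the run currently being written.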

arXiv:1503.05225 [pdf, ps, other]

Sketching, Embedding, and Dimensionality Reduction for Information Spaces

Authors: Amirali Abdullah, Ravi Kumar, Andrew McGregor, Sergei Vassilvitskii, Suresh Venkatasubramanian

Abstract: Information distances like the Hellinger distance and the Jensen-Shannon divergence have deep roots in information theory and machine learning. They are used extensively in data analysis especially when the objects being compared are high dimensional empirical probability distributions built from data. However, we lack common tools needed to actually use information distances in applications efficiently and at scale with any kind of provable guarantees. We can't sketch these distances easily, or embed them in better behaved spaces, or even reduce the dimensionality of the space while maintaining the probability structure of the data. In this paper, we build these tools for information distances, both for the Hellinger distance and Jensen-Shannon divergence, as well as related measures, like the $χ^2$ divergence. We first show that they can be sketched efficiently (i.e. up to multiplicative error in sublinear space) in the aggregate streaming model. This result is exponentially stronger than known upper bounds for sketching these distances in the strict turnstile streaming model. Second, we show a finite dimensionality embedding result for the Jensen-Shannon and $χ^2$ divergences that preserves pairwise distances. Finally, we prove a dimensionality reduction result for the Hellinger, Jensen-Shannon, and $χ^2$ divergences that preserves the information geometry of the distributions (specifically, by retaining the simplex structure of the space).
While our second result above already implies that these divergences can be explicitly embedded in Euclidean space, retaining the simplex structure is important because it allows us to continue doing inference in the reduced space. In essence, we preserve not just the distance structure but the underlying geometry of the space.

Submitted 17 March, 2015; originally announced March 2015.

arXiv:1206.4668 [pdf]

Approximate Principal Direction Trees

Authors: Mark McCartin-Lim, Andrew McGregor, Rui Wang

Abstract: We introduce a new spatial data structure for high dimensional data called the approximate principal direction tree (APD tree) that adapts to the intrinsic dimension of the data. Our algorithm ensures vector-quantization accuracy similar to that of computationally-expensive PCA trees with similar time-complexity to that of lower-accuracy RP trees. APD trees use a small number of power-method iterations to find splitting planes for recursively partitioning the data. As such they provide a natural trade-off between the running-time and accuracy achieved by RP and PCA trees. Our theoretical results establish a) strong performance guarantees regardless of the convergence rate of the power-method and b) that $O(\log d)$ iterations suffice to establish the guarantee of PCA trees when the intrinsic dimension is $d$. We demonstrate this trade-off and the efficacy of our data structure on both the CPU and GPU.

Submitted 18 June, 2012; originally announced June 2012.

Comments: ICML2012

arXiv:1004.3304 [pdf, ps, other]

Information Cost Tradeoffs for Augmented Index and Streaming Language Recognition

Authors: Amit Chakrabarti, Graham Cormode, Ranganath Kondapally, Andrew McGregor

Abstract: This paper makes three main contributions to the theory of communication complexity and stream computation. First, we present new bounds on the information complexity of AUGMENTED-INDEX. In contrast to analogous results for INDEX by Jain, Radhakrishnan and Sen [J. ACM, 2009], we have to overcome the significant technical challenge that protocols for AUGMENTED-INDEX may violate the "rectangle property" due to the inherent input sharing. Second, we use these bounds to resolve an open problem of Magniez, Mathieu and Nayak [STOC, 2010] that asked about the multi-pass complexity of recognizing Dyck languages. This results in a natural separation between the standard multi-pass model and the multi-pass model that permits reverse passes. Third, we present the first passive memory checkers that verify the interaction transcripts of priority queues, stacks, and double-ended queues. We obtain tight upper and lower bounds for these problems, thereby addressing an important sub-class of the memory checking framework of Blum et al. [Algorithmica, 1994].

Submitted 19 April, 2010; originally announced April 2010.

arXiv:0912.4742 [pdf, other]

Optimizing Histogram Queries under Differential Privacy

Authors: Chao Li, Michael Hay, Vibhor Rastogi, Gerome Miklau, Andrew McGregor

Abstract: Differential privacy is a robust privacy standard that has been successfully applied to a range of data analysis tasks. Despite much recent work, optimal strategies for answering a collection of correlated queries are not known. We study the problem of devising a set of strategy queries, to be submitted and answered privately, that will support the answers to a given workload of queries. We propose a general framework in which query strategies are formed from linear combinations of counting queries, and we describe an optimal method for deriving new query answers from the answers to the strategy queries. Using this framework we characterize the error of strategies geometrically, and we propose solutions to the problem of finding optimal strategies.

Submitted 6 September, 2010; v1 submitted 23 December, 2009; originally announced December 2009.

Comments: 22 pages, 1 figure

arXiv:0808.2222 [pdf, ps, other]

Better Bounds for Frequency Moments in Random-Order Streams

Authors: Alexandr Andoni, Andrew McGregor, Krzysztof Onak, Rina Panigrahy

Abstract: Estimating frequency moments of data streams is a very well studied problem and tight bounds are known on the amount of space that is necessary and sufficient when the stream is adversarially ordered. Recently, motivated by various practical considerations and applications in learning and statistics, there has been growing interest in studying streams that are randomly ordered. In this paper, we improve the previous lower bounds on the space required to estimate the frequency moments of randomly ordered streams.

Submitted 15 August, 2008; originally announced August 2008.

Comments: 4 pages

arXiv:0710.0083 [pdf, ps, other]

Sorting and Selection with Random Costs

Authors: Stanislav Angelov, Keshav Kunal, Andrew McGregor

Abstract: There is a growing body of work on sorting and selection in models other than the unit-cost comparison model. This work is the first treatment of a natural stochastic variant of the problem where the cost of comparing two elements is a random variable. Each cost is chosen independently and is known to the algorithm. In particular we consider the following three models: each cost is chosen uniformly in the range $[0,1]$, each cost is 0 with some probability $p$ and 1 otherwise, or each cost is 1 with probability $p$ and infinite otherwise. We present lower and upper bounds (optimal in most cases) for these problems. We obtain our upper bounds by carefully designing algorithms to ensure that the costs incurred at various stages are independent and using properties of random partial orders when appropriate.

Submitted 29 September, 2007; originally announced October 2007.

arXiv:0704.2258 [pdf, ps, other]

On the Hardness of Approximating Stopping and Trapping Sets in LDPC Codes

Authors: Andrew McGregor, Olgica Milenkovic

Abstract: We prove that approximating the size of stopping and trapping sets in Tanner graphs of linear block codes, and more restrictively, the class of low-density parity-check (LDPC) codes, is NP-hard. The ramifications of our findings are that methods used for estimating the height of the error-floor of moderate- and long-length LDPC codes based on stopping and trapping set enumeration cannot provide accurate worst-case performance predictions.

Submitted 3 August, 2008; v1 submitted 17 April, 2007; originally announced April 2007.

Comments: 16 pages, 6 figures, submitted journal version

arXiv:cs/0612031 [pdf, ps, other]

Estimating Aggregate Properties on Probabilistic Streams

Authors: Andrew McGregor, S. Muthukrishnan

Abstract: The probabilistic-stream model was introduced by Jayram et al. [JKV07]. It is a generalization of the data stream model that is suited to handling "probabilistic" data where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilistic stream determines a distribution over potentially a very large number of classical "deterministic" streams where each item is deterministically one of the domain values. The probabilistic model is applicable for not only analyzing streams where the input has uncertainties (such as sensor data streams that measure physical processes) but also where the streams are derived from the input data by post-processing, such as tagging or reconciling inconsistent and poor quality data. We present streaming algorithms for computing commonly used aggregates on a probabilistic stream. We present the first known, one pass streaming algorithm for estimating the AVG, improving results in [JKV07]. We present the first known streaming algorithms for estimating the number of DISTINCT items on probabilistic streams. Further, we present extensions to other aggregates such as the repeat rate, quantiles, etc. In all cases, our algorithms work with provable accuracy guarantees and within the space constraints of the data stream model.

Submitted 5 December, 2006; originally announced December 2006.

Comments: 11 pages

arXiv:cs/0508122 [pdf, ps, other]

Streaming and Sublinear Approximation of Entropy and Information Distances

Authors: Sudipto Guha, Andrew McGregor, Suresh Venkatasubramanian

Abstract: In many problems in data mining and machine learning, data items that need to be clustered or classified are not points in a high-dimensional space, but are distributions (points on a high dimensional simplex). For distributions, natural measures of distance are not the $\ell_p$ norms and variants, but information-theoretic measures like the Kullback-Leibler distance, the Hellinger distance, and others. Efficient estimation of these distances is a key component in algorithms for manipulating distributions. Thus, sublinear resource constraints, either in time (property testing) or space (streaming) are crucial. We start by resolving two open questions regarding property testing of distributions. Firstly, we show a tight bound for estimating bounded, symmetric f-divergences between distributions in a general property testing (sublinear time) framework (the so-called combined oracle model). This yields optimal algorithms for estimating such well known distances as the Jensen-Shannon divergence and the Hellinger distance. Secondly, we close a $(\log n)/H$ gap between upper and lower bounds for estimating entropy $H$ in this model. In a stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant factor approximation scheme. We also provide other results along the space/time/approximation tradeoff curve.

Submitted 29 August, 2005; v1 submitted 27 August, 2005; originally announced August 2005.

Comments: 18 pages

arXiv:cs/0407011 [pdf, ps, other]

Distance distribution of binary codes and the error probability of decoding

Authors: Alexander Barg, Andrew McGregor

Abstract: We address the problem of bounding below the probability of error under maximum likelihood decoding of a binary code with a known distance distribution used on a binary symmetric channel. An improved upper bound is given for the maximum attainable exponent of this probability (the reliability function of the channel). In particular, we prove that the "random coding exponent" is the true value of the channel reliability for code rate $R$ in some interval immediately below the critical rate of the channel. An analogous result is obtained for the Gaussian channel.

Submitted 29 July, 2005; v1 submitted 4 July, 2004; originally announced July 2004.

Comments: 16 pages, 3 figures. Submitted to IEEE Transactions on Information Theory. The revision was done for a final journal version (it may still be different from the published version)

Journal ref: IEEE Transactions on Information Theory vol. 51, no. 12, pp. 4237-4246 (2005).

arXiv:nucl-ex/0205006 [pdf, ps, other]

First Results from the Sudbury Neutrino Observatory

Authors: Gordon A. McGregor

Abstract: The Sudbury Neutrino Observatory (SNO) is a water imaging Cherenkov detector. Utilising a 1 kilotonne ultra-pure D2O target, it is the first experiment to have equal sensitivity to all flavours of active neutrinos. This allows a solar-model independent test of the neutrino oscillation hypothesis to be made. Solar neutrinos from the decay of 8B have been detected at SNO by the charged-current (CC) interaction on the deuteron and by the elastic scattering (ES) of electrons. While the CC interaction is sensitive exclusively to electron neutrinos, the ES interaction has a small sensitivity to muon and tau neutrinos. In this paper, the recent solar neutrino results from the SNO experiment are presented. The measured ES interaction rate is found to be consistent with the high precision ES measurement from the Super-Kamiokande experiment. The electron neutrino flux deduced from the CC interaction rate in SNO differs from the Super-Kamiokande ES measurement by 3.3 sigma. This is evidence of an active neutrino component, in addition to electron neutrinos, in the solar neutrino flux. These results also allow the first experimental determination of the active 8B neutrino flux from the Sun, and this is found to be in good agreement with solar model predictions.

Submitted 14 May, 2002; originally announced May 2002.

Comments: 7 pages, 5 figures, to be published in the proceedings of the conference: Rencontres de Moriond, Electroweak Interactions and Unified Theories (2002)

Showing 1–38 of 38 results for author: McGregor, A