-
A Refined Approximation for Euclidean k-Means
Authors:
Fabrizio Grandoni,
Rafail Ostrovsky,
Yuval Rabani,
Leonard J. Schulman,
Rakesh Venkat
Abstract:
In the Euclidean $k$-Means problem we are given a collection of $n$ points $D$ in a Euclidean space and a positive integer $k$. Our goal is to identify a collection of $k$ points in the same space (centers) so as to minimize the sum of the squared Euclidean distances between each point in $D$ and the closest center. This problem is known to be APX-hard and the current best approximation ratio is a primal-dual $6.357$ approximation based on a standard LP for the problem [Ahmadian et al. FOCS'17, SICOMP'20].
In this note we show how a minor modification of Ahmadian et al.'s analysis leads to a slightly improved $6.12903$ approximation. As a related result, we also show that the mentioned LP has integrality gap at least $\frac{16+\sqrt{5}}{15}>1.2157$.
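As a small illustration of the objective described in this abstract, the following Python sketch evaluates the $k$-Means cost of a fixed set of centers and numerically checks the stated integrality-gap bound. The function name `kmeans_cost` and the toy data are assumptions made for this example; it does not implement the primal-dual algorithm or the LP from the paper.

```python
import math

def kmeans_cost(points, centers):
    """Sum, over all points, of the squared Euclidean distance to the closest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            for c in centers
        )
    return total

# Toy instance (hypothetical data): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
cost = kmeans_cost(points, centers)  # each point is at squared distance 1 from its center

# Numerical value of the integrality-gap lower bound quoted in the abstract.
gap_lower_bound = (16 + math.sqrt(5)) / 15
```

Here each of the four points lies at squared distance 1 from its nearest center, so the cost is 4.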
Submitted 20 September, 2021; v1 submitted 15 July, 2021;
originally announced July 2021.
-
Function Secret Sharing for PSI-CA: With Applications to Private Contact Tracing
Authors:
Samuel Dittmer,
Yuval Ishai,
Steve Lu,
Rafail Ostrovsky,
Mohamed Elsabagh,
Nikolaos Kiourtis,
Brian Schulte,
Angelos Stavrou
Abstract:
In this work we describe a token-based solution to Contact Tracing via Distributed Point Functions (DPF) and, more generally, Function Secret Sharing (FSS). The key idea behind the solution is that FSS natively supports secure keyword search on raw sets of keywords without a need for processing the keyword sets via a data structure for set membership. Furthermore, the FSS functionality enables adding up numerical payloads associated with multiple matches without additional interaction. These features make FSS an attractive tool for lightweight privacy-preserving searching on a database of tokens belonging to infected individuals.
Submitted 23 December, 2020;
originally announced December 2020.
-
Min-Sum Clustering (with Outliers)
Authors:
Sandip Banerjee,
Rafail Ostrovsky,
Yuval Rabani
Abstract:
We give a constant factor polynomial time pseudo-approximation algorithm for min-sum clustering with or without outliers. The algorithm is allowed to exclude an arbitrarily small constant fraction of the points. For instance, we show how to compute a solution that clusters 98\% of the input data points and pays no more than a constant factor times the optimal solution that clusters 99\% of the input data points. More generally, we give the following bicriteria approximation: For any $\varepsilon > 0$, for any instance with $n$ input points and for any positive integer $n'\le n$, we compute in polynomial time a clustering of at least $(1-\varepsilon) n'$ points of cost at most a constant factor greater than the optimal cost of clustering $n'$ points. The approximation guarantee grows with $\frac{1}{\varepsilon}$. Our results apply to instances of points in real space endowed with squared Euclidean distance, as well as to points in a metric space, where the number of clusters, and also the dimension if relevant, is arbitrary (part of the input, not an absolute constant).
Submitted 24 November, 2020;
originally announced November 2020.
-
A Combinatorial Characterization of Self-Stabilizing Population Protocols
Authors:
Shaan Mathur,
Rafail Ostrovsky
Abstract:
We fully characterize self-stabilizing functions in population protocols for complete interaction graphs. In particular, we investigate self-stabilization in systems of $n$ finite state agents in which a malicious scheduler selects an arbitrary sequence of pairwise interactions under a global fairness condition. We show a necessary and sufficient condition for self-stabilization. Specifically we show that functions without certain set-theoretic conditions are impossible to compute in a self-stabilizing manner. Our main contribution is in the converse, where we construct a self-stabilizing protocol for all other functions that meet this characterization. Our positive construction uses Dickson's Lemma to develop the notion of the root set, a concept that turns out to fundamentally characterize self-stabilization in this model. We believe it may lend to characterizing self-stabilization in more general models as well.
Submitted 12 October, 2020; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Population stability: regulating size in the presence of an adversary
Authors:
Shafi Goldwasser,
Rafail Ostrovsky,
Alessandra Scafuro,
Adam Sealfon
Abstract:
We introduce a new coordination problem in distributed computing that we call the population stability problem. A system of agents each with limited memory and communication, as well as the ability to replicate and self-destruct, is subjected to attacks by a worst-case adversary that can at a bounded rate (1) delete agents chosen arbitrarily and (2) insert additional agents with arbitrary initial state into the system. The goal is perpetually to maintain a population whose size is within a constant factor of the target size $N$. The problem is inspired by the ability of complex biological systems composed of a multitude of memory-limited individual cells to maintain a stable population size in an adverse environment. Such biological mechanisms allow organisms to heal after trauma or to recover from excessive cell proliferation caused by inflammation, disease, or normal development.
We present a population stability protocol in a communication model that is a synchronous variant of the population model of Angluin et al. In each round, pairs of agents selected at random meet and exchange messages, where at least a constant fraction of agents is matched in each round. Our protocol uses three-bit messages and $ω(\log^2 N)$ states per agent. We emphasize that our protocol can handle an adversary that can both insert and delete agents, a setting in which existing approximate counting techniques do not seem to apply. The protocol relies on a novel coloring strategy in which the population size is encoded in the variance of the distribution of colors. Individual agents can locally obtain a weak estimate of the population size by sampling from the distribution, and make individual decisions that robustly maintain a stable global population size.
Submitted 7 March, 2018;
originally announced March 2018.
-
Strictly Balancing Matrices in Polynomial Time Using Osborne's Iteration
Authors:
Rafail Ostrovsky,
Yuval Rabani,
Arman Yousefi
Abstract:
Osborne's iteration is a method for balancing $n\times n$ matrices which is widely used in linear algebra packages, as balancing preserves eigenvalues and stabilizes their numerical computation. The iteration can be implemented in any norm over $\mathbb{R}^n$, but it is normally used in the $L_2$ norm. The choice of norm not only affects the desired balance condition, but also defines the iterated balancing step itself.
In this paper we focus on Osborne's iteration in any $L_p$ norm, where $p < \infty$. We design a specific implementation of Osborne's iteration in any $L_p$ norm that converges to a strictly $ε$-balanced matrix in $\tilde{O}(ε^{-2}n^{9} K)$ iterations, where $K$ measures, roughly, the number of bits required to represent the entries of the input matrix.
This is the first result that proves that Osborne's iteration in the $L_2$ norm (or any $L_p$ norm, $p < \infty$) strictly balances matrices in polynomial time. This is a substantial improvement over our recent result (in SODA 2017) that showed weak balancing in $L_p$ norms. Previously, Schulman and Sinclair (STOC 2015) showed strong balancing of Osborne's iteration in the $L_\infty$ norm. Their result does not imply any bounds on strict balancing in other norms.
Submitted 24 April, 2017;
originally announced April 2017.
-
Matrix Balancing in Lp Norms: A New Analysis of Osborne's Iteration
Authors:
Rafail Ostrovsky,
Yuval Rabani,
Arman Yousefi
Abstract:
We study an iterative matrix conditioning algorithm due to Osborne (1960). The goal of the algorithm is to convert a square matrix into a balanced matrix where every row and corresponding column have the same norm. The original algorithm was proposed for balancing rows and columns in the $L_2$ norm, and it works by iterating over balancing a row-column pair in fixed round-robin order. Variants of the algorithm for other norms have been heavily studied and are implemented as standard preconditioners in many numerical linear algebra packages. Recently, Schulman and Sinclair (2015), in a first result of its kind for any norm, analyzed the rate of convergence of a variant of Osborne's algorithm that uses the $L_{\infty}$ norm and a different order of choosing row-column pairs. In this paper we study matrix balancing in the $L_1$ norm and other $L_p$ norms. We show the following results for any matrix $A = (a_{ij})_{i,j=1}^n$, resolving in particular a main open problem mentioned by Schulman and Sinclair.
1) We analyze the iteration for the $L_1$ norm under a greedy order of balancing. We show that it converges to an $ε$-balanced matrix in $K = O(\min\{ε^{-2}\log w,ε^{-1}n^{3/2}\log(w/ε)\})$ iterations that cost a total of $O(m + Kn\log n)$ arithmetic operations over $O(n\log w)$-bit numbers. Here $m$ is the number of non-zero entries of $A$, and $w = \sum_{i,j} |a_{ij}|/a_{\min}$ with $a_{\min} = \min\{|a_{ij}|:\ a_{ij}\neq 0\}$.
2) We show that the original round-robin implementation converges to an $ε$-balanced matrix in $O(ε^{-2}n^2\log w)$ iterations totalling $O(ε^{-2}mn\log w)$ arithmetic operations over $O(n\log w)$-bit numbers.
3) We demonstrate a lower bound of $Ω(1/\sqrtε)$ on the convergence rate of any implementation of the iteration.
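The round-robin balancing step described in this abstract can be sketched in Python for the $L_1$ norm as follows. This is a minimal illustration under stated assumptions: the function names, the stopping rule in `imbalance` (largest row/column gap divided by the total off-diagonal mass), and the `max_sweeps` cap are choices made for this example and are not the paper's exact $ε$-balance definition or its analyzed implementation.

```python
import math

def off_diagonal_l1(a, i):
    """Return (row norm, column norm) of index i in the L1 norm, ignoring the diagonal."""
    n = len(a)
    row = sum(abs(a[i][j]) for j in range(n) if j != i)
    col = sum(abs(a[j][i]) for j in range(n) if j != i)
    return row, col

def imbalance(a):
    """Illustrative imbalance measure (an assumption, not the paper's definition)."""
    n = len(a)
    total = sum(abs(a[i][j]) for i in range(n) for j in range(n) if i != j)
    if total == 0:
        return 0.0
    worst = max(abs(r - c) for r, c in (off_diagonal_l1(a, i) for i in range(n)))
    return worst / total

def osborne_l1(a, eps=1e-9, max_sweeps=10_000):
    """Round-robin Osborne iteration: repeatedly balance row i against column i."""
    a = [list(map(float, row)) for row in a]  # work on a copy
    n = len(a)
    for _ in range(max_sweeps):
        if imbalance(a) <= eps:
            break
        for i in range(n):
            row, col = off_diagonal_l1(a, i)
            if row == 0.0 or col == 0.0:
                continue  # nothing to balance for this index
            d = math.sqrt(col / row)
            # Similarity transform D A D^{-1} with D = diag(1, ..., d, ..., 1):
            # row i is scaled by d and column i by 1/d, so both norms become sqrt(row * col).
            for j in range(n):
                if j != i:
                    a[i][j] *= d
                    a[j][i] /= d
    return a

# Toy input (hypothetical): a weighted 3-cycle with weights 1, 4, 2.
balanced = osborne_l1([[0, 1, 0], [0, 0, 4], [2, 0, 0]])
```

Because every step is a diagonal similarity transform, the product of the weights around the cycle stays equal to 8, and the balanced matrix has all three weights close to 2.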
Submitted 26 June, 2016;
originally announced June 2016.
-
Space-Time Tradeoffs for Distributed Verification
Authors:
Rafail Ostrovsky,
Mor Perry,
Will Rosenbaum
Abstract:
Verifying that a network configuration satisfies a given boolean predicate is a fundamental problem in distributed computing. Many variations of this problem have been studied, for example, in the context of proof labeling schemes (PLS), locally checkable proofs (LCP), and non-deterministic local decision (NLD). In all of these contexts, verification time is assumed to be constant. Korman, Kutten and Masuzawa [PODC 2011] presented a proof-labeling scheme for MST, with poly-logarithmic verification time, and logarithmic memory at each vertex.
In this paper we introduce the notion of a $t$-PLS, which allows the verification procedure to run for super-constant time. Our work analyzes the tradeoffs of $t$-PLS between time, label size, message length, and computation space. We construct a universal $t$-PLS and prove that it uses the same amount of total communication as a known one-round universal PLS, and $t$ factor smaller labels. In addition, we provide a general technique to prove lower bounds for space-time tradeoffs of $t$-PLS. We use this technique to show an optimal tradeoff for testing that a network is acyclic (cycle free). Our optimal $t$-PLS for acyclicity uses label size and computation space $O((\log n)/t)$. We further describe a recursive $O(\log^* n)$ space verifier for acyclicity which does not assume previous knowledge of the run-time $t$.
Submitted 20 August, 2017; v1 submitted 22 May, 2016;
originally announced May 2016.
-
Coding for interactive communication correcting insertions and deletions
Authors:
Mark Braverman,
Ran Gelles,
Jieming Mao,
Rafail Ostrovsky
Abstract:
We consider the question of interactive communication, in which two remote parties perform a computation while their communication channel is (adversarially) noisy. We extend here the discussion into a more general and stronger class of noise, namely, we allow the channel to perform insertions and deletions of symbols. These types of errors may bring the parties "out of sync", so that there is no consensus regarding the current round of the protocol.
In this more general noise model, we obtain the first interactive coding scheme that has a constant rate and resists noise rates of up to $1/18-\varepsilon$. To this end we develop a novel primitive we name edit distance tree code. The edit distance tree code is designed to replace the Hamming distance constraints in Schulman's tree codes (STOC 93), with a stronger edit distance requirement. However, the straightforward generalization of tree codes to edit distance does not seem to yield a primitive that suffices for communication in the presence of synchronization problems. Giving the "right" definition of edit distance tree codes is a main conceptual contribution of this work.
Submitted 24 May, 2016; v1 submitted 3 August, 2015;
originally announced August 2015.
-
Weighted Sampling Without Replacement from Data Streams
Authors:
Vladimir Braverman,
Rafail Ostrovsky,
Gregory Vorsanger
Abstract:
Weighted sampling without replacement has proved to be a very important tool in designing new algorithms. Efraimidis and Spirakis (IPL 2006) presented an algorithm for weighted sampling without replacement from data streams. Their algorithm works under the assumption of precise computations over the interval [0,1]. Cohen and Kaplan (VLDB 2008) used similar methods for their bottom-k sketches.
Efraimidis and Spirakis ask as an open question whether using finite precision arithmetic impacts the accuracy of their algorithm. In this paper we show a method to avoid this problem by providing a precise reduction from k-sampling without replacement to k-sampling with replacement. We call the resulting method Cascade Sampling.
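For context, the key-based scheme of Efraimidis and Spirakis that this abstract refers to can be sketched in Python as below: each item draws a uniform random number $u \in [0,1)$, receives the key $u^{1/w}$ for its weight $w$, and the $k$ items with the largest keys are kept. This is a sketch of that earlier algorithm in ordinary floating-point arithmetic, which is exactly the precision assumption the abstract questions; it is not the Cascade Sampling method of the paper. The function name and the `rng` parameter are assumptions made for this example.

```python
import heapq
import random

def weighted_reservoir(stream, k, rng=None):
    """Keep the k items with the largest keys u ** (1 / weight) from a stream of (item, weight)."""
    rng = rng or random.Random()
    heap = []  # min-heap of (key, arrival index, item); the smallest key is evicted first
    for idx, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # items with non-positive weight are never sampled
        key = rng.random() ** (1.0 / weight)
        entry = (key, idx, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]

# Hypothetical stream of (item, weight) pairs.
stream = [("a", 1.0), ("b", 2.0), ("c", 3.0), ("d", 4.0)]
sample = weighted_reservoir(stream, 2, random.Random(0))
```

The sketch stores only $k$ entries at a time, so it runs in one pass over the stream.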
Submitted 4 June, 2015;
originally announced June 2015.
-
A randomized online quantile summary in $O(\frac{1}{\varepsilon} \log \frac{1}{\varepsilon})$ words
Authors:
David Felber,
Rafail Ostrovsky
Abstract:
A quantile summary is a data structure that approximates to $\varepsilon$-relative error the order statistics of a much larger underlying dataset.
In this paper we develop a randomized online quantile summary for the cash register data input model and comparison data domain model that uses $O(\frac{1}{\varepsilon} \log \frac{1}{\varepsilon})$ words of memory. This improves upon the previous best upper bound of $O(\frac{1}{\varepsilon} \log^{3/2} \frac{1}{\varepsilon})$ by Agarwal et al. (PODS 2012). Further, by a lower bound of Hung and Ting (FAW 2010) no deterministic summary for the comparison model can outperform our randomized summary in terms of space complexity. Lastly, our summary has the nice property that $O(\frac{1}{\varepsilon} \log \frac{1}{\varepsilon})$ words suffice to ensure that the success probability is $1 - e^{-\text{poly}(1/\varepsilon)}$.
Submitted 3 March, 2015;
originally announced March 2015.
-
Variability in data streams
Authors:
David Felber,
Rafail Ostrovsky
Abstract:
We consider the problem of tracking with small relative error an integer function $f(n)$ defined by a distributed update stream $f'(n)$. Existing streaming algorithms with worst-case guarantees for this problem assume $f(n)$ to be monotone; there are very large lower bounds on the space requirements for summarizing a distributed non-monotonic stream, often linear in the size $n$ of the stream.
Input streams that give rise to large space requirements are highly variable, making relatively large jumps from one timestep to the next. However, streams often vary slowly in practice. What has heretofore been lacking is a framework for non-monotonic streams that admits algorithms whose worst-case performance is as good as existing algorithms for monotone streams and degrades gracefully for non-monotonic streams as those streams vary more quickly.
In this paper we propose such a framework. We introduce a new stream parameter, the "variability" $v$, deriving its definition in a way that shows it to be a natural parameter to consider for non-monotonic streams. It is also a useful parameter. From a theoretical perspective, we can adapt existing algorithms for monotone streams to work for non-monotonic streams, with only minor modifications, in such a way that they reduce to the monotone case when the stream happens to be monotone, and in such a way that we can refine the worst-case communication bounds from $Θ(n)$ to $\tilde{O}(v)$. From a practical perspective, we demonstrate that $v$ can be small in practice by proving that $v$ is $O(\log f(n))$ for monotone streams and $o(n)$ for streams that are "nearly" monotone or that are generated by random walks. We expect $v$ to be $o(n)$ for many other interesting input classes as well.
Submitted 24 February, 2015;
originally announced February 2015.
-
It's Not Easy Being Three: The Approximability of Three-Dimensional Stable Matching Problems
Authors:
Rafail Ostrovsky,
Will Rosenbaum
Abstract:
In 1976, Knuth asked if the stable marriage problem (SMP) can be generalized to marriages consisting of 3 genders. In 1988, Alkan showed that the natural generalization of SMP to 3 genders ($3$GSM) need not admit a stable marriage. Three years later, Ng and Hirschberg proved that it is NP-complete to determine if given preferences admit a stable marriage. They further prove an analogous result for the $3$ person stable assignment ($3$PSA) problem.
In light of Ng and Hirschberg's NP-hardness result for $3$GSM and $3$PSA, we initiate the study of approximate versions of these problems. In particular, we describe two optimization variants of $3$GSM and $3$PSA: maximally stable marriage/matching (MSM) and maximum stable submarriage/submatching (MSS). We show that both variants are NP-hard to approximate within some fixed constant factor. Conversely, we describe a simple polynomial time algorithm which computes constant factor approximations for the maximally stable marriage and matching problems. Thus both variants of MSM are APX-complete.
Submitted 2 December, 2014;
originally announced December 2014.
-
Fast distributed almost stable marriages
Authors:
Rafail Ostrovsky,
Will Rosenbaum
Abstract:
In their seminal work on the Stable Marriage Problem, Gale and Shapley describe an algorithm which finds a stable matching in $O(n^2)$ communication rounds. Their algorithm has a natural interpretation as a distributed algorithm where each player is represented by a single processor. In this distributed model, Floreen, Kaski, Polishchuk, and Suomela recently showed that for bounded preference lists, terminating the Gale-Shapley algorithm after a constant number of rounds results in an almost stable matching. In this paper, we describe a new deterministic distributed algorithm which finds an almost stable matching in $O(\log^5 n)$ communication rounds for arbitrary preferences. We also present a faster randomized variant which requires $O(\log^2 n)$ rounds. This run-time can be improved to $O(1)$ rounds for "almost regular" (and in particular complete) preferences. To our knowledge, these are the first sub-polynomial round distributed algorithms for any variant of the stable marriage problem with unbounded preferences.
Submitted 2 April, 2015; v1 submitted 12 August, 2014;
originally announced August 2014.
-
Universal Streaming
Authors:
Vladimir Braverman,
Rafail Ostrovsky,
Alan Roytman
Abstract:
Given a stream of data, a typical approach in streaming algorithms is to design a sophisticated algorithm with small memory that computes a specific statistic over the streaming data. Usually, if one wants to compute a different statistic after the stream is gone, it is impossible. But what if we want to compute a different statistic after the fact? In this paper, we consider the following fascinating possibility: can we collect some small amount of specific data during the stream that is "universal," i.e., where we do not know anything about the statistics we will want to later compute, other than the guarantee that had we known the statistic ahead of time, it would have been possible to do so with small memory? In other words, is it possible to collect some data in small space during the stream, such that any other statistic that can be computed with comparable space can be computed after the fact? This is indeed what we introduce (and show) in this paper with matching upper and lower bounds: we show that it is possible to collect universal statistics of polylogarithmic size, and prove that these universal statistics allow us after the fact to compute all other statistics that are computable with similar amounts of memory. We show that this is indeed possible, both for the standard unbounded streaming model and the sliding window streaming model.
Submitted 11 August, 2014;
originally announced August 2014.
-
On The Communication Complexity of Finding an (Approximate) Stable Marriage
Authors:
Rafail Ostrovsky,
Will Rosenbaum
Abstract:
In this paper, we consider the communication complexity of protocols that compute stable matchings. We work within the context of Gale and Shapley's original stable marriage problem [GS62]: $n$ men and $n$ women each privately hold a total and strict ordering on all of the members of the opposite gender. They wish to collaborate in order to find a stable matching---a pairing of the men and women such that no unmatched pair mutually prefer each other to their assigned partners in the matching. We show that any communication protocol (deterministic, nondeterministic, or randomized) that correctly outputs a stable matching requires $Ω(n^2)$ bits of communication. Thus, the original algorithm of Gale and Shapley is communication-optimal up to a logarithmic factor. We then introduce a "divorce metric" on the set of all matchings, which allows us to consider approximately stable matchings. We describe an efficient algorithm to compute the "distance to stability" of a given matching. We then show that even under the relaxed requirement that a protocol only yield an approximate stable matching, the $Ω(n^2)$ communication lower bound still holds.
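The stability condition in this abstract (no unmatched pair mutually prefers each other to their assigned partners) can be checked directly. The Python sketch below lists such blocking pairs for a given perfect matching. The function name and the representation of preferences as ranked lists of indices are assumptions made for this example. Counting blocking pairs is only a simple proxy for instability; it is not the "divorce metric" or the "distance to stability" defined in the paper.

```python
def blocking_pairs(men_prefs, women_prefs, wife):
    """Return all (man, woman) pairs who prefer each other to their assigned partners.

    men_prefs[m] and women_prefs[w] are complete preference lists, most preferred first.
    wife[m] is the woman matched to man m; the matching is assumed to be perfect.
    """
    n = len(wife)
    husband = [None] * n
    for m, w in enumerate(wife):
        husband[w] = m
    # rank tables: smaller rank means more preferred
    m_rank = [{w: r for r, w in enumerate(prefs)} for prefs in men_prefs]
    w_rank = [{m: r for r, m in enumerate(prefs)} for prefs in women_prefs]
    pairs = []
    for m in range(n):
        for w in range(n):
            if wife[m] == w:
                continue
            man_prefers = m_rank[m][w] < m_rank[m][wife[m]]
            woman_prefers = w_rank[w][m] < w_rank[w][husband[w]]
            if man_prefers and woman_prefers:
                pairs.append((m, w))
    return pairs

# Hypothetical 2x2 instance: both men rank woman 0 first; woman 0 ranks man 1 first.
men_prefs = [[0, 1], [0, 1]]
women_prefs = [[1, 0], [0, 1]]
stable = blocking_pairs(men_prefs, women_prefs, [1, 0])    # man 0 with woman 1, man 1 with woman 0
unstable = blocking_pairs(men_prefs, women_prefs, [0, 1])  # man 0 with woman 0, man 1 with woman 1
```

In the second matching, man 1 and woman 0 prefer each other to their partners, so the pair (1, 0) is blocking; the first matching has no blocking pair.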
Submitted 9 October, 2014; v1 submitted 5 June, 2014;
originally announced June 2014.
-
A Stable Marriage Requires Communication
Authors:
Yannai A. Gonczarowski,
Noam Nisan,
Rafail Ostrovsky,
Will Rosenbaum
Abstract:
The Gale-Shapley algorithm for the Stable Marriage Problem is known to take $Θ(n^2)$ steps to find a stable marriage in the worst case, but only $Θ(n \log n)$ steps in the average case (with $n$ women and $n$ men). In 1976, Knuth asked whether the worst-case running time can be improved in a model of computation that does not require sequential access to the whole input. A partial negative answer was given by Ng and Hirschberg, who showed that $Θ(n^2)$ queries are required in a model that allows certain natural random-access queries to the participants' preferences. A significantly more general - albeit slightly weaker - lower bound follows from Segal's general analysis of communication complexity, namely that $Ω(n^2)$ Boolean queries are required in order to find a stable marriage, regardless of the set of allowed Boolean queries.
Using a reduction to the communication complexity of the disjointness problem, we give a far simpler, yet significantly more powerful argument showing that $Ω(n^2)$ Boolean queries of any type are indeed required for finding a stable - or even an approximately stable - marriage. Notably, unlike Segal's lower bound, our lower bound generalizes also to (A) randomized algorithms, (B) allowing arbitrary separate preprocessing of the women's preferences profile and of the men's preferences profile, (C) several variants of the basic problem, such as whether a given pair is married in every/some stable marriage, and (D) determining whether a proposed marriage is stable or far from stable. In order to analyze "approximately stable" marriages, we introduce the notion of "distance to stability" and provide an efficient algorithm for its computation.
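For reference, the sequential Gale-Shapley algorithm whose step count this abstract discusses can be sketched in Python as follows, with men proposing and a counter for the number of proposals. The function name and the list-of-indices preference format are assumptions made for this example; the sketch illustrates the classic algorithm only, not the lower-bound argument of the paper.

```python
def gale_shapley(men_prefs, women_prefs):
    """Men-proposing Gale-Shapley. Returns (wife, proposals), where wife[m] is m's partner."""
    n = len(men_prefs)
    # rank[w][m]: position of man m in woman w's list (smaller is better)
    rank = [{m: r for r, m in enumerate(prefs)} for prefs in women_prefs]
    next_choice = [0] * n   # index of the next woman each man will propose to
    wife = [None] * n
    husband = [None] * n
    free = list(range(n))   # men who are currently unmatched
    proposals = 0
    while free:
        m = free.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        proposals += 1
        current = husband[w]
        if current is None:
            husband[w], wife[m] = m, w
        elif rank[w][m] < rank[w][current]:
            # w trades up: her previous partner becomes free again
            husband[w], wife[m] = m, w
            wife[current] = None
            free.append(current)
        else:
            free.append(m)  # rejected; m will propose to his next choice later
    return wife, proposals

# Hypothetical 2x2 instance: both men rank woman 0 first; woman 0 ranks man 1 first.
wife, proposals = gale_shapley([[0, 1], [0, 1]], [[1, 0], [0, 1]])
```

On this instance man 1 wins woman 0, man 0 is rejected once and then matches with woman 1, for three proposals in total. Each man proposes to each woman at most once, which gives the $O(n^2)$ upper bound on the number of steps.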
Submitted 25 July, 2018; v1 submitted 29 May, 2014;
originally announced May 2014.
-
Improved Approximation Algorithms for Earth-Mover Distance in Data Streams
Authors:
Arman Yousefi,
Rafail Ostrovsky
Abstract:
For two multisets $S$ and $T$ of points in $[Δ]^2$, such that $|S| = |T|= n$, the earth-mover distance (EMD) between $S$ and $T$ is the minimum cost of a perfect bipartite matching with edges between points in $S$ and $T$, i.e., $EMD(S,T) = \min_{π:S\rightarrow T}\sum_{a\in S}||a-π(a)||_1$, where $π$ ranges over all one-to-one mappings. The sketching complexity of approximating earth-mover distance in the two-dimensional grid is mentioned as one of the open problems in the literature. We give two algorithms for computing EMD between two multi-sets when the number of distinct points in one set is a small value $k=\log^{O(1)}(Δn)$. Our first algorithm gives a $(1+ε)$-approximation using $O(kε^{-2}\log^{4}n)$ space and works only in the insertion-only model. The second algorithm gives an $O(\min(k^3,\logΔ))$-approximation using $O(\log^{3}Δ\cdot\log\logΔ\cdot\log n)$-space in the turnstile model.
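The definition of EMD given in this abstract can be evaluated directly on tiny inputs. The Python sketch below tries every one-to-one mapping and takes the cheapest one under the $L_1$ distance. It is an exponential-time illustration of the definition only, written under the assumption that points are given as coordinate pairs; it is unrelated to the streaming algorithms of the paper, and the function name is hypothetical.

```python
from itertools import permutations

def emd_bruteforce(S, T):
    """Minimum total L1 cost over all one-to-one mappings from S to T (tiny inputs only)."""
    if len(S) != len(T):
        raise ValueError("S and T must have the same number of points")
    return min(
        sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(S, perm))
        for perm in permutations(T)
    )

# Hypothetical toy multisets on the grid.
S = [(0, 0), (2, 2)]
T = [(2, 1), (0, 1)]
distance = emd_bruteforce(S, T)
```

Matching $(0,0)$ to $(0,1)$ and $(2,2)$ to $(2,1)$ costs 2, while the other mapping costs 6, so the EMD of this instance is 2.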
Submitted 24 April, 2014;
originally announced April 2014.
-
Local Correctability of Expander Codes
Authors:
Brett Hemenway,
Rafail Ostrovsky,
Mary Wootters
Abstract:
In this work, we present the first local-decoding algorithm for expander codes. This yields a new family of constant-rate codes that can recover from a constant fraction of errors in the codeword symbols, and where any symbol of the codeword can be recovered with high probability by reading $N^ε$ symbols from the corrupted codeword, where $N$ is the block-length of the code.
Expander codes, introduced by Sipser and Spielman, are formed from an expander graph $G = (V,E)$ of degree $d$, and an inner code of block-length $d$ over an alphabet $Σ$. Each edge of the expander graph is associated with a symbol in $Σ$. A string in $Σ^{E}$ will be a codeword if for each vertex in $V$, the symbols on the adjacent edges form a codeword in the inner code.
We show that if the inner code has a smooth reconstruction algorithm in the noiseless setting, then the corresponding expander code has an efficient local-correction algorithm in the noisy setting. Instantiating our construction with inner codes based on finite geometries, we obtain novel locally decodable codes with rate approaching one. This provides an alternative to the multiplicity codes of Kopparty, Saraf and Yekhanin (STOC '11) and the lifted codes of Guo, Kopparty and Sudan (ITCS '13).
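The codeword condition for expander codes described above can be sketched as a membership check. This is only an illustration of the definition, not the local-correction algorithm; the function name, the dictionary-based graph representation, and the toy example (the complete graph on 4 vertices with a length-3 even-parity inner code, which is not claimed to be a good expander) are all assumptions.

```python
def is_expander_codeword(adjacent_edges, labels, inner_code):
    """Check the expander-code condition at every vertex.

    adjacent_edges: dict vertex -> ordered list of incident edge ids
    labels:         dict edge id -> symbol from the alphabet
    inner_code:     set of tuples, the codewords of the inner code
    """
    return all(
        tuple(labels[e] for e in edges) in inner_code
        for edges in adjacent_edges.values()
    )

# Toy instance (illustrative only): K4 is 3-regular, so the inner code
# has block length 3. Edge ids: 0:(0,1) 1:(0,2) 2:(0,3) 3:(1,2) 4:(1,3) 5:(2,3)
K4_ADJACENT = {0: [0, 1, 2], 1: [0, 3, 4], 2: [1, 3, 5], 3: [2, 4, 5]}
EVEN_PARITY = {(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)}
```

With this inner code, labeling the edges of any cycle with 1 and all other edges with 0 satisfies the check, since every vertex then sees an even number of ones.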
Submitted 7 January, 2015; v1 submitted 30 April, 2013;
originally announced April 2013.
-
Secure End-to-End Communication with Optimal Throughput in Unreliable Networks
Authors:
Paul Bunn,
Rafail Ostrovsky
Abstract:
We demonstrate the feasibility of end-to-end communication in highly unreliable networks. Modeling a network as a graph with vertices representing nodes and edges representing the links between them, we consider two forms of unreliability: unpredictable edge-failures, and deliberate deviation from protocol specifications by corrupt nodes.
We present a robust routing protocol for end-to-end communication that is simultaneously resilient to both forms of unreliability. In particular, we prove rigorously that our protocol is SECURE against the actions of the corrupt nodes, achieves correctness (Receiver gets ALL of the messages from Sender, in order and without modification), and enjoys provably optimal throughput performance, as measured using competitive analysis.
Furthermore, our protocol does not incur any asymptotic memory overhead as compared to other protocols that are unable to handle malicious interference of corrupt nodes. In particular, our protocol requires O(n^2) memory per processor, where n is the size of the network. This represents an O(n^2) improvement over all existing protocols that have been designed for this network model.
Submitted 26 October, 2013; v1 submitted 9 April, 2013;
originally announced April 2013.
-
How Hard is Counting Triangles in the Streaming Model
Authors:
Vladimir Braverman,
Rafail Ostrovsky,
Dan Vilenchik
Abstract:
The problem of (approximately) counting the number of triangles in a graph is one of the basic problems in graph theory. In this paper we study the problem in the streaming model. We study the amount of memory required by a randomized algorithm to solve this problem. In case the algorithm is allowed one pass over the stream, we present a best possible lower bound of $Ω(m)$ for graphs $G$ with $m$ edges on $n$ vertices. If a constant number of passes is allowed, we show a lower bound of $Ω(m/T)$, where $T$ is the number of triangles. We match, in some sense, this lower bound with a 2-pass $O(m/T^{1/3})$-memory algorithm that solves the problem of distinguishing graphs with no triangles from graphs with at least $T$ triangles. We present a new graph parameter $ρ(G)$ -- the triangle density, and conjecture that the space complexity of the triangles problem is $Ω(m/ρ(G))$. We match this by a second algorithm that solves the distinguishing problem using $O(m/ρ(G))$-memory.
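For reference, the trivial one-pass baseline simply stores every edge and counts exactly, using memory linear in $m$. The sketch below shows that baseline only; it is not either of the paper's algorithms, and the function name `count_triangles` is a hypothetical choice. Vertex labels are assumed to be mutually comparable (for example, integers).

```python
from collections import defaultdict

def count_triangles(edges):
    """Exact triangle count, storing the whole graph in memory."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:              # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    total = 0
    for u in list(adj):
        for v in adj[u]:
            if u < v:
                # count each triangle once, as u < v < w
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total
```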
Submitted 4 April, 2013;
originally announced April 2013.
-
Approximating Large Frequency Moments with Pick-and-Drop Sampling
Authors:
Vladimir Braverman,
Rafail Ostrovsky
Abstract:
Given a data stream $D = \{p_1,p_2,...,p_m\}$ of size $m$ of numbers from $\{1,..., n\}$, the frequency of $i$ is defined as $f_i = |\{j: p_j = i\}|$. The $k$-th frequency moment of $D$ is defined as $F_k = \sum_{i=1}^n f_i^k$. We consider the problem of approximating frequency moments in insertion-only streams for $k\ge 3$. For any constant $c$ we show an $O(n^{1-2/k}\log(n)\log^{(c)}(n))$ upper bound on the space complexity of the problem. Here $\log^{(c)}(n)$ is the iterative $\log$ function. To simplify the presentation, we make the following assumptions: $n$ and $m$ are polynomially far; approximation error $ε$ and parameter $k$ are constants. We observe a natural bijection between streams and special matrices. Our main technical contribution is a non-uniform sampling method on matrices. We call our method pick-and-drop sampling; it samples a heavy element (i.e., element $i$ with frequency $Ω(F_k)$) with probability $Ω(1/n^{1-2/k})$ and gives approximation $\tilde{f_i} \ge (1-ε)f_i$. In addition, the estimations never exceed the real values, that is $\tilde{f_j} \le f_j$ for all $j$. As a result, we reduce the space complexity of finding a heavy element to $O(n^{1-2/k}\log(n))$ bits. We apply our method of recursive sketches and resolve the problem with $O(n^{1-2/k}\log(n)\log^{(c)}(n))$ bits.
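The definitions of $f_i$ and $F_k$ can be checked with a short exact computation. This is a baseline that keeps one counter per distinct element, so its space grows linearly with the number of distinct elements rather than as $n^{1-2/k}$; it is not pick-and-drop sampling, and the function name `frequency_moment` is a hypothetical choice.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum_i f_i^k of a stream."""
    freqs = Counter(stream)          # f_i for every element that occurs
    # elements that never occur have f_i = 0 and add nothing for k >= 1
    return sum(f ** k for f in freqs.values())
```

For the stream `[1, 2, 2, 3, 3, 3]` and $k = 3$ the frequencies are 1, 2 and 3, so $F_3 = 1 + 8 + 27 = 36$.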
Submitted 4 December, 2012; v1 submitted 2 December, 2012;
originally announced December 2012.
-
How to Catch L_2-Heavy-Hitters on Sliding Windows
Authors:
Vladimir Braverman,
Ran Gelles,
Rafail Ostrovsky
Abstract:
Finding heavy elements (heavy-hitters) in streaming data is one of the central and well-understood tasks. Despite the importance of this problem, when considering the sliding windows model of streaming (where elements eventually expire) the problem of finding L_2-heavy elements has remained completely open despite multiple papers and considerable success in finding L_1-heavy elements.
In this paper, we develop the first poly-logarithmic-memory algorithm for finding L_2-heavy elements in the sliding window model. Since L_2-heavy elements play a central role for many fundamental streaming problems (such as frequency moments), we believe our method would be extremely useful for many sliding-windows algorithms and applications. For example, our technique allows us not only to find L_2-heavy elements, but also heavy elements with respect to any L_p for 0<p<2 on sliding windows. Thus, our paper completely resolves the question of finding L_p-heavy elements for sliding windows with poly-logarithmic memory for all values of p since it is well known that for p>2 this task is impossible.
Our method may have other applications as well. We demonstrate a broader applicability of our novel yet simple method on two additional examples: we show how to obtain a sliding window approximation of other properties such as the similarity of two streams, or the fraction of elements that appear exactly a specified number of times within the window (the rarity problem). In these two illustrative examples of our method, we replace the current expected memory bounds with worst case bounds.
Submitted 16 April, 2013; v1 submitted 14 December, 2010;
originally announced December 2010.
-
Rademacher Chaos, Random Eulerian Graphs and The Sparse Johnson-Lindenstrauss Transform
Authors:
Vladimir Braverman,
Rafail Ostrovsky,
Yuval Rabani
Abstract:
The celebrated dimension reduction lemma of Johnson and Lindenstrauss has numerous computational and other applications. Due to its application in practice, speeding up the computation of a Johnson-Lindenstrauss style dimension reduction is an important question. Recently, Dasgupta, Kumar, and Sarlos (STOC 2010) constructed such a transform that uses a sparse matrix. This is motivated by the desire to speed up the computation when applied to sparse input vectors, a scenario that comes up in applications. The sparsity of their construction was further improved by Kane and Nelson (ArXiv 2010).
We improve the previous bound on the number of non-zero entries per column of Kane and Nelson from $O(1/ε\log(1/δ)\log(k/δ))$ (where the target dimension is $k$, the distortion is $1\pm ε$, and the failure probability is $δ$) to $$ O\left({1\overε} \left({\log(1/δ)\log\log\log(1/δ) \over \log\log(1/δ)}\right)^2\right). $$
We also improve the amount of randomness needed to generate the matrix. Our results are obtained by connecting the moments of an order 2 Rademacher chaos to the combinatorial properties of random Eulerian multigraphs. Estimating the chance that a random multigraph is composed of a given number of node-disjoint Eulerian components leads to a new tail bound on the chaos. Our estimates may be of independent interest, and as this part of the argument is decoupled from the analysis of the coefficients of the chaos, we believe that our methods can be useful in the analysis of other chaoses.
Submitted 11 November, 2010;
originally announced November 2010.
-
Recursive Sketching For Frequency Moments
Authors:
Vladimir Braverman,
Rafail Ostrovsky
Abstract:
In a ground-breaking paper, Indyk and Woodruff (STOC 05) showed how to compute $F_k$ (for $k>2$) in space complexity $O(\mbox{\em poly-log}(n,m)\cdot n^{1-\frac2k})$, which is optimal up to (large) poly-logarithmic factors in $n$ and $m$, where $m$ is the length of the stream and $n$ is the upper bound on the number of distinct elements in a stream. The best known lower bound for large moments is $Ω(\log(n)n^{1-\frac2k})$. A follow-up work of Bhuvanagiri, Ganguly, Kesh and Saha (SODA 2006) reduced the poly-logarithmic factors of Indyk and Woodruff to $O(\log^2(m)\cdot (\log n+ \log m)\cdot n^{1-{2\over k}})$. Further reduction of poly-log factors has been an elusive goal since 2006, when the Indyk and Woodruff method seemed to hit a natural "barrier." Using our simple recursive sketch, we provide a different yet simple approach to obtain an $O(\log(m)\log(nm)\cdot (\log\log n)^4\cdot n^{1-{2\over k}})$ algorithm for constant $ε$ (our bound is, in fact, somewhat stronger, where the $(\log\log n)$ term can be replaced by any constant number of $\log$ iterations instead of just two or three, thus approaching $\log^* n$). Our bound also works for non-constant $ε$ (for details see the body of the paper). Further, our algorithm requires only $4$-wise independence, in contrast to existing methods that use pseudo-random generators for computing large frequency moments.
Submitted 11 November, 2010;
originally announced November 2010.
-
Deterministic and Energy-Optimal Wireless Synchronization
Authors:
Leonid Barenboim,
Shlomi Dolev,
Rafail Ostrovsky
Abstract:
We consider the problem of clock synchronization in a wireless setting where processors must power-down their radios in order to save energy. Energy efficiency is a central goal in wireless networks, especially if energy resources are severely limited. In the current setting, the problem is to synchronize clocks of $m$ processors that wake up in arbitrary time points, such that the maximum difference between wake up times is bounded by a positive integer $n$, where time intervals are appropriately discretized. Currently, the best-known results for synchronization in single-hop networks of $m$ processors are a randomized algorithm due to [BKO09] with $O(\sqrt{n/m} \cdot \text{poly-log}(n))$ awake times per processor and a lower bound of $Ω(\sqrt{n/m})$ on the number of awake times needed per processor [BKO09]. The main open question left in their work is to close the poly-log gap between the upper and the lower bound and to de-randomize their probabilistic construction and eliminate error probability. This is exactly what we do in this paper.
That is, we show a deterministic algorithm with radio use of $Θ(\sqrt{n/m})$ that never fails. We stress that our upper bound exactly matches the lower bound proven in [BKO09], up to a small multiplicative constant. Therefore, our algorithm is optimal in terms of energy efficiency and completely resolves a long sequence of works in this area. In order to achieve these results we devise a novel adaptive technique that determines the times when devices power their radios on and off. In addition, we prove several lower bounds on the energy efficiency of algorithms for multi-hop networks. Specifically, we show that any algorithm for multi-hop networks must have radio use of $Ω(\sqrt{n})$ per processor.
Submitted 6 October, 2010;
originally announced October 2010.
-
Position-Based Quantum Cryptography: Impossibility and Constructions
Authors:
Harry Buhrman,
Nishanth Chandran,
Serge Fehr,
Ran Gelles,
Vipul Goyal,
Rafail Ostrovsky,
Christian Schaffner
Abstract:
In this work, we study position-based cryptography in the quantum setting. The aim is to use the geographical position of a party as its only credential. On the negative side, we show that if adversaries are allowed to share an arbitrarily large entangled quantum state, no secure position-verification is possible at all. We show a distributed protocol for computing any unitary operation on a state shared between the different users, using local operations and one round of classical communication. Using this surprising result, we break any position-verification scheme of a very general form. On the positive side, we show that if adversaries do not share any entangled quantum state but can compute arbitrary quantum operations, secure position-verification is achievable. Jointly, these results suggest the interesting question whether secure position-verification is possible in case of a bounded amount of entanglement. Our positive result can be interpreted as resolving this question in the simplest case, where the bound is set to zero.
In models where secure positioning is achievable, it has a number of interesting applications. For example, it enables secure communication over an insecure channel without having any pre-shared key, with the guarantee that only a party at a specific location can learn the content of the conversation. More generally, we show that in settings where secure position-verification is achievable, other position-based cryptographic schemes are possible as well, such as secure position-based authentication and position-based key agreement.
Submitted 12 August, 2011; v1 submitted 13 September, 2010;
originally announced September 2010.
-
Throughput in Asynchronous Networks
Authors:
Paul Bunn,
Rafail Ostrovsky
Abstract:
We introduce a new, "worst-case" model for an asynchronous communication network and investigate the simplest (yet central) task in this model, namely the feasibility of end-to-end routing. Motivated by the question of how successful a protocol can hope to perform in a network whose reliability is guaranteed by as few assumptions as possible, we combine the main "unreliability" features encountered in network models in the literature, allowing our model to exhibit all of these characteristics simultaneously. In particular, our model captures networks that exhibit the following properties: 1) On-line; 2) Dynamic Topology; 3) Distributed/Local Control; 4) Asynchronous Communication; 5) (Polynomially) Bounded Memory; 6) No Minimal Connectivity Assumptions. In the confines of this network, we evaluate throughput performance and prove matching upper and lower bounds. In particular, using competitive analysis (perhaps somewhat surprisingly) we prove that the optimal competitive ratio of any on-line protocol is 1/n (where n is the number of nodes in the network), and then we describe a specific protocol and prove that it is n-competitive. The model we describe in the paper and for which we achieve the above matching upper and lower bounds for throughput represents the "worst-case" network, in that it makes no reliability assumptions. In many practical applications, the optimal competitive ratio of 1/n may be unacceptable, and consequently stronger assumptions must be imposed on the network to improve performance. However, we believe that a fundamental starting point to understanding which assumptions are necessary to impose on a network model, given some desired throughput performance, is to understand what is achievable in the worst case for the simplest task (namely end-to-end routing).
Submitted 23 October, 2009;
originally announced October 2009.
-
Measuring Independence of Datasets
Authors:
Vladimir Braverman,
Rafail Ostrovsky
Abstract:
A data stream model represents a setting where approximating pairwise, or $k$-wise, independence with sublinear memory is of considerable importance. In the streaming model the joint distribution is given by a stream of $k$-tuples, with the goal of testing correlations among the components measured over the entire stream. In the streaming model, Indyk and McGregor (SODA 08) recently gave exciting new results for measuring pairwise independence. The Indyk and McGregor methods provide $\log{n}$-approximation under statistical distance between the joint and product distributions in the streaming model. Indyk and McGregor leave, as their main open question, the problem of improving their $\log n$-approximation for the statistical distance metric.
In this paper we solve the main open problem posed by Indyk and McGregor for the statistical distance for pairwise independence and extend this result to any constant $k$. In particular, we present an algorithm that computes an $(ε, δ)$-approximation of the statistical distance between the joint and product distributions defined by a stream of $k$-tuples. Our algorithm requires $O(({1\over ε}\log({nm\over δ}))^{(30+k)^k})$ memory and a single pass over the data stream.
Submitted 28 February, 2009;
originally announced March 2009.
-
Near-Optimal Radio Use For Wireless Network Synchronization
Authors:
Milan Bradonjic,
Eddie Kohler,
Rafail Ostrovsky
Abstract:
We consider the model of communication where wireless devices can either switch their radios off to save energy, or switch their radios on and engage in communication. We distill a clean theoretical formulation of this problem of minimizing radio use and present near-optimal solutions. Our base model ignores issues of communication interference, although we also extend the model to handle this requirement. We assume that nodes intend to communicate periodically, or according to some time-based schedule. Clearly, perfectly synchronized devices could switch their radios on for exactly the minimum periods required by their joint schedules. The main challenge in the deployment of wireless networks is to synchronize the devices' schedules, given that their initial schedules may be offset relative to one another (even if their clocks run at the same speed). We significantly improve previous results, and show optimal use of the radio for two processors and near-optimal use of the radio for synchronization of an arbitrary number of processors. In particular, for two processors we prove deterministically matching $Θ(\sqrt{n})$ upper and lower bounds on the number of times the radio has to be on, where $n$ is the discretized uncertainty period of the clock shift between the two processors. (In contrast, all previous results for two processors are randomized.) For $m=n^β$ processors (for any $β< 1$) we prove $Ω(n^{(1-β)/2})$ is the lower bound on the number of times the radio has to be switched on (per processor), and show a nearly matching (in terms of the radio use) $Õ(n^{(1-β)/2})$ randomized upper bound per processor, with failure probability exponentially close to 0. For $β\geq 1$ our algorithm runs with at most $poly-log(n)$ radio invocations per processor. Our bounds also hold in a radio-broadcast model where interference must be taken into account.
△ Less
Submitted 13 February, 2012; v1 submitted 9 October, 2008;
originally announced October 2008.
-
AMS Without 4-Wise Independence on Product Domains
Authors:
Vladimir Braverman,
Kai-Min Chung,
Zhenming Liu,
Michael Mitzenmacher,
Rafail Ostrovsky
Abstract:
In their seminal work, Alon, Matias, and Szegedy introduced several sketching techniques, including showing that 4-wise independence is sufficient to obtain good approximations of the second frequency moment. In this work, we show that their sketching technique can be extended to product domains $[n]^k$ by using the product of 4-wise independent functions on $[n]$. Our work extends that of Indyk and McGregor, who showed the result for $k = 2$. Their primary motivation was the problem of identifying correlations in data streams. In their model, a stream of pairs $(i,j) \in [n]^2$ arrive, giving a joint distribution $(X,Y)$, and they find approximation algorithms for how close the joint distribution is to the product of the marginal distributions under various metrics, which naturally corresponds to how close $X$ and $Y$ are to being independent. By using our technique, we obtain a new result for the problem of approximating the $\ell_2$ distance between the joint distribution and the product of the marginal distributions for $k$-ary vectors, instead of just pairs, in a single pass. Our analysis gives a randomized algorithm that is a $(1 \pm ε)$ approximation (with probability $1-δ$) that requires space logarithmic in $n$ and $m$ and proportional to $3^k$.
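The idea of multiplying per-coordinate sign functions can be sketched for the simpler task of estimating the second frequency moment of a stream of $k$-tuples. This is only an illustration under stated assumptions and not the paper's algorithm for the $\ell_2$ distance to the product of marginals: the names `make_sign_hash` and `product_sketch_f2`, the prime `P`, the repetition count, and the plain averaging of estimates are all choices made here. Each sign comes from the low bit of a random degree-3 polynomial over $\mathbb{Z}_P$, which is 4-wise independent as a field element but carries a negligible bias as a sign because `P` is odd. Coordinates are assumed to be integers in `[0, P)`.

```python
import random

P = 2_147_483_647  # Mersenne prime 2^31 - 1 (assumed field size)

def make_sign_hash(rng):
    """Random degree-3 polynomial over Z_P, mapped to a +/-1 sign."""
    coeffs = [rng.randrange(P) for _ in range(4)]
    def sign(x):
        v = 0
        for c in coeffs:          # Horner evaluation modulo P
            v = (v * x + c) % P
        return 1 if v & 1 else -1
    return sign

def product_sketch_f2(stream, k, repetitions=64, seed=0):
    """Estimate F_2 of a stream of k-tuples with product sign hashes."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repetitions):
        signs = [make_sign_hash(rng) for _ in range(k)]
        z = 0
        for item in stream:
            s = 1
            for h, coord in zip(signs, item):
                s *= h(coord)     # sign of a tuple = product over coordinates
            z += s
        estimates.append(z * z)   # each z^2 is one estimate of F_2
    return sum(estimates) / len(estimates)
```

A stream containing a single distinct tuple repeated $f$ times always yields exactly $f^2$, since every counter equals $\pm f$. For general streams the output is a random estimate whose accuracy depends on the number of repetitions.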
Submitted 3 February, 2010; v1 submitted 29 June, 2008;
originally announced June 2008.
-
Succinct Sampling on Streams
Authors:
Vladimir Braverman,
Rafail Ostrovsky,
Carlo Zaniolo
Abstract:
A streaming model is one where data items arrive over a long period of time, either one item at a time or in bursts. Typical tasks include computing various statistics over a sliding window of some fixed time-horizon. What makes the streaming model interesting is that as the time progresses, old items expire and new ones arrive. One of the simplest and central tasks in this model is sampling. That is, the task of maintaining up to $k$ uniformly distributed items from a current time-window as old items expire and new ones arrive. We call sampling algorithms succinct if they use provably optimal (up to constant factors) worst-case memory to maintain $k$ items (either with or without replacement). We stress that in many applications structures that have expected succinct representation as the time progresses are not sufficient, as small probability events eventually happen with probability 1. Thus, in this paper we ask the following question: are Succinct Sampling on Streams (or $S^3$-algorithms) possible, and if so for what models? Perhaps somewhat surprisingly, we show that $S^3$-algorithms are possible for all variants of the problem mentioned above, i.e. both with and without replacement and both for one-at-a-time and bursty arrival models. Finally, we use $S^3$ algorithms to solve various problems in the sliding windows model, including frequency moments, counting triangles, entropy and density estimations. For these problems we present the first solutions with provable worst-case memory guarantees.
Submitted 14 April, 2008; v1 submitted 25 February, 2007;
originally announced February 2007.
-
Fuzzy Extractors: How to Generate Strong Keys from Biometrics and Other Noisy Data
Authors:
Yevgeniy Dodis,
Rafail Ostrovsky,
Leonid Reyzin,
Adam Smith
Abstract:
We provide formal definitions and efficient secure techniques for
- turning noisy information into keys usable for any cryptographic application, and, in particular,
- reliably and securely authenticating biometric data.
Our techniques apply not just to biometric information, but to any keying material that, unlike traditional cryptographic keys, is (1) not reproducible precisely and (2) not distributed uniformly. We propose two primitives: a "fuzzy extractor" reliably extracts nearly uniform randomness R from its input; the extraction is error-tolerant in the sense that R will be the same even if the input changes, as long as it remains reasonably close to the original. Thus, R can be used as a key in a cryptographic application. A "secure sketch" produces public information about its input w that does not reveal w, and yet allows exact recovery of w given another value that is close to w. Thus, it can be used to reliably reproduce error-prone biometric inputs without incurring the security risk inherent in storing them.
We define the primitives to be both formally secure and versatile, generalizing much prior work. In addition, we provide nearly optimal constructions of both primitives for various measures of "closeness" of input data, such as Hamming distance, edit distance, and set difference.
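The secure-sketch interface for Hamming distance can be illustrated with a toy code-offset style construction: the sketch is the input XORed with a random codeword, and recovery decodes the shifted noisy input back to that codeword. The 3-fold repetition code, the bit-list representation, and the function names below are assumptions chosen for readability; the example tolerates one flipped bit per 3-bit block and makes no claim about the security or parameters of the constructions in the paper.

```python
import secrets

REP = 3  # each message bit is repeated REP times (toy error-correcting code)

def encode(bits):
    """Repetition-code encoder: repeat every bit REP times."""
    return [b for b in bits for _ in range(REP)]

def decode(bits):
    """Majority-vote decoder, one output bit per block of REP bits."""
    blocks = [bits[i:i + REP] for i in range(0, len(bits), REP)]
    return [1 if 2 * sum(block) > REP else 0 for block in blocks]

def sketch(w):
    """Public sketch of w: w XOR a uniformly random codeword."""
    if len(w) % REP:
        raise ValueError("input length must be a multiple of REP")
    message = [secrets.randbits(1) for _ in range(len(w) // REP)]
    return [a ^ b for a, b in zip(w, encode(message))]

def recover(w_noisy, s):
    """Recover w from the sketch s and a nearby reading w_noisy."""
    shifted = [a ^ b for a, b in zip(w_noisy, s)]   # codeword XOR error
    codeword = encode(decode(shifted))              # strip the error
    return [a ^ b for a, b in zip(s, codeword)]
```

In use, `s = sketch(w)` is stored publicly, and a later noisy reading `w_noisy` that differs from `w` in at most one position per block gives `recover(w_noisy, s) == w`.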
Submitted 1 April, 2008; v1 submitted 4 February, 2006;
originally announced February 2006.