Perfect $L_p$ Sampling in a Data Stream

Jayaram, Rajesh; Woodruff, David P.

arXiv:1808.05497v2 (cs)

[Submitted on 16 Aug 2018 (v1), revised 28 Nov 2018 (this version, v2), latest version 11 Nov 2019 (v3)]

Abstract: In this paper, we resolve the one-pass space complexity of $L_p$ sampling for $p \in (0,2)$. Given a stream of updates (insertions and deletions) to the coordinates of an underlying vector $f \in \mathbb{R}^n$, a perfect $L_p$ sampler must output an index $i$ with probability $|f_i|^p/\|f\|_p^p$, and is allowed to fail with some probability $\delta$. So far, for $p > 0$, no algorithm has been shown to solve the problem exactly using $\text{poly}(\log n)$ bits of space. In 2010, Monemizadeh and Woodruff introduced an approximate $L_p$ sampler, which outputs $i$ with probability $(1 \pm \nu)|f_i|^p /\|f\|_p^p$, using space polynomial in $\nu^{-1}$ and $\log(n)$. The space complexity was later reduced by Jowhari, Sağlam, and Tardos to roughly $O(\nu^{-p} \log^2 n \log \delta^{-1})$ for $p \in (0,2)$, which tightly matches the $\Omega(\log^2 n \log \delta^{-1})$ lower bound in terms of $n$ and $\delta$, but is loose in terms of $\nu$.
Given these nearly tight bounds, it is perhaps surprising that no lower bound exists in terms of $\nu$---not even a bound of $\Omega(\nu^{-1})$ is known. In this paper, we explain this phenomenon by demonstrating the existence of an $O(\log^2 n \log \delta^{-1})$-bit perfect $L_p$ sampler for $p \in (0,2)$. This shows that $\nu$ need not factor into the space of an $L_p$ sampler, which closes the complexity of the problem for this range of $p$. For $p=2$, our bound is $O(\log^3 n \log \delta^{-1})$ bits, which matches the prior best known upper bound in terms of $n,\delta$, but has no dependence on $\nu$. For $p<2$, our bound holds in the random oracle model, matching the lower bounds in that model. Moreover, we show that our algorithm can be derandomized with only an $O((\log \log n)^2)$ blow-up in the space (and no blow-up for $p=2$). Our derandomization technique is general, and can be used to derandomize a large class of linear sketches.
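
To make the sampling guarantee concrete, the following is a minimal offline sketch of the target distribution $|f_i|^p/\|f\|_p^p$ and of the $(1 \pm \nu)$ approximate guarantee. It is not the algorithm of the paper: it stores $f$ explicitly, so it uses space linear in the number of touched coordinates rather than $\text{poly}(\log n)$ bits, and it does not model the failure probability $\delta$. The function names (`apply_stream`, `lp_distribution`, `sample_index`, `is_nu_approximate`) and the `(index, delta)` update format are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def apply_stream(updates):
    """Accumulate (index, delta) updates into an explicit frequency vector f.

    Assumption: each update is a pair (i, delta), where a negative delta
    models a deletion. A real streaming sampler would not store f.
    """
    f = defaultdict(float)
    for i, delta in updates:
        f[i] += delta
    return f


def lp_distribution(f, p):
    """Return {i: |f_i|^p / ||f||_p^p} over the nonzero coordinates of f."""
    weights = {i: abs(v) ** p for i, v in f.items() if v != 0}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("f is the zero vector; the L_p distribution is undefined")
    return {i: w / total for i, w in weights.items()}


def sample_index(f, p, rng=random):
    """Draw one index i with probability exactly |f_i|^p / ||f||_p^p."""
    dist = lp_distribution(f, p)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


def is_nu_approximate(candidate, target, nu):
    """Check that candidate[i] lies in (1 +/- nu) * target[i] for every index i."""
    return all(
        (1 - nu) * q <= candidate.get(i, 0.0) <= (1 + nu) * q
        for i, q in target.items()
    )


if __name__ == "__main__":
    # Insertions followed by a deletion that cancels coordinate 2: f = (3, 4, 0).
    updates = [(0, 3), (1, 4), (2, 5), (2, -5)]
    f = apply_stream(updates)
    target = lp_distribution(f, p=1)
    print(target)
    print(sample_index(f, p=1, rng=random.Random(0)))
    print(is_nu_approximate({0: 0.45, 1: 0.55}, target, nu=0.1))
```

For $p = 1$ and $f = (3, 4, 0)$ the target distribution puts mass $3/7$ on index 0 and $4/7$ on index 1. A perfect sampler must match these probabilities exactly, whereas an approximate sampler with parameter $\nu$ only needs to pass a check like `is_nu_approximate`.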

Comments:	An earlier version of this work appeared in FOCS 2018, but contained an error in the derandomization. In this version, we correct this issue, albeit with a $(\log \log n)^2$-factor increase in the space required to derandomize the algorithm for $p<2$. Our results in the random oracle model and for $p=2$ are unaffected. We also give alternative algorithms and additional applications.
Subjects:	Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:1808.05497 [cs.DS]
	(or arXiv:1808.05497v2 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.1808.05497

Submission history

From: Rajesh Jayaram
[v1] Thu, 16 Aug 2018 14:00:44 UTC (555 KB)
[v2] Wed, 28 Nov 2018 18:23:52 UTC (72 KB)
[v3] Mon, 11 Nov 2019 14:22:20 UTC (859 KB)
