Search | arXiv e-print repository

Approximating the Longest Common Subsequence problem within a sub-polynomial factor in linear time

Abstract: The Longest Common Subsequence (LCS) of two strings is a fundamental string similarity measure with a classical dynamic programming solution taking quadratic time. Despite significant efforts, little progress was made in improving the runtime. Even in the realm of approximation, not much was known for linear time algorithms beyond the trivial $\sqrt{n}$-approximation. A recent breakthrough result provided an $n^{0.497}$-factor approximation algorithm [HSSS19], which was more recently improved to an $n^{0.4}$-factor one [BCD21]. The latter paper also showed an $n^{2-2.5α}$ time algorithm which outputs an $n^α$ approximation to the LCS, but so far no sub-polynomial approximation is known in truly subquadratic time. In this work, we show an algorithm which runs in $O(n)$ time, and outputs an $n^{o(1)}$-factor approximation to LCS$(x,y)$, with high probability, for any pair of length $n$ input strings. Our entire algorithm is merely an efficient black-box reduction to the Block-LIS problem, introduced very recently in [ANSS21], and solving the Block-LIS problem directly.

Submitted 15 December, 2021; originally announced December 2021.
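
For orientation, the "classical dynamic programming solution taking quadratic time" mentioned in the abstract above can be sketched as follows. This is only the textbook baseline, not the paper's linear-time approximation algorithm; the function name `lcs_length` is an illustrative choice.

```python
def lcs_length(x: str, y: str) -> int:
    """Textbook O(len(x) * len(y)) dynamic program for the LCS length.

    Only the previous row of the DP table is kept, so memory is O(len(y)).
    """
    prev = [0] * (len(y) + 1)  # row for the empty prefix of x
    for cx in x:
        curr = [0]  # LCS of any prefix of x with the empty prefix of y is 0
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                # Matching characters extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop the last character of x or of y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
```

The nested loops make the quadratic cost explicit; the approximation results listed in the abstract are attempts to avoid exactly this cost.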

arXiv:2112.05106 [pdf, other]

Estimating the Longest Increasing Subsequence in Nearly Optimal Time

Authors: Alexandr Andoni, Negev Shekel Nosatzki, Sandip Sinha, Clifford Stein

Abstract: Longest Increasing Subsequence (LIS) is a fundamental statistic of a sequence, and has been studied for decades. While the LIS of a sequence of length $n$ can be computed exactly in time $O(n\log n)$, the complexity of estimating the (length of the) LIS in sublinear time, especially when LIS $\ll n$, is still open. We show that for any integer $n$ and any $λ= o(1)$, there exists a (randomized) non-adaptive algorithm that, given a sequence of length $n$ with LIS $\ge λn$, approximates the LIS up to a factor of $1/λ^{o(1)}$ in $n^{o(1)} / λ$ time. Our algorithm improves upon prior work substantially in terms of both approximation and run-time: (i) we provide the first sub-polynomial approximation for LIS in sub-linear time; and (ii) our run-time complexity essentially matches the trivial sample complexity lower bound of $Ω(1/λ)$, which is required to obtain any non-trivial approximation of the LIS. As part of our solution, we develop two novel ideas which may be of independent interest: First, we define a new Genuine-LIS problem, where each sequence element may either be genuine or corrupted. In this model, the user receives unrestricted access to the actual sequence, but does not know a priori which elements are genuine. The goal is to estimate the LIS using genuine elements only, with the minimal number of "genuineness tests". The second idea, Precision Forest, enables accurate estimations for composition of general functions from "coarse" (sub-)estimates. Precision Forest essentially generalizes classical precision sampling, which works only for summations. As a central tool, the Precision Forest is initially pre-processed on a set of samples, which thereafter is repeatedly reused by multiple sub-parts of the algorithm, improving their amortized complexity.

Submitted 1 November, 2022; v1 submitted 9 December, 2021; originally announced December 2021.

Comments: Full version of FOCS 2022 paper

ACM Class: F.2.0
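
The exact $O(n\log n)$ computation that the abstract above takes as its baseline is commonly done with a binary-search ("patience sorting") method, sketched below. This is the standard exact algorithm, not the paper's sublinear estimator; the name `lis_length` and the choice of strictly increasing subsequences are assumptions made for illustration.

```python
from bisect import bisect_left
from typing import Sequence


def lis_length(seq: Sequence[int]) -> int:
    """Exact length of the longest strictly increasing subsequence.

    tails[k] holds the smallest possible last element of an increasing
    subsequence of length k + 1 seen so far; tails stays sorted, so each
    element is placed with one binary search, giving O(n log n) overall.
    """
    tails: list[int] = []
    for value in seq:
        pos = bisect_left(tails, value)
        if pos == len(tails):
            tails.append(value)  # value extends the longest subsequence
        else:
            tails[pos] = value   # value gives a smaller tail for length pos + 1
    return len(tails)


if __name__ == "__main__":
    print(lis_length([10, 9, 2, 5, 3, 7, 101, 18]))
```

This baseline reads every element once, which is what a sublinear-time estimator such as the one in the abstract cannot afford to do.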

arXiv:2005.07678 [pdf, other]

Edit Distance in Near-Linear Time: it's a Constant Factor

Authors: Alexandr Andoni, Negev Shekel Nosatzki

Abstract: We present an algorithm for approximating the edit distance between two strings of length $n$ in time $n^{1+\varepsilon}$ up to a constant factor, for any $\varepsilon>0$. Our result completes a research direction set forth in the recent breakthrough paper [Chakraborty-Das-Goldenberg-Koucky-Saks, FOCS'18], which showed the first constant-factor approximation algorithm with a (strongly) sub-quadratic running time. The recent results of [Koucky-Saks, STOC'20] and [Brakensiek-Rubinstein, STOC'20] have shown near-linear time algorithms that obtain an additive approximation, near-linear in $n$ (equivalently, constant-factor approximation when the edit distance value is close to $n$). In contrast, our algorithm obtains a constant-factor approximation in near-linear time for any input strings. In contrast to prior algorithms, which are mostly recursing over smaller substrings, our algorithm gradually smoothes out the local contribution to the edit distance over progressively larger substrings. To accomplish this, we iteratively construct a distance oracle data structure for the metric of edit distance on all substrings of input strings, of length $n^{i\varepsilon}$ for $i=0,1,\ldots,1/\varepsilon$. The distance oracle approximates the edit distance over these substrings in a certain average sense, just enough to estimate the overall edit distance.

Submitted 14 July, 2022; v1 submitted 15 May, 2020; originally announced May 2020.
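
As a point of reference for the quantity being approximated in the abstract above, exact edit distance can be computed with the standard quadratic-time dynamic program. The sketch below assumes unit-cost insertions, deletions, and substitutions (Levenshtein distance); it is not the paper's near-linear-time algorithm, and the name `edit_distance` is illustrative.

```python
def edit_distance(x: str, y: str) -> int:
    """Exact unit-cost edit distance via the standard O(len(x) * len(y)) DP."""
    # Distance from the empty prefix of x to each prefix of y: j insertions.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # i deletions turn the first i characters of x into ""
        for j, cy in enumerate(y, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete cx
                    curr[j - 1] + 1,           # insert cy
                    prev[j - 1] + (cx != cy),  # substitute, free on a match
                )
            )
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

An approximation algorithm in the sense of the abstract would return a value within a constant factor of this exact distance without filling the full table.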

arXiv:1811.04065 [pdf, other]

Two Party Distribution Testing: Communication and Security

Authors: Alexandr Andoni, Tal Malkin, Negev Shekel Nosatzki

Abstract: We study the problem of discrete distribution testing in the two-party setting. For example, in the standard closeness testing problem, Alice and Bob each have $t$ samples from, respectively, distributions $a$ and $b$ over $[n]$, and they need to test whether $a=b$ or $a,b$ are $ε$-far for some fixed $ε>0$. This is in contrast to the well-studied one-party case, where the tester has unrestricted access to samples of both distributions, for which optimal bounds are known for a number of variations. Despite being a natural constraint in applications, the two-party setting has evaded attention so far. We address two fundamental aspects: 1) what is the communication complexity, and 2) can it be accomplished securely, without Alice and Bob learning extra information about each other's input. Besides closeness testing, we also study the independence testing problem, where Alice and Bob have $t$ samples from distributions $a$ and $b$ respectively, which may be correlated; the question is whether $a,b$ are independent or $ε$-far from being independent. Our contribution is three-fold: 1) Communication: we show how to gain communication efficiency with more samples, beyond the information-theoretic bound on $t$. The gain is polynomially better than what one obtains by adapting standard algorithms. 2) Lower bounds: we prove tightness of our protocols for the closeness testing, and for the independence testing when the number of samples is unbounded. These lower bounds are of independent interest as these are the first 2-party communication lower bounds for testing problems. 3) Security: we define secure distribution testing and argue that it must leak at least some minimal information. We then provide secure versions of the above protocols with an overhead that is only polynomial in the security parameter.

Submitted 9 November, 2018; originally announced November 2018.

Showing 1–4 of 4 results for author: Nosatzki, N S