-
Explicit Good Codes Approaching Distance 1 in Ulam Metric
Authors:
Elazar Goldenberg,
Mursalin Habib,
Karthik C. S
Abstract:
The Ulam distance of two permutations on $[n]$ is $n$ minus the length of their longest common subsequence. In this paper, we show that for every $\varepsilon>0$, there exists some $α>0$, and an infinite set $Γ\subseteq \mathbb{N}$, such that for all $n\inΓ$, there is an explicit set $C_n$ of $(n!)^α$ many permutations on $[n]$, such that every pair of permutations in $C_n$ has pairwise Ulam distance at least $(1-\varepsilon)\cdot n$. Moreover, we can compute the $i^{\text{th}}$ permutation in $C_n$ in poly$(n)$ time and can also decode in poly$(n)$ time, a permutation $π$ on $[n]$ to its closest permutation $π^*$ in $C_n$, if the Ulam distance of $π$ and $π^*$ is less than $ \frac{(1-\varepsilon)\cdot n}{4} $.
Previously, it was implicitly known by combining works of Goldreich and Wigderson [Israel Journal of Mathematics'23] and Farnoud, Skachek, and Milenkovic [IEEE Transactions on Information Theory'13] in a black-box manner, that it is possible to explicitly construct $(n!)^{Ω(1)}$ many permutations on $[n]$, such that every pair of them have pairwise Ulam distance at least $\frac{n}{6}\cdot (1-\varepsilon)$, for any $\varepsilon>0$, and the bound on the distance can be improved to $\frac{n}{4}\cdot (1-\varepsilon)$ if the construction of Goldreich and Wigderson is directly analyzed in the Ulam metric.
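The Ulam distance defined at the start of this abstract can be computed directly. The following is a minimal sketch, not taken from the paper: it assumes both inputs are permutations of the same symbols and uses the standard reduction from the longest common subsequence of two permutations to a longest increasing subsequence. The function name `ulam_distance` is illustrative.

```python
from bisect import bisect_left


def ulam_distance(p, q):
    """Return n minus the LCS length of two permutations p and q.

    Assumes p and q are sequences over the same n distinct symbols.
    """
    if len(set(p)) != len(p) or sorted(p) != sorted(q):
        raise ValueError("inputs must be permutations of the same symbols")

    # Relabel q by the position of each symbol in p. A common subsequence
    # of p and q then corresponds to an increasing subsequence of `seq`.
    position_in_p = {symbol: i for i, symbol in enumerate(p)}
    seq = [position_in_p[symbol] for symbol in q]

    # Patience-sorting LIS: tails[k] is the smallest possible last element
    # of an increasing subsequence of length k + 1 seen so far.
    tails = []
    for value in seq:
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
        else:
            tails[k] = value

    return len(p) - len(tails)
```

This runs in $O(n \log n)$ time. It only evaluates the metric; it does not construct or decode the codes $C_n$ described in the abstract.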
Submitted 11 May, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
Can You Solve Closest String Faster than Exhaustive Search?
Authors:
Amir Abboud,
Nick Fischer,
Elazar Goldenberg,
Karthik C. S.,
Ron Safier
Abstract:
We study the fundamental problem of finding the best string to represent a given set, in the form of the Closest String problem: Given a set $X \subseteq Σ^d$ of $n$ strings, find the string $x^*$ minimizing the radius of the smallest Hamming ball around $x^*$ that encloses all the strings in $X$. In this paper, we investigate whether the Closest String problem admits algorithms that are faster than the trivial exhaustive search algorithm. We obtain the following results for the two natural versions of the problem:
$\bullet$ In the continuous Closest String problem, the goal is to find the solution string $x^*$ anywhere in $Σ^d$. For binary strings, the exhaustive search algorithm runs in time $O(2^d poly(nd))$ and we prove that it cannot be improved to time $O(2^{(1-ε) d} poly(nd))$, for any $ε> 0$, unless the Strong Exponential Time Hypothesis fails.
$\bullet$ In the discrete Closest String problem, $x^*$ is required to be in the input set $X$. While this problem is clearly in polynomial time, its fine-grained complexity has been pinpointed to be quadratic time $n^{2 \pm o(1)}$ whenever the dimension is $ω(\log n) < d < n^{o(1)}$. We complement this known hardness result with new algorithms, proving essentially that whenever $d$ falls out of this hard range, the discrete Closest String problem can be solved faster than exhaustive search. In the small-$d$ regime, our algorithm is based on a novel application of the inclusion-exclusion principle.
Interestingly, all of our results apply (and some are even stronger) to the natural dual of the Closest String problem, called the Remotest String problem, where the task is to find a string maximizing the Hamming distance to all the strings in $X$.
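The exhaustive search baselines that the abstract compares against can be sketched as follows. This is an illustrative sketch only; the function names are hypothetical and nothing here reproduces the paper's faster algorithms. The continuous version enumerates all of $Σ^d$, and the discrete version tries every input string as the center.

```python
from itertools import product


def hamming(x, y):
    """Number of positions where equal-length strings x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def radius(center, strings):
    """Radius of the smallest Hamming ball around center covering strings."""
    return max(hamming(center, s) for s in strings)


def continuous_closest_string(strings, alphabet="01"):
    """Exhaustive search over all |alphabet|^d candidate centers."""
    d = len(strings[0])
    candidates = ("".join(t) for t in product(alphabet, repeat=d))
    best = min(candidates, key=lambda c: radius(c, strings))
    return best, radius(best, strings)


def discrete_closest_string(strings):
    """Exhaustive search restricted to centers from the input set."""
    best = min(strings, key=lambda c: radius(c, strings))
    return best, radius(best, strings)
```

For binary strings the continuous search performs $2^d$ radius evaluations of cost $O(nd)$ each, and the discrete search performs $n$ of them, which gives the quadratic-in-$n$ baseline.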
Submitted 29 May, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
-
An Algorithmic Bridge Between Hamming and Levenshtein Distances
Authors:
Elazar Goldenberg,
Tomasz Kociumaka,
Robert Krauthgamer,
Barna Saha
Abstract:
The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost $1/a$ for a parameter $a\ge 1$. This basic variant, denoted $ED_a$, bridges classical edit distance ($a=1$) with Hamming distance ($a\to\infty$), leading to interesting algorithmic challenges: Does the time complexity of computing $ED_a$ interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating $ED_a$?
We first present a simple deterministic exact algorithm for $ED_a$ and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a $(1+ε)$-approximation of $ED_a(X,Y)$, given strings $X,Y$ of total length $n$ and a bound $k\ge ED_a(X,Y)$. For simplicity, let us focus on $k\ge 1$ and a constant $ε> 0$; then, our algorithm takes $\tilde{O}(n/a + ak^3)$ time. Unless $a=\tilde{O}(1)$ and for small enough $k$, this running time is sublinear in $n$. We also consider a very natural version that asks to find a $(k_I, k_S)$-alignment -- an alignment with at most $k_I$ indels and $k_S$ substitutions. In this setting, we give an exact algorithm and, more importantly, an $\tilde{O}(nk_I/k_S + k_S\cdot k_I^3)$-time $(1,1+ε)$-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for $ED_a$ for $a=Θ(k_S / k_I)$. These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving $(1+ε)$-approximation in sublinear time, even for a favorable choice of $k$.
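To make the definition of $ED_a$ concrete, here is a minimal sketch of the textbook quadratic dynamic program with indel cost $1$ and substitution cost $1/a$. It is a plain baseline written for illustration, not the paper's exact or approximation algorithm; the name `ed_a` and the use of `Fraction` for exact arithmetic are choices made here.

```python
from fractions import Fraction


def ed_a(x, y, a):
    """Weighted edit distance: indels cost 1, substitutions cost 1/a (a >= 1)."""
    if a < 1:
        raise ValueError("the parameter a must be at least 1")
    sub_cost = 1 / Fraction(a)

    # prev[j] holds the cost of aligning the first i - 1 characters of x
    # with the first j characters of y.
    prev = [Fraction(j) for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [Fraction(i)]
        for j in range(1, len(y) + 1):
            match_or_sub = prev[j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost)
            delete = prev[j] + 1
            insert = cur[j - 1] + 1
            cur.append(min(match_or_sub, delete, insert))
        prev = cur
    return prev[-1]
```

With $a=1$ this is the classical unit-cost edit distance. As $a$ grows, substitutions become nearly free relative to indels, so for equal-length strings the value approaches the Hamming distance divided by $a$.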
Submitted 22 November, 2022;
originally announced November 2022.
-
Gap Edit Distance via Non-Adaptive Queries: Simple and Optimal
Authors:
Elazar Goldenberg,
Tomasz Kociumaka,
Robert Krauthgamer,
Barna Saha
Abstract:
We study the problem of approximating edit distance in sublinear time. This is formalized as the $(k,k^c)$-Gap Edit Distance problem, where the input is a pair of strings $X,Y$ and parameters $k,c>1$, and the goal is to return YES if $ED(X,Y)\leq k$, NO if $ED(X,Y)> k^c$, and an arbitrary answer when $k < ED(X,Y) \le k^c$. Recent years have witnessed significant interest in designing sublinear-time algorithms for Gap Edit Distance.
In this work, we resolve the non-adaptive query complexity of Gap Edit Distance for the entire range of parameters, improving over a sequence of previous results. Specifically, we design a non-adaptive algorithm with query complexity $\tilde{O}(n/k^{c-0.5})$, and we further prove that this bound is optimal up to polylogarithmic factors.
Our algorithm also achieves optimal time complexity $\tilde{O}(n/k^{c-0.5})$ whenever $c\geq 1.5$. For $1<c<1.5$, the running time of our algorithm is $\tilde{O}(n/k^{2c-1})$. In the restricted case of $k^c=Ω(n)$, this matches a known result [Batu, Ergün, Kilian, Magen, Raskhodnikova, Rubinfeld, and Sami; STOC 2003], and in all other (nontrivial) cases, our running time is strictly better than all previous algorithms, including the adaptive ones. However, an independent work of Bringmann, Cassis, Fischer, and Nakos [STOC 2022] provides an adaptive algorithm that bypasses the non-adaptive lower bound, but only for small enough $k$ and $c$.
Submitted 2 October, 2022; v1 submitted 24 November, 2021;
originally announced November 2021.
-
Does Preprocessing help in Fast Sequence Comparisons?
Authors:
Elazar Goldenberg,
Aviad Rubinstein,
Barna Saha
Abstract:
We study edit distance computation with preprocessing: the preprocessing algorithm acts on each string separately, and then the query algorithm takes as input the two preprocessed strings. This model is inspired by scenarios where we would like to compute edit distance between many pairs in the same pool of strings.
Our results include:
Permutation-LCS: If the LCS between two permutations has length $n-k$, we can compute it exactly with $O(n \log(n))$ preprocessing and $O(k \log(n))$ query time.
Small edit distance: For general strings, if their edit distance is at most $k$, we can compute it exactly with $O(n\log(n))$ preprocessing and $O(k^2 \log(n))$ query time.
Approximate edit distance: For the most general input, we can approximate the edit distance to within factor $(7+o(1))$ with preprocessing time $\tilde{O}(n^2)$ and query time $\tilde{O}(n^{1.5+o(1)})$.
All of these results significantly improve over the state of the art in edit distance computation without preprocessing. Interestingly, by combining ideas from our algorithms with preprocessing, we provide new improved results for approximating edit distance without preprocessing in subquadratic time.
Submitted 20 August, 2021;
originally announced August 2021.
-
Sublinear Algorithms for Gap Edit Distance
Authors:
Elazar Goldenberg,
Robert Krauthgamer,
Barna Saha
Abstract:
The edit distance is a way of quantifying how similar two strings are to one another by counting the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. A simple dynamic program computes the edit distance between two strings of length $n$ in $O(n^2)$ time, and a more sophisticated algorithm runs in time $O(n+t^2)$ when the edit distance is $t$ [Landau, Myers and Schmidt, SICOMP 1998]. In pursuit of faster running times, the last couple of decades have seen a flurry of research on approximating edit distance, including polylogarithmic approximation in near-linear time [Andoni, Krauthgamer and Onak, FOCS 2010], and a constant-factor approximation in subquadratic time [Chakraborty, Das, Goldenberg, Koucký and Saks, FOCS 2018].
We study sublinear-time algorithms for small edit distance, which was investigated extensively because of its numerous applications. Our main result is an algorithm for distinguishing whether the edit distance is at most $t$ or at least $t^2$ (the quadratic gap problem) in time $\tilde{O}(\frac{n}{t}+t^3)$. This time bound is sublinear roughly for all $t$ in $[ω(1), o(n^{1/3})]$, which was not known before. The best previous algorithms solve this problem in sublinear time only for $t=ω(n^{1/3})$ [Andoni and Onak, STOC 2009].
Our algorithm is based on a new approach that adaptively switches between uniform sampling and reading contiguous blocks of the input strings. In contrast, all previous algorithms choose which coordinates to query non-adaptively. Moreover, it can be extended to solve the $t$ vs $t^{2-ε}$ gap problem in time $\tilde{O}(\frac{n}{t^{1-ε}}+t^3)$.
Submitted 2 October, 2019;
originally announced October 2019.
-
Hardness Amplification of Optimization Problems
Authors:
Elazar Goldenberg,
Karthik C. S.
Abstract:
In this paper, we prove a general hardness amplification scheme for optimization problems based on the technique of direct products. We say that an optimization problem $Π$ is direct product feasible if it is possible to efficiently aggregate any $k$ instances of $Π$ and form one large instance of $Π$ such that given an optimal feasible solution to the larger instance, we can efficiently find optimal feasible solutions to all the $k$ smaller instances. Given a direct product feasible optimization problem $Π$, our hardness amplification theorem may be informally stated as follows: If there is a distribution $\mathcal{D}$ over instances of $Π$ of size $n$ such that every randomized algorithm running in time $t(n)$ fails to solve $Π$ on $\frac{1}{α(n)}$ fraction of inputs sampled from $\mathcal{D}$, then, assuming some relationships on $α(n)$ and $t(n)$, there is a distribution $\mathcal{D}'$ over instances of $Π$ of size $O(n\cdot α(n))$ such that every randomized algorithm running in time $\frac{t(n)}{poly(α(n))}$ fails to solve $Π$ on $\frac{99}{100}$ fraction of inputs sampled from $\mathcal{D}'$. As a consequence of the above theorem, we show hardness amplification of problems in various classes such as NP-hard problems like Max-Clique, Knapsack, and Max-SAT, problems in P such as Longest Common Subsequence, Edit Distance, Matrix Multiplication, and even problems in TFNP such as Factoring and computing Nash equilibrium.
Submitted 27 August, 2019;
originally announced August 2019.
-
Towards a General Direct Product Testing Theorem
Authors:
Elazar Goldenberg,
Karthik C. S.
Abstract:
The Direct Product encoding of a string $a\in \{0,1\}^n$ on an underlying domain $V\subseteq \binom{n}{k}$, is a function DP$_V(a)$ which gets as input a set $S\in V$ and outputs $a$ restricted to $S$. In the Direct Product Testing Problem, we are given a function $F:V\to \{0,1\}^k$, and our goal is to test whether $F$ is close to a direct product encoding, i.e., whether there exists some $a\in \{0,1\}^n$ such that on most sets $S$, we have $F(S)=$DP$_V(a)(S)$. A natural test is as follows: select a pair $(S,S')\in V$ according to some underlying distribution over $V\times V$, query $F$ on this pair, and check for consistency on their intersection. Note that the above distribution may be viewed as a weighted graph over the vertex set $V$ and is referred to as a test graph. The testability of direct products was studied over various specific domains and test graphs (for example see Dinur-Steurer [CCC'14]; Dinur-Kaufman [FOCS'17]). In this paper, we study the testability of direct products in a general setting, addressing the question: what properties of the domain and the test graph allow one to prove a direct product testing theorem? Towards this goal we introduce the notion of coordinate expansion of a test graph. Roughly speaking a test graph is a coordinate expander if it has global and local expansion, and has certain nice intersection properties on sampling. We show that whenever the test graph has coordinate expansion then it admits a direct product testing theorem. Additionally, for every $k$ and $n$ we provide a direct product domain $V\subseteq \binom{n}{k}$ of size $n$, called the Sliding Window domain for which we prove direct product testability.
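The encoding and the natural two-query test described in this abstract can be illustrated with a small sketch. Several choices here are assumptions made for the example and are not stated in the abstract: the Sliding Window domain is modeled as the $n$ cyclic windows of $k$ consecutive coordinates, and the test distribution is uniform over pairs of windows. All function names are hypothetical.

```python
import random


def sliding_windows(n, k):
    """Assumed model of a size-n domain: cyclic windows of k consecutive indices."""
    return [tuple(sorted((start + j) % n for j in range(k))) for start in range(n)]


def dp_encode(a, domain):
    """Direct product encoding: map each set S to the restriction of a to S."""
    return {S: tuple(a[i] for i in S) for S in domain}


def consistent(F, S, T):
    """Check that F(S) and F(T) agree on every coordinate in S ∩ T."""
    f_s = dict(zip(S, F[S]))
    f_t = dict(zip(T, F[T]))
    return all(f_s[i] == f_t[i] for i in set(S) & set(T))


def agreement_rate(F, domain, trials, rng):
    """Fraction of sampled pairs (S, T) on which the consistency check passes."""
    passed = 0
    for _ in range(trials):
        S, T = rng.choice(domain), rng.choice(domain)
        passed += consistent(F, S, T)
    return passed / trials
```

A genuine direct product encoding passes every check, so its agreement rate is 1. A testing theorem addresses the converse direction: a function passing with high probability must be close to some encoding, which this sketch does not attempt to verify.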
Submitted 18 January, 2019;
originally announced January 2019.
-
Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time
Authors:
Diptarka Chakraborty,
Debarati Das,
Elazar Goldenberg,
Michal Koucký,
Michael Saks
Abstract:
Edit distance is a measure of similarity of two strings based on the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. The edit distance can be computed exactly using a dynamic programming algorithm that runs in quadratic time. Andoni, Krauthgamer, and Onak (2010) gave a nearly linear time algorithm that approximates edit distance within an approximation factor $\text{poly}(\log n)$.
In this paper, we provide an algorithm with running time $\tilde{O}(n^{2-2/7})$ that approximates the edit distance within a constant factor.
Submitted 15 February, 2021; v1 submitted 8 October, 2018;
originally announced October 2018.
-
Streaming Algorithms For Computing Edit Distance Without Exploiting Suffix Trees
Authors:
Diptarka Chakraborty,
Elazar Goldenberg,
Michal Koucký
Abstract:
The edit distance is a way of quantifying how similar two strings are to one another by counting the minimum number of character insertions, deletions, and substitutions required to transform one string into the other.
In this paper we study the computational problem of computing the edit distance between a pair of strings where their distance is bounded by a parameter $k\ll n$. We present two streaming algorithms for computing edit distance: one runs in time $O(n+k^2)$ and the other in time $n+O(k^3)$. By writing $n+O(k^3)$ we want to emphasize that the number of operations per input symbol is a small constant. In particular, the running time does not depend on the alphabet size, and the algorithm should be easy to implement.
Previously a streaming algorithm with running time $O(n+k^4)$ was given in the paper by the current authors (STOC'16). The best off-line algorithm runs in time $O(n+k^2)$ (Landau et al., 1998) which is known to be optimal under the Strong Exponential Time Hypothesis.
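For orientation, the bounded-distance setting can be illustrated with a simple banded dynamic program. This sketch is neither one of the streaming algorithms from the paper nor the $O(n+k^2)$ off-line algorithm of Landau et al.; it is the simpler $O(nk)$-time variant that only fills table cells within $k$ of the main diagonal. The function name is hypothetical.

```python
def bounded_edit_distance(x, y, k):
    """Return the edit distance of x and y if it is at most k, else None."""
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return None  # the length difference alone already exceeds k

    inf = k + 1  # every value above k is treated as "too large"

    # Row 0 of the table, restricted to columns within the band.
    prev = {j: j for j in range(min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i  # delete the first i characters of x
                continue
            diagonal = prev.get(j - 1, inf) + (x[i - 1] != y[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(diagonal, delete, insert, inf)
        prev = cur

    distance = prev.get(m, inf)
    return distance if distance <= k else None
```

Restricting to the band is sound because an alignment of cost at most $k$ can never move more than $k$ cells away from the main diagonal.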
Submitted 13 July, 2016;
originally announced July 2016.
-
May We Have Your Attention: Analysis of a Selective Attention Task
Authors:
Eldan Goldenberg,
Jacob R. Garcowski,
Randall D. Beer
Abstract:
In this paper we present a deeper analysis than has previously been carried out of a selective attention problem, and the evolution of continuous-time recurrent neural networks to solve it. We show that the task has a rich structure, and agents must solve a variety of subproblems to perform well. We consider the relationship between the complexity of an agent and the ease with which it can evolve behavior that generalizes well across subproblems, and demonstrate a shaping protocol that improves generalization.
Submitted 29 June, 2006;
originally announced June 2006.