
Near-Optimal Quantum Algorithms for Bounded Edit Distance and Lempel-Ziv Factorization

Authors: Daniel Gibney, Ce Jin, Tomasz Kociumaka, Sharma V. Thankachan

Abstract: Classically, the edit distance of two length-$n$ strings can be computed in $O(n^2)$ time, whereas an $O(n^{2-ε})$-time procedure would falsify the Orthogonal Vectors Hypothesis. If the edit distance does not exceed $k$, the running time can be improved to $O(n+k^2)$, which is near-optimal (conditioned on OVH) as a function of $n$ and $k$. Our first main contribution is a quantum $\tilde{O}(\sqrt{nk}+k^2)$-time algorithm that uses $\tilde{O}(\sqrt{nk})$ queries, where $\tilde{O}(\cdot)$ hides polylogarithmic factors. This query complexity is unconditionally optimal, and any significant improvement in the time complexity would resolve a long-standing open question of whether edit distance admits an $O(n^{2-ε})$-time quantum algorithm. Our divide-and-conquer quantum algorithm reduces the edit distance problem to a case where the strings have small Lempel-Ziv factorizations. Then, it combines a quantum LZ compression algorithm with a classical edit-distance subroutine for compressed strings. The LZ factorization problem can be classically solved in $O(n)$ time, which is unconditionally optimal in the quantum setting. We can, however, hope for a quantum speedup if we parameterize the complexity in terms of the factorization size $z$. Already a generic oracle identification algorithm yields the optimal query complexity of $\tilde{O}(\sqrt{nz})$ at the price of exponential running time. Our second main contribution is a quantum algorithm that achieves the optimal time complexity of $\tilde{O}(\sqrt{nz})$. The key tool is a novel LZ-like factorization of size $O(z\log^2n)$ whose subsequent factors can be efficiently computed through a combination of classical and quantum techniques. We can then obtain the string's run-length encoded Burrows-Wheeler Transform (BWT), construct the $r$-index, and solve many fundamental string processing problems in time $\tilde{O}(\sqrt{nz})$.

Submitted 3 November, 2023; originally announced November 2023.

Comments: Accepted to SODA 2024. arXiv admin note: substantial text overlap with arXiv:2302.07235
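
For readers unfamiliar with the bounded setting in the abstract above, the following is a minimal classical sketch of deciding whether the edit distance is at most $k$. It is a textbook banded dynamic program, not the $O(n+k^2)$ algorithm the abstract cites and not the paper's quantum algorithm; the function name `bounded_edit_distance` is hypothetical. It evaluates only the $O(k)$ cells per row with $|i-j| \le k$, although this short version still allocates a full row each iteration.

```python
from typing import Optional


def bounded_edit_distance(a: str, b: str, k: int) -> Optional[int]:
    """Return the edit distance of a and b if it is at most k, else None.

    Illustrative banded DP: only cells (i, j) with |i - j| <= k are filled,
    because any alignment of cost <= k never leaves that band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None  # the length difference alone already costs more than k
    inf = k + 1  # every value above k is treated the same
    # Row 0: turning the empty prefix of a into b[:j] costs j insertions.
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            cur[0] = i  # delete the first i characters of a
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # deletion from a
                         cur[j - 1] + 1,      # insertion into a
                         inf)
        prev = cur
    return prev[m] if prev[m] <= k else None
```

For example, `bounded_edit_distance("kitten", "sitting", 3)` returns the distance, while the same call with `k=2` reports that the threshold is exceeded by returning `None`.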

arXiv:2302.07235 [pdf, other]

Compressibility-Aware Quantum Algorithms on Strings

Authors: Daniel Gibney, Sharma V. Thankachan

Abstract: Sublinear time quantum algorithms have been established for many fundamental problems on strings. This work demonstrates that new, faster quantum algorithms can be designed when the string is highly compressible. We focus on two popular and theoretically significant compression algorithms -- the Lempel-Ziv77 algorithm (LZ77) and the Run-length-encoded Burrows-Wheeler Transform (RL-BWT), and obtain the results below. We first provide a quantum algorithm running in $\tilde{O}(\sqrt{zn})$ time for finding the LZ77 factorization of an input string $T[1..n]$ with $z$ factors. Combined with multiple existing results, this yields an $\tilde{O}(\sqrt{rn})$ time quantum algorithm for finding the RL-BWT encoding with $r$ BWT runs. Note that $r = \tilde{Θ}(z)$. We complement these results with lower bounds proving that our algorithms are optimal (up to polylog factors). Next, we study the problem of compressed indexing, where we provide a $\tilde{O}(\sqrt{rn})$ time quantum algorithm for constructing a recently designed $\tilde{O}(r)$ space structure with equivalent capabilities as the suffix tree. This data structure is then applied to numerous problems to obtain sublinear time quantum algorithms when the input is highly compressible. For example, we show that the longest common substring of two strings of total length $n$ can be computed in $\tilde{O}(\sqrt{zn})$ time, where $z$ is the number of factors in the LZ77 factorization of their concatenation. This beats the best known $\tilde{O}(n^\frac{2}{3})$ time quantum algorithm when $z$ is sufficiently small.

Submitted 14 February, 2023; originally announced February 2023.
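
To make the parameter $z$ in the abstract above concrete, here is a naive classical sketch of a greedy LZ77-style factorization. It is only an illustration and takes polynomial rather than sublinear time. The exact variant is an assumption (each factor is either a fresh character or the longest prefix of the remaining text that also starts at an earlier position, with overlaps allowed), and the name `lz77_factorize` is hypothetical.

```python
from typing import List, Tuple, Union

Factor = Union[str, Tuple[int, int]]  # literal character, or (source, length)


def lz77_factorize(t: str) -> List[Factor]:
    """Greedy left-to-right factorization; z is the length of the result.

    Assumed variant: a factor may overlap its earlier source occurrence.
    """
    factors: List[Factor] = []
    n = len(t)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier start j and extend the match as far as possible.
        for j in range(i):
            length = 0
            while i + length < n and t[j + length] == t[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
        if best_len == 0:
            factors.append(t[i])  # first occurrence of this character
            i += 1
        else:
            factors.append((best_src, best_len))
            i += best_len
    return factors
```

On a highly repetitive input such as `"abababab"` this yields only three factors, which is the kind of regime where running times expressed in $z$ fall well below $n$.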

arXiv:2201.12454 [pdf, other]

The Complexity of Approximate Pattern Matching on De Bruijn Graphs

Authors: Daniel Gibney, Sharma V. Thankachan, Srinivas Aluru

Abstract: Aligning a sequence to a walk in a labeled graph is a problem of fundamental importance to Computational Biology. For finding a walk in an arbitrary graph with $|E|$ edges that exactly matches a pattern of length $m$, a lower bound based on the Strong Exponential Time Hypothesis (SETH) implies an algorithm significantly faster than $O(|E|m)$ time is unlikely [Equi et al., ICALP 2019]. However, for many special graphs, such as de Bruijn graphs, the problem can be solved in linear time [Bowe et al., WABI 2012]. For approximate matching, the picture is more complex. When edits (substitutions, insertions, and deletions) are only allowed to the pattern, or when the graph is acyclic, the problem is again solvable in $O(|E|m)$ time. When edits are allowed to arbitrary cyclic graphs, the problem becomes NP-complete, even on binary alphabets [Jain et al., RECOMB 2019]. These results hold even when edits are restricted to only substitutions. The complexity of approximate pattern matching on de Bruijn graphs remained open. We investigate this problem and show that the properties that make de Bruijn graphs amenable to efficient exact pattern matching do not extend to approximate matching, even when restricted to the substitutions-only case with alphabet size four. We prove that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph. In addition, we demonstrate that an algorithm significantly faster than $O(|E|m)$ is unlikely for de Bruijn graphs in the case where only substitutions are allowed to the pattern. This stands in contrast to pattern-to-text matching where exact matching is solvable in linear time, like on de Bruijn graphs, but approximate matching under substitutions is solvable in subquadratic $O(n\sqrt{m})$ time, where $n$ is the text's length [Abrahamson, SIAM J. Computing 1987].

Submitted 28 January, 2022; originally announced January 2022.

arXiv:2009.05669 [pdf, other]

Quantifying Membership Inference Vulnerability via Generalization Gap and Other Model Metrics

Authors: Jason W. Bentley, Daniel Gibney, Gary Hoppenworth, Sumit Kumar Jha

Abstract: We demonstrate how a target model's generalization gap leads directly to an effective deterministic black box membership inference attack (MIA). This provides an upper bound on how secure a model can be to MIA based on a simple metric. Moreover, this attack is shown to be optimal in the expected sense given access to only certain likely obtainable metrics regarding the network's training and performance. Experimentally, this attack is shown to be comparable in accuracy to state-of-the-art MIAs in many cases.

Submitted 11 September, 2020; originally announced September 2020.

arXiv:2008.11786 [pdf, ps, other]

Simple Reductions from Formula-SAT to Pattern Matching on Labeled Graphs and Subtree Isomorphism

Authors: Daniel Gibney, Gary Hoppenworth, Sharma V. Thankachan

Abstract: The CNF formula satisfiability problem (CNF-SAT) has been reduced to many fundamental problems in P to prove tight lower bounds under the Strong Exponential Time Hypothesis (SETH). Recently, the works of Abboud, Hansen, Vassilevska W. and Williams (STOC 16), and later, Abboud and Bringmann (ICALP 18) have proposed basing lower bounds on the hardness of general boolean formula satisfiability (Formula-SAT). Reductions from Formula-SAT have two advantages over the usual reductions from CNF-SAT: (1) conjectures on the hardness of Formula-SAT are arguably much more plausible than those of CNF-SAT, and (2) these reductions give consequences even for logarithmic improvements in a problem's upper bounds. Here we give tight reductions from Formula-SAT to two more problems: pattern matching on labeled graphs (PMLG) and subtree isomorphism. Previous reductions from Formula-SAT were to sequence alignment problems such as Edit Distance, LCS, and Frechet Distance and required some technical work. This paper uses ideas similar to those used previously, but in a decidedly simpler setting, helping to illustrate the most salient features of the underlying techniques.

Submitted 26 August, 2020; originally announced August 2020.

arXiv:1911.03035 [pdf, other]

On the Complexity of BWT-runs Minimization via Alphabet Reordering

Authors: Jason Bentley, Daniel Gibney, Sharma V. Thankachan

Abstract: The Burrows-Wheeler Transform (BWT) has been an essential tool in text compression and indexing. First introduced in 1994, it went on to provide the backbone for the first encoding of the classic suffix tree data structure in space close to the entropy-based lower bound. Recently, there has been the development of compact suffix trees in space proportional to "$r$", the number of runs in the BWT, as well as the appearance of $r$ in the time complexity of new algorithms. Unlike other popular measures of compression, the parameter $r$ is sensitive to the lexicographic ordering given to the text's alphabet. Despite several past attempts to exploit this, a provably efficient algorithm for finding, or approximating, an alphabet ordering which minimizes $r$ has been open for years. We present the first set of results on the computational complexity of minimizing BWT-runs via alphabet reordering. We prove that the decision version of this problem is NP-complete and cannot be solved in time $2^{o(σ+ \sqrt{n})}$ unless the Exponential Time Hypothesis fails, where $σ$ is the size of the alphabet and $n$ is the length of the text. We also show that the optimization problem is APX-hard. In doing so, we relate two previously disparate topics: the optimal traveling salesperson path and the number of runs in the BWT of a text, providing a surprising connection between problems on graphs and text compression. Also, by relating recent results in the field of dictionary compression, we illustrate that an arbitrary alphabet ordering provides an $O(\log^2 n)$-approximation. We provide an optimal linear-time algorithm for the problem of finding a run-minimizing ordering on a subset of symbols (occurring only once) under ordering constraints, and prove that a generalization of this problem to a class of graphs with BWT-like properties, called Wheeler graphs, is NP-complete.

Submitted 18 February, 2020; v1 submitted 7 November, 2019; originally announced November 2019.
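
The dependence of $r$ on the alphabet ordering described in the abstract above can be observed on tiny inputs with a brute-force sketch. This is an illustration only: it sorts all rotations and tries every permutation, so it is exponential in the alphabet size, in line with the hardness results rather than around them. The sentinel `$` (assumed absent from the text and always ranked smallest) and the names `bwt_runs` and `best_order` are assumptions made for this example.

```python
from itertools import permutations


def bwt_runs(text: str, order: str) -> int:
    """Number of runs in the BWT of text + '$' under the given symbol order.

    `order` lists the distinct symbols of `text` from smallest to largest;
    the sentinel '$' is assumed not to occur in `text`.
    """
    rank = {c: i + 1 for i, c in enumerate(order)}
    rank["$"] = 0  # sentinel is the smallest symbol
    s = text + "$"
    rotations = sorted((s[i:] + s[:i] for i in range(len(s))),
                       key=lambda rot: [rank[c] for c in rot])
    bwt = "".join(rot[-1] for rot in rotations)
    # A new run starts wherever two adjacent BWT symbols differ.
    return 1 + sum(bwt[i] != bwt[i - 1] for i in range(1, len(bwt)))


def best_order(text: str) -> str:
    """Exhaustively search for an ordering minimizing the number of runs."""
    alphabet = sorted(set(text))
    return min(("".join(p) for p in permutations(alphabet)),
               key=lambda order: bwt_runs(text, order))
```

Comparing `bwt_runs(text, "".join(sorted(set(text))))` with `bwt_runs(text, best_order(text))` on short strings shows how much an ordering can change $r$.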

arXiv:1902.01960 [pdf, other]

On the Hardness and Inapproximability of Recognizing Wheeler Graphs

Authors: Daniel Gibney, Sharma V. Thankachan

Abstract: In recent years several compressed indexes based on variants of the Burrows-Wheeler transformation have been introduced. Some of these index structures far more complex than a single string, as was originally done with the FM-index [Ferragina and Manzini, J. ACM 2005]. As such, there has been an effort to better understand under which conditions such an indexing scheme is possible. This led to the introduction of Wheeler graphs [Gagie et al., Theor. Comput. Sci., 2017]. A Wheeler graph is a directed graph with edge labels which satisfies two simple axioms. Wheeler graphs can be indexed in a way which is space efficient and allows for fast traversal. Gagie et al. showed that de Bruijn graphs, generalized compressed suffix arrays, and several other BWT related structures can be represented as Wheeler graphs. Here we answer the open question of whether or not there exists an efficient algorithm for recognizing if a graph is a Wheeler graph. We demonstrate: (i) Recognizing if a graph is a Wheeler graph is NP-complete for any edge label alphabet of size $σ\geq 2$, even for DAGs. It can be solved in linear time for $σ=1$; (ii) An optimization variant called Wheeler Graph Violation (WGV) which aims to remove the minimum number of edges needed to obtain a Wheeler graph is APX-hard, even for DAGs. Hence, unless P = NP, there exists constant $C > 1$ such that there is no $C$-approximation algorithm. We show conditioned on the Unique Games Conjecture, for every constant $C \geq 1$, it is NP-hard to find a $C$-approximation to WGV; (iii) The Wheeler Subgraph problem (WS) which aims to find the largest Wheeler subgraph is in APX for $σ=O(1)$; (iv) For the above problems there exist efficient exponential time exact algorithms, relying on graph isomorphism being computed in strictly sub-exponential time; (v) A class of graphs where the recognition problem is polynomial time solvable.

Submitted 25 February, 2019; v1 submitted 5 February, 2019; originally announced February 2019.

Showing 1–7 of 7 results for author: Gibney, D