Search | arXiv e-print repository

Near-Optimal Quantum Algorithms for Bounded Edit Distance and Lempel-Ziv Factorization

Authors: Daniel Gibney, Ce Jin, Tomasz Kociumaka, Sharma V. Thankachan

Abstract: Classically, the edit distance of two length-$n$ strings can be computed in $O(n^2)$ time, whereas an $O(n^{2-ε})$-time procedure would falsify the Orthogonal Vectors Hypothesis. If the edit distance does not exceed $k$, the running time can be improved to $O(n+k^2)$, which is near-optimal (conditioned on OVH) as a function of $n$ and $k$. Our first main contribution is a quantum $\tilde{O}(\sqrt{nk}+k^2)$-time algorithm that uses $\tilde{O}(\sqrt{nk})$ queries, where $\tilde{O}(\cdot)$ hides polylogarithmic factors. This query complexity is unconditionally optimal, and any significant improvement in the time complexity would resolve a long-standing open question of whether edit distance admits an $O(n^{2-ε})$-time quantum algorithm. Our divide-and-conquer quantum algorithm reduces the edit distance problem to a case where the strings have small Lempel-Ziv factorizations. Then, it combines a quantum LZ compression algorithm with a classical edit-distance subroutine for compressed strings. The LZ factorization problem can be classically solved in $O(n)$ time, which is unconditionally optimal in the quantum setting. We can, however, hope for a quantum speedup if we parameterize the complexity in terms of the factorization size $z$. Already a generic oracle identification algorithm yields the optimal query complexity of $\tilde{O}(\sqrt{nz})$ at the price of exponential running time. Our second main contribution is a quantum algorithm that achieves the optimal time complexity of $\tilde{O}(\sqrt{nz})$. The key tool is a novel LZ-like factorization of size $O(z\log^2n)$ whose subsequent factors can be efficiently computed through a combination of classical and quantum techniques. We can then obtain the string's run-length encoded Burrows-Wheeler Transform (BWT), construct the $r$-index, and solve many fundamental string processing problems in time $\tilde{O}(\sqrt{nz})$.
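
To illustrate why a bound $k$ on the edit distance helps classically, the sketch below fills only the cells of the dynamic-programming table that lie within $k$ of the main diagonal. This is a textbook banded DP running in $O(nk)$ time, which is simpler and slower than the $O(n+k^2)$ bound quoted in the abstract, and it is not the quantum algorithm of the paper. The function name `bounded_edit_distance` is a hypothetical choice for this example.

```python
def bounded_edit_distance(a: str, b: str, k: int):
    """Return the edit distance of a and b if it is at most k, else None.

    Illustrative banded DP (an assumption of this sketch, not the paper's
    method): only cells (i, j) with |i - j| <= k are filled.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None  # the length difference alone already exceeds k
    INF = k + 1      # every value above k is treated as "too far"
    # Row 0: turning the empty prefix of a into b[:j] costs j insertions.
    prev = {j: j for j in range(0, min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i  # delete the first i characters of a
                continue
            # Match or substitute a[i-1] with b[j-1].
            best = prev.get(j - 1, INF) + (a[i - 1] != b[j - 1])
            # Delete a[i-1].
            best = min(best, prev.get(j, INF) + 1)
            # Insert b[j-1].
            best = min(best, cur.get(j - 1, INF) + 1)
            cur[j] = min(best, INF)
        prev = cur
    d = prev.get(m, INF)
    return d if d <= k else None
```

Calling `bounded_edit_distance("kitten", "sitting", 3)` returns the distance, while the same call with `k = 2` returns `None` because the true distance exceeds the bound.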

Submitted 3 November, 2023; originally announced November 2023.

Comments: Accepted to SODA 2024. arXiv admin note: substantial text overlap with arXiv:2302.07235

arXiv:2302.07235 [pdf, other]

Compressibility-Aware Quantum Algorithms on Strings

Authors: Daniel Gibney, Sharma V. Thankachan

Abstract: Sublinear time quantum algorithms have been established for many fundamental problems on strings. This work demonstrates that new, faster quantum algorithms can be designed when the string is highly compressible. We focus on two popular and theoretically significant compression algorithms -- the Lempel-Ziv77 algorithm (LZ77) and the Run-length-encoded Burrows-Wheeler Transform (RL-BWT), and obtain the results below. We first provide a quantum algorithm running in $\tilde{O}(\sqrt{zn})$ time for finding the LZ77 factorization of an input string $T[1..n]$ with $z$ factors. Combined with multiple existing results, this yields an $\tilde{O}(\sqrt{rn})$ time quantum algorithm for finding the RL-BWT encoding with $r$ BWT runs. Note that $r = \tilde{Θ}(z)$. We complement these results with lower bounds proving that our algorithms are optimal (up to polylog factors). Next, we study the problem of compressed indexing, where we provide a $\tilde{O}(\sqrt{rn})$ time quantum algorithm for constructing a recently designed $\tilde{O}(r)$ space structure with equivalent capabilities as the suffix tree. This data structure is then applied to numerous problems to obtain sublinear time quantum algorithms when the input is highly compressible. For example, we show that the longest common substring of two strings of total length $n$ can be computed in $\tilde{O}(\sqrt{zn})$ time, where $z$ is the number of factors in the LZ77 factorization of their concatenation. This beats the best known $\tilde{O}(n^\frac{2}{3})$ time quantum algorithm when $z$ is sufficiently small.
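
For readers unfamiliar with the parameter $z$, the following sketch computes a greedy LZ77 factorization by brute force. It assumes the common variant in which each factor is the longest prefix of the remaining suffix that also starts at an earlier position (overlaps allowed), or a single fresh character otherwise; the abstract does not say which variant the paper uses. This is a classical, quadratic-or-worse illustration with a hypothetical function name, not the quantum algorithm.

```python
def lz77_factorize(text: str):
    """Greedy LZ77 parse of text (illustrative brute force).

    Assumed variant: a factor is the longest prefix of text[i:] that also
    starts at some earlier position (self-overlap allowed), or one new
    character when no earlier occurrence exists.
    """
    factors = []
    i, n = 0, len(text)
    while i < n:
        best_len = 0
        for start in range(i):  # every earlier starting position
            length = 0
            while i + length < n and text[start + length] == text[i + length]:
                length += 1
            best_len = max(best_len, length)
        step = max(best_len, 1)  # a fresh character forms its own factor
        factors.append(text[i:i + step])
        i += step
    return factors
```

The number of factors `len(lz77_factorize(T))` plays the role of $z$; highly repetitive inputs such as `"abababab"` yield very few factors relative to their length.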

Submitted 14 February, 2023; originally announced February 2023.

arXiv:2201.12454 [pdf, other]

The Complexity of Approximate Pattern Matching on De Bruijn Graphs

Authors: Daniel Gibney, Sharma V. Thankachan, Srinivas Aluru

Abstract: Aligning a sequence to a walk in a labeled graph is a problem of fundamental importance to Computational Biology. For finding a walk in an arbitrary graph with $|E|$ edges that exactly matches a pattern of length $m$, a lower bound based on the Strong Exponential Time Hypothesis (SETH) implies that an algorithm significantly faster than $O(|E|m)$ time is unlikely [Equi et al., ICALP 2019]. However, for many special graphs, such as de Bruijn graphs, the problem can be solved in linear time [Bowe et al., WABI 2012]. For approximate matching, the picture is more complex. When edits (substitutions, insertions, and deletions) are only allowed to the pattern, or when the graph is acyclic, the problem is again solvable in $O(|E|m)$ time. When edits are allowed to arbitrary cyclic graphs, the problem becomes NP-complete, even on binary alphabets [Jain et al., RECOMB 2019]. These results hold even when edits are restricted to only substitutions. The complexity of approximate pattern matching on de Bruijn graphs remained open. We investigate this problem and show that the properties that make de Bruijn graphs amenable to efficient exact pattern matching do not extend to approximate matching, even when restricted to the substitutions-only case with alphabet size four. We prove that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph. In addition, we demonstrate that an algorithm significantly faster than $O(|E|m)$ is unlikely for de Bruijn graphs in the case where only substitutions are allowed to the pattern. This stands in contrast to pattern-to-text matching where exact matching is solvable in linear time, like on de Bruijn graphs, but approximate matching under substitutions is solvable in subquadratic $O(n\sqrt{m})$ time, where $n$ is the text's length [Abrahamson, SIAM J. Computing 1987].

Submitted 28 January, 2022; originally announced January 2022.

arXiv:2008.11786 [pdf, ps, other]

Simple Reductions from Formula-SAT to Pattern Matching on Labeled Graphs and Subtree Isomorphism

Authors: Daniel Gibney, Gary Hoppenworth, Sharma V. Thankachan

Abstract: The CNF formula satisfiability problem (CNF-SAT) has been reduced to many fundamental problems in P to prove tight lower bounds under the Strong Exponential Time Hypothesis (SETH). Recently, the works of Abboud, Hansen, Vassilevska W. and Williams (STOC 16), and later, Abboud and Bringmann (ICALP 18) have proposed basing lower bounds on the hardness of general boolean formula satisfiability (Formula-SAT). Reductions from Formula-SAT have two advantages over the usual reductions from CNF-SAT: (1) conjectures on the hardness of Formula-SAT are arguably much more plausible than those of CNF-SAT, and (2) these reductions give consequences even for logarithmic improvements in a problem's upper bounds. Here we give tight reductions from Formula-SAT to two more problems: pattern matching on labeled graphs (PMLG) and subtree isomorphism. Previous reductions from Formula-SAT were to sequence alignment problems such as Edit Distance, LCS, and Frechet Distance and required some technical work. This paper uses ideas similar to those used previously, but in a decidedly simpler setting, helping to illustrate the most salient features of the underlying techniques.

Submitted 26 August, 2020; originally announced August 2020.

arXiv:1911.03035 [pdf, other]

On the Complexity of BWT-runs Minimization via Alphabet Reordering

Authors: Jason Bentley, Daniel Gibney, Sharma V. Thankachan

Abstract: The Burrows-Wheeler Transform (BWT) has been an essential tool in text compression and indexing. First introduced in 1994, it went on to provide the backbone for the first encoding of the classic suffix tree data structure in space close to the entropy-based lower bound. Recently, there has been the development of compact suffix trees in space proportional to "$r$", the number of runs in the BWT, as well as the appearance of $r$ in the time complexity of new algorithms. Unlike other popular measures of compression, the parameter $r$ is sensitive to the lexicographic ordering given to the text's alphabet. Despite several past attempts to exploit this, a provably efficient algorithm for finding, or approximating, an alphabet ordering which minimizes $r$ has been open for years. We present the first set of results on the computational complexity of minimizing BWT-runs via alphabet reordering. We prove that the decision version of this problem is NP-complete and cannot be solved in time $2^{o(σ+ \sqrt{n})}$ unless the Exponential Time Hypothesis fails, where $σ$ is the size of the alphabet and $n$ is the length of the text. We also show that the optimization problem is APX-hard. In doing so, we relate two previously disparate topics: the optimal traveling salesperson path and the number of runs in the BWT of a text, providing a surprising connection between problems on graphs and text compression. Also, by relating recent results in the field of dictionary compression, we illustrate that an arbitrary alphabet ordering provides an $O(\log^2 n)$-approximation. We provide an optimal linear-time algorithm for the problem of finding a run-minimizing ordering on a subset of symbols (occurring only once) under ordering constraints, and prove that a generalization of this problem to a class of graphs with BWT-like properties, called Wheeler graphs, is NP-complete.
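
The sensitivity of $r$ to the alphabet ordering can be seen on a small example. The sketch below builds the BWT from sorted rotations under a caller-supplied symbol order and counts its runs. It assumes the text is terminated by a sentinel `$` that is smaller than every other symbol, which is a common convention but not one stated in the abstract; the helper names are hypothetical, and trying all orderings is only feasible for tiny alphabets.

```python
from itertools import permutations

def bwt(text: str, order: str) -> str:
    """BWT of text + '$' via sorted rotations.

    Assumption of this sketch: '$' is the smallest symbol and the other
    symbols are ranked by their position in `order`.
    """
    rank = {"$": 0}
    rank.update({c: i + 1 for i, c in enumerate(order)})
    s = text + "$"
    rotations = sorted(
        (s[i:] + s[:i] for i in range(len(s))),
        key=lambda rot: [rank[c] for c in rot],
    )
    return "".join(rot[-1] for rot in rotations)

def count_runs(s: str) -> int:
    """Number of maximal blocks of equal consecutive characters."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])

def min_runs_brute_force(text: str) -> int:
    """Try every ordering of the symbols of text (exponential in sigma)."""
    symbols = sorted(set(text))
    return min(count_runs(bwt(text, "".join(p))) for p in permutations(symbols))
```

For the text `banana`, the ordering `abn` gives a BWT with 5 runs, while the ordering `nba` gives 4, which is the minimum possible because four distinct symbols appear.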

Submitted 18 February, 2020; v1 submitted 7 November, 2019; originally announced November 2019.

arXiv:1902.01960 [pdf, other]

On the Hardness and Inapproximability of Recognizing Wheeler Graphs

Authors: Daniel Gibney, Sharma V. Thankachan

Abstract: In recent years several compressed indexes based on variants of the Burrows-Wheeler transformation have been introduced. Some of these indexes cover structures far more complex than a single string, which was the original setting of the FM-index [Ferragina and Manzini, J. ACM 2005]. As such, there has been an effort to better understand under which conditions such an indexing scheme is possible. This led to the introduction of Wheeler graphs [Gagie et al., Theor. Comput. Sci., 2017]. A Wheeler graph is a directed graph with edge labels which satisfies two simple axioms. Wheeler graphs can be indexed in a way which is space efficient and allows for fast traversal. Gagie et al. showed that de Bruijn graphs, generalized compressed suffix arrays, and several other BWT related structures can be represented as Wheeler graphs. Here we answer the open question of whether or not there exists an efficient algorithm for recognizing if a graph is a Wheeler graph. We demonstrate: (i) Recognizing if a graph is a Wheeler graph is NP-complete for any edge label alphabet of size $σ\geq 2$, even for DAGs. It can be solved in linear time for $σ=1$; (ii) An optimization variant called Wheeler Graph Violation (WGV) which aims to remove the minimum number of edges needed to obtain a Wheeler graph is APX-hard, even for DAGs. Hence, unless P = NP, there exists a constant $C > 1$ such that there is no $C$-approximation algorithm. We show that, conditioned on the Unique Games Conjecture, for every constant $C \geq 1$, it is NP-hard to find a $C$-approximation to WGV; (iii) The Wheeler Subgraph problem (WS) which aims to find the largest Wheeler subgraph is in APX for $σ=O(1)$; (iv) For the above problems there exist efficient exponential time exact algorithms, relying on graph isomorphism being computed in strictly sub-exponential time; (v) A class of graphs where the recognition problem is polynomial time solvable.

Submitted 25 February, 2019; v1 submitted 5 February, 2019; originally announced February 2019.

arXiv:1805.06177 [pdf, ps, other]

On Computing Average Common Substring Over Run Length Encoded Sequences

Authors: Sahar Hooshmand, Neda Tavakoli, Paniz Abedin, Sharma V. Thankachan

Abstract: The Average Common Substring (ACS) is a popular alignment-free distance measure for phylogeny reconstruction. The ACS can be computed in O(n) space and time, where n=x+y is the input size. Compressed string matching is the study of string matching problems with the following twist: the input data is in a compressed format and the underlying task must be performed with little or no decompression. In this paper, we revisit the ACS problem under this paradigm where the input sequences are given in their run-length encoded format. We present an algorithm to compute ACS(X,Y) in O(N log N) time using O(N) space, where N is the total length of sequences after run-length encoding.

Submitted 16 May, 2018; originally announced May 2018.

arXiv:1603.07457 [pdf, ps, other]

Parameterized Pattern Matching -- Succinctly

Authors: Arnab Ganguly, Rahul Shah, Sharma V. Thankachan

Abstract: We consider the $Parameterized$ $Pattern$ $Matching$ problem, where a pattern $P$ matches some location in a text $\mathsf{T}$ iff there is a one-to-one correspondence between the alphabet symbols of the pattern and those of the text. More specifically, assume that the text $\mathsf{T}$ contains $n$ characters from a static alphabet $Σ_s$ and a parameterized alphabet $Σ_p$, where $Σ_s \cap Σ_p = \varnothing$ and $|Σ_s \cup Σ_p|=σ$. A pattern $P$ matches a substring $S$ of $\mathsf{T}$ iff the static characters match exactly, and there exists a one-to-one function that renames the parameterized characters in $S$ to those in $P$. The previous indexing solution [Baker, STOC 1993], known as the $Parameterized$ $Suffix$ $Tree$, requires $Θ(n\log n)$ bits of space, and can find all $occ$ occurrences of $P$ in $\mathcal{O}(|P|\log σ+ occ)$ time. In this paper, we present the first succinct index that occupies $n \log σ+ \mathcal{O}(n)$ bits and answers queries in $\mathcal{O}((|P|+ occ\cdot \log n) \logσ\log \log σ)$ time. We also present a compact index that occupies $\mathcal{O}(n\logσ)$ bits and answers queries in $\mathcal{O}(|P|\log σ+ occ\cdot \log n)$ time. Furthermore, the techniques are extended to obtain the first succinct representation of the index of Shibuya for $Structural$ $Matching$ [SWAT, 2000], and of Idury and Schäffer for $Parameterized$ $Dictionary$ $Matching$ [CPM, 1994].
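
The matching condition defined in this abstract can be checked directly, without any index. The sketch below tests whether a substring $S$ parameterized-matches a pattern $P$ by requiring static characters to agree and by building a one-to-one renaming for the remaining characters, then scans the text naively. The function names and the representation of $Σ_s$ as a Python set are assumptions made for illustration; this is the definition only, not the succinct index of the paper.

```python
def p_match(s: str, p: str, static: set) -> bool:
    """True iff s parameterized-matches p.

    Characters in `static` must match exactly; all other characters must be
    related by a one-to-one renaming from s to p.
    """
    if len(s) != len(p):
        return False
    forward, backward = {}, {}  # renaming s -> p and its inverse p -> s
    for cs, cp in zip(s, p):
        if cs in static or cp in static:
            if cs != cp:
                return False
            continue
        # setdefault records the pairing on first sight and otherwise
        # returns the earlier pairing, which must agree in both directions.
        if forward.setdefault(cs, cp) != cp or backward.setdefault(cp, cs) != cs:
            return False
    return True

def p_occurrences(text: str, pattern: str, static: set):
    """Naive scan returning every 0-based start where pattern p-matches."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if p_match(text[i:i + m], pattern, static)]
```

With an empty static alphabet, `xyx` matches `zwz` under the renaming x to z and y to w, but it does not match `zzz`, since two distinct characters would be renamed to the same one.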

Submitted 5 April, 2016; v1 submitted 24 March, 2016; originally announced March 2016.

ACM Class: F.2.2

arXiv:1512.00378 [pdf, ps, other]

An In-place Framework for Exact and Approximate Shortest Unique Substring Queries

Authors: Wing-Kai Hon, Sharma V. Thankachan, Bojian Xu

Abstract: We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology. We design a generic in-place framework that solves both the exact and approximate $k$-mismatch SUS finding, using the minimum $2n$ memory words plus $n$ bytes space, where $n$ is the input string size. By using the in-place framework, we can find the exact and approximate $k$-mismatch SUS for every string position using a total of $O(n)$ and $O(n^2)$ time, respectively, regardless of the value of $k$. Our framework does not involve any compressed or succinct data structures and thus is practical and easy to implement.

Submitted 1 December, 2015; originally announced December 2015.

Comments: 15 pages. A preliminary version of this paper appears in Proceedings of the 26th International Symposium on Algorithms and Computation (ISAAC), Nagoya, Japan, 2015

arXiv:1509.08608 [pdf, other]

Probabilistic Threshold Indexing for Uncertain Strings

Authors: Sharma V. Thankachan, Manish Patil, Rahul Shah, Sudip Biswas

Abstract: Strings form a fundamental data type in computer systems. String searching has been extensively studied since the inception of computer science. Increasingly many applications have to deal with imprecise strings or strings with fuzzy information in them. String matching becomes a probabilistic event when a string contains uncertainty, i.e. each position of the string can have different probable characters with associated probability of occurrence for each character. Such uncertain strings are prevalent in various applications such as biological sequence data, event monitoring and automatic ECG annotations. We explore the problem of indexing uncertain strings to support efficient string searching. In this paper we consider two basic problems of string searching, namely substring searching and string listing. In substring searching, the task is to find the occurrences of a deterministic string in an uncertain string. We formulate the string listing problem for uncertain strings, where the objective is to output all the strings from a collection of strings that contain a probable occurrence of a deterministic query string. Indexing solutions for both these problems are significantly more challenging for uncertain strings than for deterministic strings. Given a construction time probability value $τ$, our indexes can be constructed in linear space and support queries in near optimal time for arbitrary values of probability threshold parameter greater than $τ$. To the best of our knowledge, this is the first indexing solution for searching in uncertain strings that achieves strong theoretical bounds and supports arbitrary values of probability threshold parameter. We also propose an approximate substring search index that can answer substring search queries with an additive error in optimal time. We conduct experiments to evaluate the performance of our indexes.

Submitted 29 September, 2015; originally announced September 2015.

Comments: 14 pages, 10 figures

arXiv:1404.2677 [pdf, ps, other]

Optimal Encodings for Range Majority Queries

Authors: Gonzalo Navarro, Sharma V. Thankachan

Abstract: We study the problem of designing a data structure that reports the positions of the distinct $τ$-majorities within any range of an array $A[1,n]$, without storing $A$. A $τ$-majority in a range $A[i,j]$, for $0<τ< 1$, is an element that occurs more than $τ(j-i+1)$ times in $A[i,j]$. We show that $Ω(n\log(1/τ))$ bits are necessary for any data structure able just to count the number of distinct $τ$-majorities in any range. Then, we design a structure using $O(n\log(1/τ))$ bits that returns one position of each $τ$-majority of $A[i,j]$ in $O((1/τ)\log\log_w(1/τ)\log n)$ time, on a RAM machine with word size $w$ (it can output any further position where each $τ$-majority occurs in $O(1)$ additional time). Finally, we show how to remove a $\log n$ factor from the time by adding $O(n\log\log n)$ bits of space to the structure.
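
The query defined in this abstract can be answered by brute force when the array $A$ itself is available, which makes the definition concrete. The sketch below counts the elements of $A[i,j]$ and reports, for each $τ$-majority, the position of its first occurrence in the range. It is a linear-time scan per query with a hypothetical function name, not the encoding of the paper, which by design answers queries without storing $A$.

```python
from collections import Counter

def tau_majorities(A, i, j, tau):
    """Distinct tau-majorities of A[i..j] (1-indexed, inclusive).

    Returns a dict mapping each element that occurs more than
    tau * (j - i + 1) times in the range to the 1-indexed position of its
    first occurrence there. Brute-force illustration only.
    """
    window = A[i - 1:j]
    threshold = tau * (j - i + 1)
    counts = Counter(window)
    result = {}
    for offset, x in enumerate(window):
        if counts[x] > threshold and x not in result:
            result[x] = i + offset
    return result
```

Because each $τ$-majority occupies more than a $τ$ fraction of the range, fewer than $1/τ$ distinct elements can qualify, which is why the output size in the abstract's time bound scales with $1/τ$.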

Submitted 3 October, 2014; v1 submitted 9 April, 2014; originally announced April 2014.

arXiv:1207.2632 [pdf, other]

On Optimal Top-K String Retrieval

Authors: Rahul Shah, Cheng Sheng, Sharma V. Thankachan, Jeffrey Scott Vitter

Abstract: Let ${\cal{D}}$ = $\{d_1, d_2, d_3, ..., d_D\}$ be a given set of $D$ (string) documents of total length $n$. The top-$k$ document retrieval problem is to index $\cal{D}$ such that when a pattern $P$ of length $p$, and a parameter $k$ come as a query, the index returns the $k$ most relevant documents to the pattern $P$. Hon et al. \cite{HSV09} gave the first linear space framework to solve this problem in $O(p + k\log k)$ time. This was improved by Navarro and Nekrich \cite{NN12} to $O(p + k)$. These results are powerful enough to support arbitrary relevance functions like frequency, proximity, PageRank, etc. In many applications like desktop or email search, the data resides on disk and hence disk-bound indexes are needed. Despite continued progress on this problem in terms of theoretical, practical and compression aspects, any non-trivial bounds in the external memory model have so far been elusive. The internal memory (or RAM) solution to this problem decomposes the problem into $O(p)$ subproblems and thus incurs the additive factor of $O(p)$. In external memory, these approaches will lead to $O(p)$ I/Os instead of the optimal $O(p/B)$ I/O term, where $B$ is the block-size. We re-interpret the problem independent of $p$, as interval stabbing with priority over a tree-shaped structure. This leads us to a linear space index in external memory supporting top-$k$ queries (with unsorted outputs) in near optimal $O(p/B + \log_B n + \log^{(h)} n + k/B)$ I/Os for any constant $h$ (where $\log^{(1)} n = \log n$ and $\log^{(h)} n = \log (\log^{(h-1)} n)$). We then obtain an $O(n\log^*n)$ space index with optimal $O(p/B+\log_B n + k/B)$ I/Os.

Submitted 17 November, 2012; v1 submitted 11 July, 2012; originally announced July 2012.

Comments: 3 figures

arXiv:1108.0554 [pdf, ps, other]

Towards an Optimal Space-and-Query-Time Index for Top-k Document Retrieval

Authors: Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan

Abstract: Let $\D = \{d_1,d_2,...d_D\}$ be a given set of $D$ string documents of total length $n$. Our task is to index $\D$, such that the $k$ most relevant documents for an online query pattern $P$ of length $p$ can be retrieved efficiently. We propose an index of size $|CSA|+n\log D(2+o(1))$ bits and $O(t_{s}(p)+k\log\log n+poly\log\log n)$ query time for the basic relevance metric term-frequency, where $|CSA|$ is the size (in bits) of a compressed full text index of $\D$, with $O(t_s(p))$ time for searching a pattern of length $p$. We further reduce the space to $|CSA|+n\log D(1+o(1))$ bits; however, the query time will be $O(t_s(p)+k(\log σ\log\log n)^{1+ε}+poly\log\log n)$, where $σ$ is the alphabet size and $ε>0$ is any constant.

Submitted 30 March, 2012; v1 submitted 2 August, 2011; originally announced August 2011.

Comments: 12 pages

arXiv:1007.5110 [pdf, other]

Fully Dynamic Data Structure for Top-k Queries on Uncertain Data

Authors: Manish Patil, Rahul Shah, Sharma V. Thankachan

Abstract: Top-$k$ queries allow end-users to focus on the most important (top-$k$) answers amongst those which satisfy the query. In traditional databases, a user defined score function assigns a score value to each tuple and a top-$k$ query returns $k$ tuples with the highest score. In an uncertain database, the top-$k$ answer depends not only on the scores but also on the membership probabilities of tuples. Several top-$k$ definitions covering different aspects of score-probability interplay have been proposed in the recent past \cite{R10,R4,R2,R8}. Most of the existing work in this research field is focused on developing efficient algorithms for answering top-$k$ queries on static uncertain data. Any change (insertion, deletion of a tuple or change in membership probability, score of a tuple) in underlying data forces re-computation of query answers. Such re-computations are not practical considering the dynamic nature of data in many applications. In this paper, we propose a fully dynamic data structure that uses ranking function $PRF^e(α)$ proposed by Li et al. \cite{R8} under the generally adopted model of $x$-relations \cite{R11}. $PRF^e$ can effectively approximate various other top-$k$ definitions on uncertain data based on the value of parameter $α$. An $x$-relation consists of a number of $x$-tuples, where an $x$-tuple is a set of mutually exclusive tuples (up to a constant number) called alternatives. Each $x$-tuple in a relation randomly instantiates into one tuple from its alternatives. For an uncertain relation with $N$ tuples, our structure can answer top-$k$ queries in $O(k\log N)$ time, handle an update in $O(\log N)$ time and takes $O(N)$ space. Finally, we evaluate the practical efficiency of our structure on both synthetic and real data.

Submitted 29 July, 2010; originally announced July 2010.

Showing 1–14 of 14 results for author: Thankachan, S V