-
Efficient Computation of Quantiles over Joins
Authors:
Nikolaos Tziavelis,
Nofar Carmeli,
Wolfgang Gatterbauer,
Benny Kimelfeld,
Mirek Riedewald
Abstract:
We present efficient algorithms for Quantile Join Queries, abbreviated as %JQ. A %JQ asks for the answer at a specified relative position (e.g., 50% for the median) under some ordering over the answers to a Join Query (JQ). Our goal is to avoid materializing the set of all join answers, and to achieve quasilinear time in the size of the database, regardless of the total number of answers. A recent dichotomy result rules out the existence of such an algorithm for a general family of queries and orders. Specifically, for acyclic JQs without self-joins, the problem becomes intractable for ordering by sum whenever we join more than two relations (and these joins are not trivial intersections). Moreover, even for basic ranking functions beyond sum, such as min or max over different attributes, so far it is not known whether there is any nontrivial tractable %JQ. In this work, we develop a new approach to solving %JQ. Our solution uses two subroutines: The first one needs to select what we call a "pivot answer". The second subroutine partitions the space of query answers according to this pivot, and continues searching in one partition that is represented as a new %JQ over a new database. For pivot selection, we develop an algorithm that works for a large class of ranking functions that are appropriately monotone. The second subroutine requires a customized construction for the specific ranking function at hand. We show the benefit and generality of our approach by using it to establish several new complexity results. First, we prove the tractability of min and max for all acyclic JQs, thereby resolving the above question. Second, we extend the previous %JQ dichotomy for sum to all partial sums. Third, we handle the intractable cases of sum by devising a deterministic approximation scheme that applies to every acyclic JQ.
Submitted 25 May, 2023;
originally announced May 2023.
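To make the goal of the abstract above concrete, here is a minimal sketch of answering a quantile query over a two-relation join ordered by sum without materializing the join answers. It is not the pivot-and-partition algorithm of the paper: it swaps in a plain binary search over the value domain, and it assumes integer weights. The function name `quantile_sum_join` and the input layout (lists of `(key, weight)` pairs) are hypothetical choices made for this illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def quantile_sum_join(R, S, phi):
    """Weight of the answer at relative position phi (0 <= phi <= 1) of the
    join of R and S on key, with answers ordered by w_r + w_s.

    Assumes integer weights. Join answers are counted, never listed.
    """
    r_by_key = defaultdict(list)
    s_by_key = defaultdict(list)
    for key, w in R:
        r_by_key[key].append(w)
    for key, w in S:
        s_by_key[key].append(w)
    for ws in s_by_key.values():
        ws.sort()

    # Only keys present on both sides produce answers.
    shared = [k for k in r_by_key if k in s_by_key]
    total = sum(len(r_by_key[k]) * len(s_by_key[k]) for k in shared)
    if total == 0:
        raise ValueError("the join has no answers")
    rank = min(total - 1, int(phi * total))  # 0-based position of the target

    def count_leq(t):
        # Number of join answers with weight <= t, one binary search per R tuple.
        return sum(
            bisect_right(s_by_key[k], t - w) for k in shared for w in r_by_key[k]
        )

    # Binary search for the smallest weight t with more than `rank` answers <= t.
    lo = min(min(r_by_key[k]) + s_by_key[k][0] for k in shared)
    hi = max(max(r_by_key[k]) + s_by_key[k][-1] for k in shared)
    while lo < hi:
        mid = (lo + hi) // 2
        if count_leq(mid) > rank:
            hi = mid
        else:
            lo = mid + 1
    return lo


R = [(1, 1), (1, 5), (2, 10)]
S = [(1, 2), (1, 3), (2, 0)]
median_weight = quantile_sum_join(R, S, 0.5)  # join weights are 3, 4, 7, 8, 10
```

Each counting pass costs one binary search per tuple of `R`, so the running time depends on the database size and the weight range rather than on the number of join answers. The paper's approach instead narrows the search by selecting a pivot answer and continuing in one partition.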
-
Any-k Algorithms for Enumerating Ranked Answers to Conjunctive Queries
Authors:
Nikolaos Tziavelis,
Wolfgang Gatterbauer,
Mirek Riedewald
Abstract:
We study ranked enumeration for Conjunctive Queries (CQs) where the answers are ordered by a given ranking function (e.g., an ORDER BY clause in SQL). We develop "any-k" algorithms, which, without knowing the number k of desired answers, push down the ranking into joins by carefully ordering the computation of intermediate tuples and avoiding materialization of join answers until they are needed. For this to be possible, the ranking function needs to obey a particular type of monotonicity. Supported ranking functions include the common sum-of-weights, where answers are compared by the sum of input-tuple weights, as well as any commutative selective dioid. Our results extend a well-known unranked-enumeration dichotomy, which states that only free-connex CQs are tractable (under certain hardness hypotheses and for CQs without self-joins). For this class of queries and with n denoting the size of the input, the data complexity of our ranked enumeration approach for the time to the k-th CQ answer is O(n + k log k), which is only a logarithmic factor slower than the O(n + k) unranked-enumeration guarantee. A core insight of our work is that ranked enumeration for CQs is closely related to Dynamic Programming and the fundamental task of path enumeration in a weighted DAG. We uncover a previously unknown tradeoff: one any-k algorithm has lower complexity when the number of returned answers is small, the other when their number is large. This tradeoff is eliminated under a stricter monotonicity property that we define and exploit for a novel algorithm that asymptotically dominates all previously known alternatives, including Eppstein's algorithm for sum-of-weights path enumeration. We empirically demonstrate the findings of our theoretical analysis in an experimental study that highlights the superiority of our approach over the join-then-rank approach that existing database systems follow.
Submitted 12 October, 2023; v1 submitted 11 May, 2022;
originally announced May 2022.
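The idea in the abstract above of pushing the ranking into the join, so that answers come out in order without first materializing all of them, can be sketched for the simplest case of a two-relation join ranked by sum of weights. This is an illustrative simplification, not one of the paper's any-k algorithms: it sorts each join group up front (so preprocessing is O(n log n) rather than O(n)), and the name `ranked_join` and the tuple layouts `R(a, b, w)` and `S(b, c, w)` are assumptions made for the example.

```python
import heapq
from collections import defaultdict


def ranked_join(R, S):
    """Yield (weight, r, s) for the join of R(a, b, w) and S(b, c, w) on b,
    in nondecreasing order of r.w + s.w, producing answers lazily."""
    # Group S by the join attribute and sort each group by weight.
    s_by_b = defaultdict(list)
    for s in S:
        s_by_b[s[0]].append(s)
    for group in s_by_b.values():
        group.sort(key=lambda s: s[2])

    # Seed the priority queue with the best partner of every joinable R tuple.
    heap = []
    for i, r in enumerate(R):
        group = s_by_b.get(r[1])
        if group:
            heap.append((r[2] + group[0][2], i, 0))
    heapq.heapify(heap)

    while heap:
        total, i, j = heapq.heappop(heap)
        r = R[i]
        group = s_by_b[r[1]]
        yield total, r, group[j]
        # The only successor of (r, j-th partner) is (r, (j+1)-th partner).
        if j + 1 < len(group):
            heapq.heappush(heap, (r[2] + group[j + 1][2], i, j + 1))
```

Because `ranked_join` is a generator, a caller can stop after any number of answers, for example with `itertools.islice(ranked_join(R, S), k)`, and each further answer then costs one pop and at most one push on the queue.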
-
Beyond Equi-joins: Ranking, Enumeration and Factorization
Authors:
Nikolaos Tziavelis,
Wolfgang Gatterbauer,
Mirek Riedewald
Abstract:
We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with $n$ denoting the number of tuples in the database, we guarantee for acyclic full join queries with inequality conditions that for every value of $k$, the $k$ top-ranked answers are returned in $\mathcal{O}(n \operatorname{polylog} n + k \log k)$ time. This is within a polylogarithmic factor of $\mathcal{O}(n + k \log k)$, i.e., the best known complexity for equi-joins, and even of $\mathcal{O}(n+k)$, i.e., the time it takes to look at the input and return $k$ answers in any order. Our guarantees extend to join queries with selections and many types of projections (namely those called "free-connex" queries and those that use bag semantics). Remarkably, they hold even when the number of join results is $n^\ell$ for a join of $\ell$ relations. The key ingredient is a novel $\mathcal{O}(n \operatorname{polylog} n)$-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. In addition to providing the first non-trivial theoretical guarantees beyond equi-joins, we show in an experimental study that our ranked-enumeration approach is also memory-efficient and fast in practice, beating the running time of state-of-the-art database systems by orders of magnitude.
Submitted 30 August, 2021; v1 submitted 28 January, 2021;
originally announced January 2021.
-
Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries
Authors:
Nofar Carmeli,
Nikolaos Tziavelis,
Wolfgang Gatterbauer,
Benny Kimelfeld,
Mirek Riedewald
Abstract:
We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in database size. Specifically, we embark on the challenge of identifying the tractable answer orderings, that is, those orders that allow for such complexity guarantees. To better understand the computational challenge at hand, we also investigate the more modest task of providing access to only a single answer (i.e., finding the answer at a given position), a task that we refer to as the selection problem, and ask when it can be performed in quasilinear time. We also explore the question of when selection is indeed easier than ranked direct access.
We begin with lexicographic orders. For each of the two problems, we give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orders for every CQ without self-joins. We then continue to the more general orders by the sum of attribute weights and establish the corresponding decidable characterizations, for each of the two problems, of the tractable CQs without self-joins. Finally, we explore the question of when the satisfaction of Functional Dependencies (FDs) can be utilized for tractability, and establish the corresponding generalizations of our characterizations for every set of unary FDs.
Submitted 28 November, 2022; v1 submitted 22 December, 2020;
originally announced December 2020.
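As a small illustration of the direct-access task described in the abstract above, the sketch below preprocesses a two-relation join so that the k-th answer in one particular lexicographic order can be returned with a binary search, without listing the join answers. It is only a toy instance under stated assumptions: the relations are `R(a, b)` and `S(b, c)` under set semantics, the order is lexicographic on `(a, b, c)`, and the class name `DirectAccess` is hypothetical. Which orders admit such guarantees in general is the subject of the paper and is not captured here.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import accumulate


class DirectAccess:
    """Index-based access to the join of R(a, b) and S(b, c), with answers
    ordered lexicographically by (a, b, c). Duplicates are removed."""

    def __init__(self, R, S):
        # Sorted list of c values for every join value b.
        c_by_b = defaultdict(list)
        for b, c in set(S):
            c_by_b[b].append(c)
        for cs in c_by_b.values():
            cs.sort()
        self.c_by_b = c_by_b
        # R tuples that have at least one partner, sorted by (a, b).
        self.r = [t for t in sorted(set(R)) if t[1] in c_by_b]
        # prefix[i] = number of answers contributed by r[0..i].
        self.prefix = list(accumulate(len(c_by_b[b]) for _, b in self.r))

    def __len__(self):
        return self.prefix[-1] if self.prefix else 0

    def __getitem__(self, k):
        if not 0 <= k < len(self):
            raise IndexError(k)
        # First R tuple whose cumulative answer count exceeds k.
        i = bisect_right(self.prefix, k)
        offset = k - (self.prefix[i - 1] if i else 0)
        a, b = self.r[i]
        return (a, b, self.c_by_b[b][offset])
```

After construction, `DirectAccess(R, S)[k]` costs one binary search over the prefix counts plus a constant number of lookups, which matches the spirit of logarithmic access time after quasilinear preprocessing.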
-
Optimal Join Algorithms Meet Top-k
Authors:
Nikolaos Tziavelis,
Wolfgang Gatterbauer,
Mirek Riedewald
Abstract:
Top-k queries have been studied intensively in the database community and they are an important means to reduce query cost when only the "best" or "most interesting" results are needed instead of the full output. While some optimality results exist, e.g., the famous Threshold Algorithm, they hold only in a fairly limited model of computation that does not account for the cost incurred by large intermediate results and hence is not aligned with typical database-optimizer cost models. On the other hand, the idea of avoiding large intermediate results is arguably the main goal of recent work on optimal join algorithms, which uses the standard RAM model of computation to determine algorithm complexity. This research has created a lot of excitement due to its promise of reducing the time complexity of join queries with cycles, but it has mostly focused on full-output computation. We argue that the two areas can and should be studied from a unified point of view in order to achieve optimality in the common model of computation for a very general class of top-k-style join queries. This tutorial has two main objectives. First, we will explore and contrast the main assumptions, concepts, and algorithmic achievements of the two research areas. Second, we will cover recent, as well as some older, approaches that emerged at the intersection to support efficient ranked enumeration of join-query results. These are related to classic work on k-shortest path algorithms and more general optimization problems, some of which dates back to the 1950s. We demonstrate that this line of research warrants renewed attention in the challenging context of ranked enumeration for general join queries.
Submitted 1 May, 2020;
originally announced May 2020.
-
Optimal Algorithms for Ranked Enumeration of Answers to Full Conjunctive Queries
Authors:
Nikolaos Tziavelis,
Deepak Ajwani,
Wolfgang Gatterbauer,
Mirek Riedewald,
Xiaofeng Yang
Abstract:
We study ranked enumeration of join-query results according to very general orders defined by selective dioids. Our main contribution is a framework for ranked enumeration over a class of dynamic programming problems that generalizes seemingly different problems that had been studied in isolation. To this end, we extend classic algorithms that find the k-shortest paths in a weighted graph. For full conjunctive queries, including cyclic ones, our approach is optimal in terms of the time to return the top result and the delay between results. These optimality properties are derived for the widely used notion of data complexity, which treats query size as a constant. By performing a careful cost analysis, we are able to uncover a previously unknown tradeoff between two incomparable enumeration approaches: one has lower complexity when the number of returned results is small, the other when the number is very large. We theoretically and empirically demonstrate the superiority of our techniques over batch algorithms, which produce the full result and then sort it. Our technique is not only faster for returning the first few results, but on some inputs beats the batch algorithm even when all results are produced.
Submitted 11 September, 2020; v1 submitted 13 November, 2019;
originally announced November 2019.
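The abstract above ranks answers by orders defined by selective dioids and relates ranked enumeration to dynamic programming. As a hedged sketch of what that combination looks like, the code below runs one bottom-up dynamic program over a multi-stage graph, parameterized by a dioid, and computes only the weight of the top-ranked path (no enumeration). The `Dioid` container, the two instances, and the function `top_weight` are hypothetical names for this illustration; the two instances shown, (min, +) and (max, ×), are standard examples of semirings whose addition always returns one of its arguments.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Dioid:
    plus: Callable[[Any, Any], Any]   # selective: returns one of its arguments
    times: Callable[[Any, Any], Any]  # combines weights along a path
    zero: Any                         # neutral element of plus
    one: Any                          # neutral element of times


TROPICAL = Dioid(min, lambda x, y: x + y, float("inf"), 0)  # cheapest path
MAX_TIMES = Dioid(max, lambda x, y: x * y, 0.0, 1.0)        # most probable path


def top_weight(stages, d):
    """stages[i] is a list of edges (u, v, w) from stage i to stage i + 1.

    Returns a dict mapping each stage-0 node to the dioid aggregate over all
    paths that start there and reach the last stage.
    """
    best = None  # None marks "past the last stage", where every node is worth d.one
    for edges in reversed(stages):
        new_best = {}
        for u, v, w in edges:
            tail = d.one if best is None else best.get(v, d.zero)
            candidate = d.times(w, tail)
            new_best[u] = d.plus(new_best.get(u, d.zero), candidate)
        best = new_best
    return best or {}


stages = [[("s", "a", 1), ("s", "b", 4)], [("a", "t", 5), ("b", "t", 1)]]
cheapest = top_weight(stages, TROPICAL)["s"]  # min(1 + 5, 4 + 1)
```

Swapping `TROPICAL` for `MAX_TIMES` changes the ranking function without touching the dynamic program, which is the kind of generality the dioid abstraction is meant to provide.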