-
Efficient Computation of Quantiles over Joins
Authors:
Nikolaos Tziavelis,
Nofar Carmeli,
Wolfgang Gatterbauer,
Benny Kimelfeld,
Mirek Riedewald
Abstract:
We present efficient algorithms for Quantile Join Queries, abbreviated as %JQ. A %JQ asks for the answer at a specified relative position (e.g., 50% for the median) under some ordering over the answers to a Join Query (JQ). Our goal is to avoid materializing the set of all join answers, and to achieve quasilinear time in the size of the database, regardless of the total number of answers. A recent dichotomy result rules out the existence of such an algorithm for a general family of queries and orders. Specifically, for acyclic JQs without self-joins, the problem becomes intractable for ordering by sum whenever we join more than two relations (and these joins are not trivial intersections). Moreover, even for basic ranking functions beyond sum, such as min or max over different attributes, so far it is not known whether there is any nontrivial tractable %JQ. In this work, we develop a new approach to solving %JQ. Our solution uses two subroutines. The first one selects what we call a "pivot answer". The second subroutine partitions the space of query answers according to this pivot, and continues searching in one partition that is represented as a new %JQ over a new database. For pivot selection, we develop an algorithm that works for a large class of ranking functions that are appropriately monotone. The second subroutine requires a customized construction for the specific ranking function at hand. We show the benefit and generality of our approach by using it to establish several new complexity results. First, we prove the tractability of min and max for all acyclic JQs, thereby resolving the above question. Second, we extend the previous %JQ dichotomy for sum to all partial sums. Third, we handle the intractable cases of sum by devising a deterministic approximation scheme that applies to every acyclic JQ.
Submitted 25 May, 2023;
originally announced May 2023.
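As a rough illustration of searching for a quantile without materializing the join, the sketch below handles only a join of two relations ordered by the sum of weights, which is the case the abstract's hardness statement does not cover. It is not the paper's algorithm: the pivot-selection subroutine is swapped for a plain binary search over integer values, and the names `count_below` and `kth_sum`, the `(join_key, weight)` tuple layout, and the integer weights are all assumptions made for this example.

```python
from bisect import bisect_left
from collections import defaultdict


def count_below(R, S, pivot):
    """Count join answers (r, s) with r.weight + s.weight < pivot.

    R and S are lists of (join_key, weight) pairs (a hypothetical layout).
    The join itself is never materialized.
    """
    s_by_key = defaultdict(list)
    for key, w in S:
        s_by_key[key].append(w)
    for ws in s_by_key.values():
        ws.sort()
    total = 0
    for key, w in R:
        ws = s_by_key.get(key)
        if ws:
            # Number of S-weights strictly smaller than pivot - w.
            total += bisect_left(ws, pivot - w)
    return total


def kth_sum(R, S, k):
    """Return the k-th smallest sum (0-based) among the join answers.

    Assumes integer weights and 0 <= k < number of join answers.
    Binary search over values stands in for pivot selection here.
    """
    lo = min(w for _, w in R) + min(w for _, w in S)
    hi = max(w for _, w in R) + max(w for _, w in S)
    while lo < hi:
        mid = (lo + hi) // 2
        if count_below(R, S, mid + 1) >= k + 1:
            hi = mid
        else:
            lo = mid + 1
    return lo


R = [("a", 1), ("a", 5), ("b", 2)]
S = [("a", 2), ("a", 4), ("b", 10)]
median = kth_sum(R, S, 2)  # the answer sums are 3, 5, 7, 9, 12
```

Each call to `count_below` plays the role of the partitioning step: it tells the search which side of the candidate value holds the requested position. For brevity the sketch rebuilds and re-sorts the index on every call; a real implementation would build it once.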
-
Direct Access for Answers to Conjunctive Queries with Aggregation
Authors:
Idan Eldar,
Nofar Carmeli,
Benny Kimelfeld
Abstract:
We study the fine-grained complexity of conjunctive queries with grouping and aggregation. For some common aggregate functions (e.g., min, max, count, sum), such a query can be phrased as an ordinary conjunctive query over a database annotated with a suitable commutative semiring. Specifically, we investigate the ability to evaluate such queries by constructing in log-linear time a data structure that provides logarithmic-time direct access to the answers ordered by a given lexicographic order. This task is nontrivial since the number of answers might be larger than log-linear in the size of the input, and so, the data structure needs to provide a compact representation of the space of answers.
In the absence of aggregation and annotation, past research provides a sufficient tractability condition on queries and orders. For queries without self-joins, this condition is not just sufficient, but also necessary (under conventional lower-bound assumptions in fine-grained complexity). We show that all past results continue to hold for annotated databases, assuming that the annotation itself is not part of the lexicographic order. On the other hand, we show infeasibility for the case of count-distinct, which does not have any efficient representation as a commutative semiring. We then investigate the ability to include the aggregate and annotation outcome in the lexicographic order. Among the hardness results, one case stands out as tractable: a semiring with an idempotent addition, such as those of min and max. Notably, this case also captures count-distinct over a logarithmic-size domain.
Submitted 9 March, 2023;
originally announced March 2023.
-
Unbalanced Triangle Detection and Enumeration Hardness for Unions of Conjunctive Queries
Authors:
Karl Bringmann,
Nofar Carmeli
Abstract:
We study the enumeration of answers to Unions of Conjunctive Queries (UCQs) with optimal time guarantees. More precisely, we wish to identify the queries that can be solved with linear preprocessing time and constant delay. Despite the basic nature of this problem, it was shown only recently that UCQs can be solved within these time bounds if they admit free-connex union extensions, even if all individual CQs in the union are intractable with respect to the same complexity measure. Our goal is to understand whether there exist additional tractable UCQs, not covered by the currently known algorithms. As a first step, we show that some previously unclassified UCQs are hard using the classic 3SUM hypothesis, via a known reduction from 3SUM to triangle listing in graphs. As a second step, we identify a question about a variant of this graph task which is unavoidable if we want to classify all self-join-free UCQs: is it possible to decide the existence of a triangle in a vertex-unbalanced tripartite graph in linear time? We prove that this task is equivalent in hardness to some family of UCQs. Finally, we show a dichotomy for unions of two self-join-free CQs if we assume the answer to this question is negative. Our conclusion is that, to reason about a class of enumeration problems defined by UCQs, it is enough to study the single decision problem of detecting triangles in unbalanced graphs. Without a breakthrough for triangle detection, we have no hope of finding an efficient algorithm for additional unions of two self-join-free CQs. On the other hand, if we one day have such a triangle detection algorithm, we will immediately obtain an efficient algorithm for a family of UCQs that are currently not known to be tractable.
Submitted 14 February, 2023; v1 submitted 21 October, 2022;
originally announced October 2022.
-
Conjunctive Queries With Self-Joins, Towards a Fine-Grained Complexity Analysis
Authors:
Nofar Carmeli,
Luc Segoufin
Abstract:
Even though query evaluation is a fundamental task in databases, known classifications of conjunctive queries by their fine-grained complexity only apply to queries without self-joins. We study how self-joins affect enumeration complexity, with the aim of building upon the known results to achieve general classifications. We do this by examining the extension of two known dichotomies: one with respect to linear delay, and one with respect to constant delay after linear preprocessing. As this turns out to be an intricate investigation, this paper is structured as an example-driven discussion that initiates this analysis. We show enumeration algorithms that rely on self-joins to efficiently evaluate queries that otherwise cannot be answered with the same guarantees. Due to these additional tractable cases, the hardness proofs are more complex than in the self-join-free case. We show how to harness a known tagging technique to prove hardness of queries with self-joins. Our study offers sufficient conditions and necessary conditions for tractability and settles the cases of queries of low arity and queries with cyclic cores. Nevertheless, many cases remain open.
Submitted 9 December, 2022; v1 submitted 10 June, 2022;
originally announced June 2022.
-
Tight Fine-Grained Bounds for Direct Access on Join Queries
Authors:
Karl Bringmann,
Nofar Carmeli,
Stefan Mengel
Abstract:
We consider the task of lexicographic direct access to query answers. That is, we want to simulate an array containing the answers of a join query sorted in a lexicographic order chosen by the user. A recent dichotomy showed for which queries and orders this task can be done in polylogarithmic access time after quasilinear preprocessing, but this dichotomy does not tell us how much time is required in the cases classified as hard. We determine the preprocessing time needed to achieve polylogarithmic access time for all join queries and all lexicographic orders. To this end, we propose a decomposition-based general algorithm for direct access on join queries. We then explore its optimality by proving lower bounds for the preprocessing time based on the hardness of a certain online Set-Disjointness problem, which shows that our algorithm's bounds are tight for all lexicographic orders on join queries. Then, we prove the hardness of Set-Disjointness based on the Zero-Clique Conjecture, which is an established conjecture from fine-grained complexity theory. Interestingly, while proving our lower bound, we show that self-joins do not affect the complexity of direct access (up to logarithmic factors). Our algorithm can also be used to solve queries with projections and relaxed order requirements, though in these cases, its running time is not necessarily optimal. We also show that techniques similar to those used in our lower bounds can be used to prove that, for enumerating answers to Loomis-Whitney joins, it is not possible to significantly improve upon trivially computing all answers at preprocessing. This, in turn, gives further evidence (based on the Zero-Clique Conjecture) for the enumeration hardness of self-join-free cyclic joins with respect to linear preprocessing and constant delay.
Submitted 3 May, 2024; v1 submitted 7 January, 2022;
originally announced January 2022.
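To make "simulating an array of sorted join answers" concrete, here is a minimal sketch for one easy case: the join of R(a, b) and S(b, c) with answers ordered lexicographically by (b, a, c). It is not the decomposition-based algorithm from the paper; the class name `DirectAccess`, the tuple layouts, and the choice of query and order are assumptions picked so that counting per join value is enough.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import accumulate


class DirectAccess:
    """Array-like view of R(a, b) JOIN S(b, c), sorted by (b, a, c).

    Preprocessing sorts the input once; each access is a binary search
    plus a divmod, and the join is never materialized.
    """

    def __init__(self, R, S):
        r_by_b, s_by_b = defaultdict(list), defaultdict(list)
        for a, b in set(R):  # set semantics: drop duplicate tuples
            r_by_b[b].append(a)
        for b, c in set(S):
            s_by_b[b].append(c)
        self.keys = sorted(b for b in r_by_b if b in s_by_b)
        self.r = {b: sorted(r_by_b[b]) for b in self.keys}
        self.s = {b: sorted(s_by_b[b]) for b in self.keys}
        # prefix[i] = number of answers whose b-value is among keys[0..i]
        self.prefix = list(
            accumulate(len(self.r[b]) * len(self.s[b]) for b in self.keys)
        )

    def __len__(self):
        return self.prefix[-1] if self.prefix else 0

    def __getitem__(self, k):
        if not 0 <= k < len(self):
            raise IndexError(k)
        i = bisect_right(self.prefix, k)  # bucket containing position k
        b = self.keys[i]
        offset = k - (self.prefix[i - 1] if i else 0)
        a_idx, c_idx = divmod(offset, len(self.s[b]))
        return (b, self.r[b][a_idx], self.s[b][c_idx])


R = [(1, "x"), (2, "x"), (3, "y")]
S = [("x", 10), ("x", 20), ("y", 30), ("z", 40)]
answers = DirectAccess(R, S)
third = answers[2]
```

Usage is the same as indexing a list: `answers[k]` returns the answer at position `k` in the chosen order, and `len(answers)` gives the number of answers.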
-
Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries
Authors:
Nofar Carmeli,
Nikolaos Tziavelis,
Wolfgang Gatterbauer,
Benny Kimelfeld,
Mirek Riedewald
Abstract:
We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in database size. Specifically, we embark on the challenge of identifying the tractable answer orderings, that is, those orders that allow for such complexity guarantees. To better understand the computational challenge at hand, we also investigate the more modest task of providing access to only a single answer (i.e., finding the answer at a given position), a task that we refer to as the selection problem, and ask when it can be performed in quasilinear time. We also explore the question of when selection is indeed easier than ranked direct access.
We begin with lexicographic orders. For each of the two problems, we give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orders for every CQ without self-joins. We then continue to the more general orders by the sum of attribute weights and establish the corresponding decidable characterizations, for each of the two problems, of the tractable CQs without self-joins. Finally, we explore the question of when the satisfaction of Functional Dependencies (FDs) can be utilized for tractability, and establish the corresponding generalizations of our characterizations for every set of unary FDs.
Submitted 28 November, 2022; v1 submitted 22 December, 2020;
originally announced December 2020.
-
Database Repairing with Soft Functional Dependencies
Authors:
Nofar Carmeli,
Martin Grohe,
Benny Kimelfeld,
Ester Livshits,
Muhammad Tibi
Abstract:
A common interpretation of soft constraints penalizes the database for every violation of every constraint, where the penalty is the cost (weight) of the constraint. A computational challenge is that of finding an optimal subset: a collection of database tuples that minimizes the total penalty when each tuple has a cost of being excluded. When the constraints are strict (i.e., have an infinite cost), this subset is a "cardinality repair" of an inconsistent database; in soft interpretations, this subset corresponds to a "most probable world" of a probabilistic database, a "most likely intention" of a probabilistic unclean database, and so on. Within the class of functional dependencies, the complexity of finding a cardinality repair is thoroughly understood. Yet, very little is known about the complexity of this problem in the more general soft semantics. This paper makes significant progress in this direction. In addition to general insights about the hardness and approximability of the problem, we present algorithms for two special cases: a single functional dependency, and a bipartite matching. The latter is the problem of finding an optimal "almost matching" of a bipartite graph where a penalty is paid for every lost edge and every violation of monogamy.
Submitted 29 September, 2020;
originally announced September 2020.
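The objective described in the abstract can be written down directly. The sketch below is only a brute-force illustration for tiny inputs, not one of the paper's algorithms. It assumes that a violation of a functional dependency is a pair of kept tuples that agree on the left-hand side and differ on the right-hand side; the function names, the `(lhs, rhs, weight)` encoding, and the sample data are hypothetical.

```python
from itertools import combinations


def soft_repair_cost(rows, kept, fds, delete_cost):
    """Total penalty of keeping exactly the rows whose indices are in `kept`.

    rows: list of dicts; fds: list of (lhs_attrs, rhs_attrs, weight);
    delete_cost[i]: cost of excluding rows[i].
    """
    excluded = sum(c for i, c in enumerate(delete_cost) if i not in kept)
    violations = 0
    for lhs, rhs, weight in fds:
        for i, j in combinations(sorted(kept), 2):
            same_lhs = all(rows[i][a] == rows[j][a] for a in lhs)
            diff_rhs = any(rows[i][a] != rows[j][a] for a in rhs)
            if same_lhs and diff_rhs:
                violations += weight  # assumed: one penalty per violating pair
    return excluded + violations


def best_subset(rows, fds, delete_cost):
    """Exhaustive search over all subsets (exponential; illustration only)."""
    n = len(rows)
    candidates = (
        frozenset(c) for size in range(n + 1) for c in combinations(range(n), size)
    )
    best = min(candidates, key=lambda k: soft_repair_cost(rows, k, fds, delete_cost))
    return best, soft_repair_cost(rows, best, fds, delete_cost)


rows = [
    {"city": "Paris", "country": "FR"},
    {"city": "Paris", "country": "DE"},
    {"city": "Rome", "country": "IT"},
]
fds = [(("city",), ("country",), 3)]  # city -> country, violation weight 3

cheap_delete = best_subset(rows, fds, [1, 5, 1])
costly_delete = best_subset(rows, fds, [4, 5, 1])
```

With cheap deletions the optimum drops the first row, as a strict repair would. With costly deletions it keeps all three rows and pays for the violation instead, which is the behavior that distinguishes the soft semantics from cardinality repairs.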
-
Tuple-Independent Representations of Infinite Probabilistic Databases
Authors:
Nofar Carmeli,
Martin Grohe,
Peter Lindner,
Christoph Standke
Abstract:
Probabilistic databases (PDBs) are probability spaces over database instances. They provide a framework for handling uncertainty in databases, as occurs due to data integration, noisy data, data from unreliable sources or randomized processes. Most of the existing theory literature investigated finite, tuple-independent PDBs (TI-PDBs) where the occurrences of tuples are independent events. Only recently, Grohe and Lindner (PODS '19) introduced independence assumptions for PDBs beyond the finite domain assumption. In the finite setting, a major argument for discussing the theoretical properties of TI-PDBs is that they can be used to represent any finite PDB via views. This is no longer the case once the number of tuples is countably infinite. In this paper, we systematically study the representability of infinite PDBs in terms of TI-PDBs and the related block-independent disjoint PDBs.
The central question is which infinite PDBs are representable as first-order views over tuple-independent PDBs. We give a necessary condition for the representability of PDBs and provide a sufficient criterion for representability in terms of the probability distribution of a PDB. With various examples, we explore the limits of our criteria. We show that conditioning on first-order properties yields no additional power in terms of expressivity. Finally, we discuss the relation between purely logical and arithmetic reasons for (non-)representability.
Submitted 19 April, 2022; v1 submitted 21 August, 2020;
originally announced August 2020.
-
Constructing Explainable Opinion Graphs from Reviews
Authors:
Nofar Carmeli,
Xiaolan Wang,
Yoshihiko Suhara,
Stefanos Angelidis,
Yuliang Li,
**feng Li,
Wang-Chiew Tan
Abstract:
The Web is a major resource of both factual and subjective information. While there are significant efforts to organize factual information into knowledge bases, there is much less work on organizing opinions, which are abundant in subjective data, into a structured format.
We present ExplainIt, a system that extracts and organizes opinions into an opinion graph, which is useful for downstream applications such as generating explainable review summaries and facilitating search over opinion phrases. In such graphs, a node represents a set of semantically similar opinions extracted from reviews and an edge between two nodes signifies that one node explains the other. ExplainIt mines explanations in a supervised method and groups similar opinions together in a weakly supervised way before combining the clusters of opinions together with their explanation relationships into an opinion graph. We experimentally demonstrate that the explanation relationships generated in the opinion graph are of good quality, and our labeled datasets for explanation mining and grouping opinions are publicly available.
Submitted 13 April, 2021; v1 submitted 29 May, 2020;
originally announced June 2020.
-
Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration
Authors:
Nofar Carmeli,
Shai Zeevi,
Christoph Berkholz,
Benny Kimelfeld,
Nicole Schweikardt
Abstract:
As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a polylogarithmic delay after a linear-time preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports polylogarithmic-delay enumeration. In this paper, we seek a structure that supports the more demanding task of a "random permutation": polylogarithmic-delay enumeration in truly random order. Enumeration of this kind is required if downstream applications assume that the intermediate results are representative of the whole result set in a statistically valuable manner. An even more demanding task is that of a "random access": polylogarithmic-time retrieval of an answer whose position is given.
We establish that the free-connex acyclic CQs are tractable in all three senses: enumeration, random-order enumeration, and random access; and in the absence of self-joins, it follows from past results that every other CQ is intractable by each of the three (under some fine-grained complexity assumptions). However, the three yardsticks are separated in the case of a union of CQs (UCQ): while a union of free-connex acyclic CQs has a tractable enumeration, it may (provably) admit no random access. For such UCQs we devise a random-order enumeration whose delay is logarithmic in expectation. We also identify a subclass of UCQs for which we can provide random access with polylogarithmic access time. Finally, we present an implementation and an empirical study that show a considerable practical superiority of our random-order enumeration approach over state-of-the-art alternatives.
Submitted 23 December, 2019;
originally announced December 2019.
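The link between random access and random-order enumeration mentioned in the abstract can be illustrated with a lazy Fisher-Yates shuffle over answer positions. This is a minimal sketch under the assumption that some `access(k)` function returns the answer at position `k`; the names `random_order`, `access`, and `rng` are hypothetical, and the paper's own construction for unions of CQs is more involved.

```python
import random


def random_order(access, n, rng=random):
    """Yield all n answers exactly once, in uniformly random order.

    access(k) is assumed to return the answer at position k (0-based).
    The permutation is built lazily: only positions that have been
    swapped are stored, so no array of size n is allocated up front.
    """
    swapped = {}  # position -> index currently stored there, if not itself
    for i in range(n):
        j = rng.randrange(i, n)
        index_at_i = swapped.get(i, i)
        index_at_j = swapped.get(j, j)
        swapped[j] = index_at_i      # move the value at i into slot j
        swapped.pop(i, None)         # slot i is never read again
        yield access(index_at_j)


answers = ["a", "b", "c", "d", "e"]
shuffled = list(random_order(lambda k: answers[k], len(answers), random.Random(0)))
```

Each step costs one random-access call plus constant dictionary work, so the delay of the enumeration matches the access time of the underlying structure up to a constant.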
-
On the Enumeration Complexity of Unions of Conjunctive Queries
Authors:
Nofar Carmeli,
Markus Kröll
Abstract:
We study the enumeration complexity of Unions of Conjunctive Queries (UCQs). We aim to identify the UCQs that are tractable in the sense that the answer tuples can be enumerated with a linear preprocessing phase and a constant delay between every two successive tuples. It has been established that, in the absence of self-joins and under conventional complexity assumptions, the CQs that admit such an evaluation are precisely the free-connex ones. A union of tractable CQs is always tractable. We generalize the notion of free-connexity from CQs to UCQs, thus showing that some unions containing intractable CQs are, in fact, tractable. Interestingly, some unions consisting of only intractable CQs are tractable too. We show how to use the techniques presented in this article also in settings where the database contains cardinality dependencies (including functional dependencies and key constraints) or when the UCQs contain disequalities. The question of finding a full characterization of the tractability of UCQs remains open. Nevertheless, we prove that for several classes of queries, free-connexity fully captures the tractable UCQs.
Submitted 6 May, 2021; v1 submitted 10 December, 2018;
originally announced December 2018.
-
Enumeration Complexity of Conjunctive Queries with Functional Dependencies
Authors:
Nofar Carmeli,
Markus Kröll
Abstract:
We study the complexity of enumerating the answers of Conjunctive Queries (CQs) in the presence of Functional Dependencies (FDs). Our focus is on the ability to list output tuples with a constant delay in between, following a linear-time preprocessing. A known dichotomy classifies the acyclic self-join-free CQs into those that admit such enumeration, and those that do not. However, this classification no longer holds in the common case where the database exhibits dependencies among attributes. That is, some queries that are classified as hard are in fact tractable if dependencies are accounted for. We establish a generalization of the dichotomy to accommodate FDs; hence, our classification determines which combination of a CQ and a set of FDs admits constant-delay enumeration with a linear-time preprocessing.
In addition, we generalize a hardness result for cyclic CQs to accommodate a common type of FDs. Further conclusions of our development include a dichotomy for enumeration with linear delay, and a dichotomy for CQs with disequalities. Finally, we show that all our results apply to the known class of "cardinality dependencies" that generalize FDs (e.g., by stating an upper bound on the number of genres per movie, or friends per person).
Submitted 26 September, 2021; v1 submitted 21 December, 2017;
originally announced December 2017.
-
Efficiently Enumerating Minimal Triangulations
Authors:
Nofar Carmeli,
Batya Kenig,
Benny Kimelfeld,
Markus Kröll
Abstract:
We present an algorithm that enumerates all the minimal triangulations of a graph in incremental polynomial time. Consequently, we get an algorithm for enumerating all the proper tree decompositions, in incremental polynomial time, where "proper" means that the tree decomposition cannot be improved by removing or splitting a bag. The algorithm can incorporate any method for (ordinary, single result) triangulation or tree decomposition, and can serve as an anytime algorithm to improve such a method. We describe an extensive experimental study of an implementation on real data from different fields. Our experiments show that the algorithm improves upon central quality measures over the underlying tree decompositions, and is able to produce a large number of high-quality decompositions.
Submitted 27 July, 2023; v1 submitted 11 April, 2016;
originally announced April 2016.