Search | arXiv e-print repository

Output-sensitive Conjunctive Query Evaluation

Authors: Shaleen Deep, Hangdong Zhao, Austen Z. Fan, Paraschos Koutris

Abstract: Join evaluation is one of the most fundamental operations performed by database systems and arguably the most well-studied problem in the Database community. A staggering number of join algorithms have been developed, and commercial database engines use finely tuned join heuristics that take into account many factors including the selectivity of predicates, memory, IO, etc. However, most of the re… ▽ More Join evaluation is one of the most fundamental operations performed by database systems and arguably the most well-studied problem in the Database community. A staggering number of join algorithms have been developed, and commercial database engines use finely tuned join heuristics that take into account many factors including the selectivity of predicates, memory, IO, etc. However, most of the results have catered to either full join queries or non-full join queries but with degree constraints (such as PK-FK relationships) that make joins \emph{easier} to evaluate. Further, most of the algorithms are also not output-sensitive. In this paper, we present a novel, output-sensitive algorithm for the evaluation of acyclic Conjunctive Queries (CQs) that contain arbitrary free variables. Our result is based on a novel generalization of the Yannakakis algorithm and shows that it is possible to improve the running time guarantee of the Yannakakis algorithm by a polynomial factor. Importantly, our algorithmic improvement does not depend on the use of fast matrix multiplication, as a recently proposed algorithm does. The upper bound is complemented with matching lower bounds conditioned on two variants of the $k$-clique conjecture. The application of our algorithm recovers known prior results and improves on known state-of-the-art results for common queries such as paths and stars. △ Less

Submitted 14 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: 22 pages

arXiv:2403.12436 [pdf, ps, other]

Evaluating Datalog over Semirings: A Grounding-based Approach

Authors: Hangdong Zhao, Shaleen Deep, Paraschos Koutris, Sudeepa Roy, Val Tannen

Abstract: Datalog is a powerful yet elegant language that allows expressing recursive computation. Although Datalog evaluation has been extensively studied in the literature, so far, only loose upper bounds are known on how fast a Datalog program can be evaluated. In this work, we ask the following question: given a Datalog program over a naturally-ordered semiring $σ$, what is the tightest possible runtime… ▽ More Datalog is a powerful yet elegant language that allows expressing recursive computation. Although Datalog evaluation has been extensively studied in the literature, so far, only loose upper bounds are known on how fast a Datalog program can be evaluated. In this work, we ask the following question: given a Datalog program over a naturally-ordered semiring $σ$, what is the tightest possible runtime? To this end, our main contribution is a general two-phase framework for analyzing the data complexity of Datalog over $σ$: first ground the program into an equivalent system of polynomial equations (i.e. grounding) and then find the least fixpoint of the grounding over $σ$. We present algorithms that use structure-aware query evaluation techniques to obtain the smallest possible groundings. Next, efficient algorithms for fixpoint evaluation are introduced over two classes of semirings: (1) finite-rank semirings and (2) absorptive semirings of total order. Combining both phases, we obtain state-of-the-art and new algorithmic results. Finally, we complement our results with a matching fine-grained lower bound. △ Less

Submitted 19 March, 2024; originally announced March 2024.

Comments: To appear at PODS 2024

arXiv:2310.19642 [pdf, other]

Consistent Query Answering for Primary Keys on Rooted Tree Queries

Authors: Paraschos Koutris, Xiating Ouyang, Jef Wijsen

Abstract: We study the data complexity of consistent query answering (CQA) on databases that may violate the primary key constraints. A repair is a maximal subset of the database satisfying the primary key constraints. For a Boolean query q, the problem CERTAINTY(q) takes a database as input, and asks whether or not each repair satisfies q. The computational complexity of CERTAINTY(q) has been established w… ▽ More We study the data complexity of consistent query answering (CQA) on databases that may violate the primary key constraints. A repair is a maximal subset of the database satisfying the primary key constraints. For a Boolean query q, the problem CERTAINTY(q) takes a database as input, and asks whether or not each repair satisfies q. The computational complexity of CERTAINTY(q) has been established whenever q is a self-join-free Boolean conjunctive query, or a (not necessarily self-join-free) Boolean path query. In this paper, we take one more step towards a general classification for all Boolean conjunctive queries by considering the class of rooted tree queries. In particular, we show that for every rooted tree query q, CERTAINTY(q) is in FO, NL-hard $\cap$ LFP, or coNP-complete, and it is decidable (in polynomial time), given q, which of the three cases applies. We also extend our classification to larger classes of queries with simple primary keys. Our classification criteria rely on query homomorphisms and our polynomial-time fixpoint algorithm is based on a novel use of context-free grammar (CFG). △ Less

Submitted 30 October, 2023; originally announced October 2023.

Comments: To appear in PODS'24

arXiv:2310.05385 [pdf, ps, other]

Conjunctive Queries with Negation and Aggregation: A Linear Time Characterization

Authors: Hangdong Zhao, Austen Z. Fan, Xiating Ouyang, Paraschos Koutris

Abstract: In this paper, we study the complexity of evaluating Conjunctive Queries with negation (\cqneg). First, we present an algorithm with linear preprocessing time and constant delay enumeration for a class of CQs with negation called free-connex signed-acyclic queries. We show that no other queries admit such an algorithm subject to lower bound conjectures. Second, we extend our algorithm to Conjuncti… ▽ More In this paper, we study the complexity of evaluating Conjunctive Queries with negation (\cqneg). First, we present an algorithm with linear preprocessing time and constant delay enumeration for a class of CQs with negation called free-connex signed-acyclic queries. We show that no other queries admit such an algorithm subject to lower bound conjectures. Second, we extend our algorithm to Conjunctive Queries with negation and aggregation over a general semiring, which we call Functional Aggregate Queries with negation (\faqneg). Such an algorithm achieves constant delay enumeration for the same class of queries, but with a slightly increased preprocessing time which includes an inverse Ackermann function. We show that this surprising appearance of the Ackermmann function is probably unavoidable for general semirings, but can be removed when the semiring has specific structure. Finally, we show an application of our results to computing the difference of CQs. △ Less

Submitted 8 October, 2023; originally announced October 2023.

Comments: 39 pages

arXiv:2309.15270 [pdf, other]

Consistent Query Answering for Primary Keys on Path Queries

Authors: Paraschos Koutris, Xiating Ouyang, Jef Wijsen

Abstract: We study the data complexity of consistent query answering (CQA) on databases that may violate the primary key constraints. A repair is a maximal consistent subset of the database. For a Boolean query $q$, the problem $\mathsf{CERTAINTY}(q)$ takes a database as input, and asks whether or not each repair satisfies $q$. It is known that for any self-join-free Boolean conjunctive query $q$,… ▽ More We study the data complexity of consistent query answering (CQA) on databases that may violate the primary key constraints. A repair is a maximal consistent subset of the database. For a Boolean query $q$, the problem $\mathsf{CERTAINTY}(q)$ takes a database as input, and asks whether or not each repair satisfies $q$. It is known that for any self-join-free Boolean conjunctive query $q$, $\mathsf{CERTAINTY}(q)$ is in $\mathbf{FO}$, $\mathbf{LSPACE}$-complete, or $\mathbf{coNP}$-complete. In particular, $\mathsf{CERTAINTY}(q)$ is in $\mathbf{FO}$ for any self-join-free Boolean path query $q$. In this paper, we show that if self-joins are allowed, the complexity of $\mathsf{CERTAINTY}(q)$ for Boolean path queries $q$ exhibits a tetrachotomy between $\mathbf{FO}$, $\mathbf{NL}$-complete, $\mathbf{PTIME}$-complete, and $\mathbf{coNP}$-complete. Moreover, it is decidable, in polynomial time in the size of the query~$q$, which of the four cases applies. △ Less

Submitted 26 September, 2023; originally announced September 2023.

Comments: An evolved version of a paper published at PODS'21

arXiv:2308.09284 [pdf, ps, other]

The Fine-Grained Complexity of CFL Reachability

Authors: Paraschos Koutris, Shaleen Deep

Abstract: Many problems in static program analysis can be modeled as the context-free language (CFL) reachability problem on directed labeled graphs. The CFL reachability problem can be generally solved in time $O(n^3)$, where $n$ is the number of vertices in the graph, with some specific cases that can be solved faster. In this work, we ask the following question: given a specific CFL, what is the exact ex… ▽ More Many problems in static program analysis can be modeled as the context-free language (CFL) reachability problem on directed labeled graphs. The CFL reachability problem can be generally solved in time $O(n^3)$, where $n$ is the number of vertices in the graph, with some specific cases that can be solved faster. In this work, we ask the following question: given a specific CFL, what is the exact exponent in the monomial of the running time? In other words, for which cases do we have linear, quadratic or cubic algorithms, and are there problems with intermediate runtimes? This question is inspired by recent efforts to classify classic problems in terms of their exact polynomial complexity, known as {\em fine-grained complexity}. Although recent efforts have shown some conditional lower bounds (mostly for the class of combinatorial algorithms), a general picture of the fine-grained complexity landscape for CFL reachability is missing. Our main contribution is lower bound results that pinpoint the exact running time of several classes of CFLs or specific CFLs under widely believed lower bound conjectures (Boolean Matrix Multiplication and $k$-Clique). We particularly focus on the family of Dyck-$k$ languages (which are strings with well-matched parentheses), a fundamental class of CFL reachability problems. We present new lower bounds for the case of sparse input graphs where the number of edges $m$ is the input parameter, a common setting in the database literature. For this setting, we show a cubic lower bound for Andersen's Pointer Analysis which significantly strengthens prior known results. △ Less

Submitted 17 August, 2023; originally announced August 2023.

Comments: Appeared in POPL 2023. Please note the erratum on the first page

arXiv:2307.15255 [pdf, other]

Predicate Transfer: Efficient Pre-Filtering on Multi-Join Queries

Authors: Yifei Yang, Hangdong Zhao, Xiangyao Yu, Paraschos Koutris

Abstract: This paper presents predicate transfer, a novel method that optimizes join performance by pre-filtering tables to reduce the join input sizes. Predicate transfer generalizes Bloom join, which conducts pre-filtering within a single join operation, to multi-table joins such that the filtering benefits can be significantly increased. Predicate transfer is inspired by the seminal theoretical results b… ▽ More This paper presents predicate transfer, a novel method that optimizes join performance by pre-filtering tables to reduce the join input sizes. Predicate transfer generalizes Bloom join, which conducts pre-filtering within a single join operation, to multi-table joins such that the filtering benefits can be significantly increased. Predicate transfer is inspired by the seminal theoretical results by Yannakakis, which uses semi-joins to pre-filter acyclic queries. Predicate transfer generalizes the theoretical results to any join graphs and use Bloom filters to replace semi-joins leading to significant speedup. Evaluation shows predicate transfer can outperform Bloom join by 3.1x on average on TPC-H benchmark. △ Less

Submitted 27 July, 2023; originally announced July 2023.

Comments: 6 pages, 4 figures

arXiv:2304.14557 [pdf, other]

The Fine-Grained Complexity of Boolean Conjunctive Queries and Sum-Product Problems

Authors: Austen Z. Fan, Paraschos Koutris, Hangdong Zhao

Abstract: We study the fine-grained complexity of evaluating Boolean Conjunctive Queries and their generalization to sum-of-product problems over an arbitrary semiring. For these problems, we present a general semiring-oblivious reduction from the k-clique problem to any query structure (hypergraph). Our reduction uses the notion of embedding a graph to a hypergraph, first introduced by Marx. As a consequen… ▽ More We study the fine-grained complexity of evaluating Boolean Conjunctive Queries and their generalization to sum-of-product problems over an arbitrary semiring. For these problems, we present a general semiring-oblivious reduction from the k-clique problem to any query structure (hypergraph). Our reduction uses the notion of embedding a graph to a hypergraph, first introduced by Marx. As a consequence of our reduction, we can show tight conditional lower bounds for many classes of hypergraphs, including cycles, Loomis-Whitney joins, some bipartite graphs, and chordal graphs. These lower bounds have a dependence on what we call the clique embedding power of a hypergraph H, which we believe is a quantity of independent interest. We show that the clique embedding power is always less than the submodular width of the hypergraph, and present a decidable algorithm for computing it. We conclude with many open problems for future research. △ Less

Submitted 10 May, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

Comments: To appear in ICALP'23; 23 pages; comments welcome

arXiv:2304.06221 [pdf, ps, other]

doi 10.1145/3584372.3588675

Space-Time Tradeoffs for Conjunctive Queries with Access Patterns

Authors: Hangdong Zhao, Shaleen Deep, Paraschos Koutris

Abstract: In this paper, we investigate space-time tradeoffs for answering conjunctive queries with access patterns (CQAPs). The goal is to create a space-efficient data structure in an initial preprocessing phase and use it for answering (multiple) queries in an online phase. Previous work has developed data structures that trades off space usage for answering time for queries of practical interest, such a… ▽ More In this paper, we investigate space-time tradeoffs for answering conjunctive queries with access patterns (CQAPs). The goal is to create a space-efficient data structure in an initial preprocessing phase and use it for answering (multiple) queries in an online phase. Previous work has developed data structures that trades off space usage for answering time for queries of practical interest, such as the path and triangle query. However, these approaches lack a comprehensive framework and are not generalizable. Our main contribution is a general algorithmic framework for obtaining space-time tradeoffs for any CQAP. Our framework builds upon the $\PANDA$ algorithm and tree decomposition techniques. We demonstrate that our framework captures all state-of-the-art tradeoffs that were independently produced for various queries. Further, we show surprising improvements over the state-of-the-art tradeoffs known in the existing literature for reachability queries. △ Less

Submitted 2 May, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

arXiv:2303.04811 [pdf, other]

Naive Bayes Classifiers over Missing Data: Decision and Poisoning

Authors: Song Bian, Xiating Ouyang, Zhiwei Fan, Paraschos Koutris

Abstract: We study the certifiable robustness of ML classifiers on dirty datasets that could contain missing values. A test point is certifiably robust for an ML classifier if the classifier returns the same prediction for that test point, regardless of which cleaned version (among exponentially many) of the dirty dataset the classifier is trained on. In this paper, we show theoretically that for Naive Baye… ▽ More We study the certifiable robustness of ML classifiers on dirty datasets that could contain missing values. A test point is certifiably robust for an ML classifier if the classifier returns the same prediction for that test point, regardless of which cleaned version (among exponentially many) of the dirty dataset the classifier is trained on. In this paper, we show theoretically that for Naive Bayes Classifiers (NBC) over dirty datasets with missing values: (i) there exists an efficient polynomial time algorithm to decide whether multiple input test points are all certifiably robust over a dirty dataset; and (ii) the data poisoning attack, which aims to make all input test points certifiably non-robust by inserting missing cells to the clean dataset, is in polynomial time for single test points but NP-complete for multiple test points. Extensive experiments demonstrate that our algorithms are efficient and outperform existing baselines. △ Less

Submitted 28 May, 2024; v1 submitted 7 March, 2023; originally announced March 2023.

Comments: 22 pages, 10 figures

Journal ref: ICML 2024

arXiv:2208.12339 [pdf, other]

LinCQA: Faster Consistent Query Answering with Linear Time Guarantees

Authors: Zhiwei Fan, Paraschos Koutris, Xiating Ouyang, Jef Wijsen

Abstract: Most data analytical pipelines often encounter the problem of querying inconsistent data that violate pre-determined integrity constraints. Data cleaning is an extensively studied paradigm that singles out a consistent repair of the inconsistent data. Consistent query answering (CQA) is an alternative approach to data cleaning that asks for all tuples guaranteed to be returned by a given query on… ▽ More Most data analytical pipelines often encounter the problem of querying inconsistent data that violate pre-determined integrity constraints. Data cleaning is an extensively studied paradigm that singles out a consistent repair of the inconsistent data. Consistent query answering (CQA) is an alternative approach to data cleaning that asks for all tuples guaranteed to be returned by a given query on all (in most cases, exponentially many) repairs of the inconsistent data. This paper identifies a class of acyclic select-project-join (SPJ) queries for which CQA can be solved via SQL rewriting with a linear time guarantee. Our rewriting method can be viewed as a generalization of Yannakakis's algorithm for acyclic joins to the inconsistent setting. We present LinCQA, a system that can output rewritings in both SQL and non-recursive Datalog rules for every query in this class. We show that LinCQA often outperforms the existing CQA systems on both synthetic and real-world workloads, and in some cases, by orders of magnitude. △ Less

Submitted 25 August, 2022; originally announced August 2022.

arXiv:2201.05566 [pdf, other]

Ranked Enumeration of Join Queries with Projections

Authors: Shaleen Deep, Xiao Hu, Paraschos Koutris

Abstract: Join query evaluation with ordering is a fundamental data processing task in relational database management systems. SQL and custom graph query languages such as Cypher offer this functionality by allowing users to specify the order via the ORDER BY clause. In many scenarios, the users also want to see the first $k$ results quickly (expressed by the LIMIT clause), but the value of $k$ is not prede… ▽ More Join query evaluation with ordering is a fundamental data processing task in relational database management systems. SQL and custom graph query languages such as Cypher offer this functionality by allowing users to specify the order via the ORDER BY clause. In many scenarios, the users also want to see the first $k$ results quickly (expressed by the LIMIT clause), but the value of $k$ is not predetermined as user queries are arriving in an online fashion. Recent work has made considerable progress in identifying optimal algorithms for ranked enumeration of join queries that do not contain any projections. In this paper, we initiate the study of the problem of enumerating results in ranked order for queries with projections. Our main result shows that for any acyclic query, it is possible to obtain a near-linear (in the size of the database) delay algorithm after only a linear time preprocessing step for two important ranking functions: sum and lexicographic ordering. For a practical subset of acyclic queries known as star queries, we show an even stronger result that allows a user to obtain a smooth tradeoff between faster answering time guarantees using more preprocessing time. Our results are also extensible to queries containing cycles and unions. We also perform a comprehensive experimental evaluation to demonstrate that our algorithms, which are simple to implement, improve up to three orders of magnitude in the running time over state-of-the-art algorithms implemented within open-source RDBMS and specialized graph databases. △ Less

Submitted 22 January, 2022; v1 submitted 14 January, 2022; originally announced January 2022.

Comments: Accepted at VLDB 2022. Comments and suggestions are always welcome

arXiv:2201.04770 [pdf, other]

Certifiable Robustness for Nearest Neighbor Classifiers

Authors: Austen Z. Fan, Paraschos Koutris

Abstract: ML models are typically trained using large datasets of high quality. However, training datasets often contain inconsistent or incomplete data. To tackle this issue, one solution is to develop algorithms that can check whether a prediction of a model is certifiably robust. Given a learning algorithm that produces a classifier and given an example at test time, a classification outcome is certifiab… ▽ More ML models are typically trained using large datasets of high quality. However, training datasets often contain inconsistent or incomplete data. To tackle this issue, one solution is to develop algorithms that can check whether a prediction of a model is certifiably robust. Given a learning algorithm that produces a classifier and given an example at test time, a classification outcome is certifiably robust if it is predicted by every model trained across all possible worlds (repairs) of the uncertain (inconsistent) dataset. This notion of robustness falls naturally under the framework of certain answers. In this paper, we study the complexity of certifying robustness for a simple but widely deployed classification algorithm, $k$-Nearest Neighbors ($k$-NN). Our main focus is on inconsistent datasets when the integrity constraints are functional dependencies (FDs). For this setting, we establish a dichotomy in the complexity of certifying robustness w.r.t. the set of FDs: the problem either admits a polynomial time algorithm, or it is coNP-hard. Additionally, we exhibit a similar dichotomy for the counting version of the problem, where the goal is to count the number of possible worlds that predict a certain label. As a byproduct of our study, we also establish the complexity of a problem related to finding an optimal subset repair that may be of independent interest. △ Less

Submitted 17 January, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

Comments: Accepted to ICDT'22

arXiv:2109.10889 [pdf, ps, other]

General Space-Time Tradeoffs via Relational Queries

Authors: Shaleen Deep, Xiao Hu, Paraschos Koutris

Abstract: In this paper, we investigate space-time tradeoffs for answering Boolean conjunctive queries. The goal is to create a data structure in an initial preprocessing phase and use it for answering (multiple) queries. Previous work has developed data structures that trade off space usage for answering time and has proved conditional space lower bounds for queries of practical interest such as the path a… ▽ More In this paper, we investigate space-time tradeoffs for answering Boolean conjunctive queries. The goal is to create a data structure in an initial preprocessing phase and use it for answering (multiple) queries. Previous work has developed data structures that trade off space usage for answering time and has proved conditional space lower bounds for queries of practical interest such as the path and triangle query. However, most of these results cater to only those queries, lack a comprehensive framework, and are not generalizable. The isolated treatment of these queries also fails to utilize the connections with extensive research on related problems within the database community. The key insight in this work is to exploit the formalism of relational algebra by casting the problems as answering join queries over a relational database. Using the notion of boolean {\em adorned queries} and {\em access patterns}, we propose a unified framework that captures several widely studied algorithmic problems. Our main contribution is three-fold. First, we present an algorithm that recovers existing space-time tradeoffs for several problems. The algorithm is based on an application of the {\em join size bound} to capture the space usage of our data structure. We combine our data structure with {\em query decomposition} techniques to further improve the tradeoffs and show that it is readily extensible to queries with negation. Second, we falsify two proposed conjectures in the existing literature related to the space-time lower bound for path queries and triangle detection for which we show unexpectedly better algorithms. This result opens a new avenue for improving several algorithmic results that have so far been assumed to be (conditionally) optimal. Finally, we prove new conditional space-time lower bounds for star and path queries. △ Less

Submitted 13 August, 2023; v1 submitted 22 September, 2021; originally announced September 2021.

Comments: Appeared in WADS 2023. Comments and suggestions are always welcome

arXiv:2101.03712 [pdf, other]

Enumeration Algorithms for Conjunctive Queries with Projection

Authors: Shaleen Deep, Xiao Hu, Paraschos Koutris

Abstract: We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees after a preprocessing phase. Our main contribution is a series of results based on the idea of interleaving precomputed output with further join processing to ma… ▽ More We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees after a preprocessing phase. Our main contribution is a series of results based on the idea of interleaving precomputed output with further join processing to maintain delay guarantees, which maybe of independent interest. In particular, for star queries, we design combinatorial algorithms that provide instance-specific delay guarantees in linear preprocessing time. These algorithms improve upon the currently best known results. Further, we show how existing results can be improved upon by using fast matrix multiplication. We also present new results involving tradeoff between preprocessing time and delay guarantees for enumeration of path queries that contain projections. Boolean matrix multiplication is an important query that can be expressed as a CQ with projection where the join attribute is projected away. Our results can therefore also be interpreted as sparse, output-sensitive matrix multiplication with delay guarantees. △ Less

Submitted 31 October, 2021; v1 submitted 11 January, 2021; originally announced January 2021.

Comments: journal version for LMCS

arXiv:2011.05549 [pdf, other]

Comprehensive and Efficient Workload Compression

Authors: Shaleen Deep, Anja Gruenheid, Paraschos Koutris, Jeffrey Naughton, Stratis Viglas

Abstract: This work studies the problem of constructing a representative workload from a given input analytical query workload where the former serves as an approximation with guarantees of the latter. We discuss our work in the context of workload analysis and monitoring. As an example, evolving system usage patterns in a database system can cause load imbalance and performance regressions which can be con… ▽ More This work studies the problem of constructing a representative workload from a given input analytical query workload where the former serves as an approximation with guarantees of the latter. We discuss our work in the context of workload analysis and monitoring. As an example, evolving system usage patterns in a database system can cause load imbalance and performance regressions which can be controlled by monitoring system usage patterns, i.e.,~a representative workload, over time. To construct such a workload in a principled manner, we formalize the notions of workload {\em representativity} and {\em coverage}. These metrics capture the intuition that the distribution of features in a compressed workload should match a target distribution, increasing representativity, and include common queries as well as outliers, increasing coverage. We show that solving this problem optimally is NP-hard and present a novel greedy algorithm that provides approximation guarantees. We compare our techniques to established algorithms in this problem space such as sampling and clustering, and demonstrate advantages and key trade-offs of our techniques. △ Less

Submitted 3 February, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

arXiv:2009.11463 [pdf, other]

Algorithms for a Topology-aware Massively Parallel Computation Model

Authors: Xiao Hu, Paraschos Koutris, Spyros Blanas

Abstract: Most of the prior work in massively parallel data processing assumes homogeneity, i.e., every computing unit has the same computational capability, and can communicate with every other unit with the same latency and bandwidth. However, this strong assumption of a uniform topology rarely holds in practical settings, where computing units are connected through complex networks. To address this issue… ▽ More Most of the prior work in massively parallel data processing assumes homogeneity, i.e., every computing unit has the same computational capability, and can communicate with every other unit with the same latency and bandwidth. However, this strong assumption of a uniform topology rarely holds in practical settings, where computing units are connected through complex networks. To address this issue, Blanas et al. recently proposed a topology-aware massively parallel computation model that integrates the network structure and heterogeneity in the modeling cost. The network is modeled as a directed graph, where each edge is associated with a cost function that depends on the data transferred between the two endpoints. The computation proceeds in synchronous rounds, and the cost of each round is measured as the maximum cost over all the edges in the network. In this work, we take the first step into investigating three fundamental data processing tasks in this topology-aware parallel model: set intersection, cartesian product, and sorting. We focus on network topologies that are tree topologies, and present both lower bounds, as well as (asymptotically) matching upper bounds. The optimality of our algorithms is with respect to the initial data distribution among the network nodes, instead of assuming worst-case distribution as in previous results. Apart from the theoretical optimality of our results, our protocols are simple, use a constant number of rounds, and we believe can be implemented in practical settings as well. △ Less

Submitted 23 September, 2020; originally announced September 2020.

arXiv:2005.08439 [pdf, other]

A Comparative Exploration of ML Techniques for Tuning Query Degree of Parallelism

Authors: Zhiwei Fan, Rathijit Sen, Paraschos Koutris, Aws Albarghouthi

Abstract: There is a large body of recent work applying machine learning (ML) techniques to query optimization and query performance prediction in relational database management systems (RDBMSs). However, these works typically ignore the effect of \textit{intra-parallelism} -- a key component used to boost the performance of OLAP queries in practice -- on query performance prediction. In this paper, we take… ▽ More There is a large body of recent work applying machine learning (ML) techniques to query optimization and query performance prediction in relational database management systems (RDBMSs). However, these works typically ignore the effect of \textit{intra-parallelism} -- a key component used to boost the performance of OLAP queries in practice -- on query performance prediction. In this paper, we take a first step towards filling this gap by studying the problem of \textit{tuning the degree of parallelism (DOP) via ML techniques} in Microsoft SQL Server, a popular commercial RDBMS that allows an individual query to execute using multiple cores. In our study, we cast the problem of DOP tuning as a {\em regression} task, and examine how several popular ML models can help with query performance prediction in a multi-core setting. We explore the design space and perform an extensive experimental study comparing different models against a list of performance metrics, testing how well they generalize in different settings: $(i)$ to queries from the same template, $(ii)$ to queries from a new template, $(iii)$ to instances of different scale, and $(iv)$ to different instances and queries. Our experimental results show that a simple featurization of the input query plan that ignores cost model estimations can accurately predict query performance, capture the speedup trend with respect to the available parallelism, as well as help with automatically choosing an optimal per-query DOP. △ Less

Submitted 21 May, 2020; v1 submitted 17 May, 2020; originally announced May 2020.

arXiv:2002.12459 [pdf, other]

Fast Join Project Query Evaluation using Matrix Multiplication

Authors: Shaleen Deep, Xiao Hu, Paraschos Koutris

Abstract: In the last few years, much effort has been devoted to develo** join algorithms in order to achieve worst-case optimality for join queries over relational databases. Towards this end, the database community has had considerable success in develo** succinct algorithms that achieve worst-case optimal runtime for full join queries, i.e the join is over all variables present in the input database.… ▽ More In the last few years, much effort has been devoted to develo** join algorithms in order to achieve worst-case optimality for join queries over relational databases. Towards this end, the database community has had considerable success in develo** succinct algorithms that achieve worst-case optimal runtime for full join queries, i.e the join is over all variables present in the input database. However, not much is known about join evaluation with {\em projections} beyond some simple techniques of pushing down the projection operator in the query execution plan. Such queries have a large number of applications in entity matching, graph analytics and searching over compressed graphs. In this paper, we study how a class of join queries with projections can be evaluated faster using worst-case optimal algorithms together with matrix multiplication. Crucially, our algorithms are parameterized by the output size of the final result, allowing for choice of the best execution strategy. We implement our algorithms as a subroutine and compare the performance with state-of-the-art techniques to show they can be improved upon by as much as 50x. More importantly, our experiments indicate that matrix multiplication is a useful operation that can help speed up join processing owing to highly optimized open source libraries that are also highly parallelizable. △ Less

Submitted 27 February, 2020; originally announced February 2020.

arXiv:2002.01531 [pdf, other]

Providing Insights for Queries affected by Failures and Stragglers

Authors: Bruhathi Sundarmurthy, Harshad Deshmukh, Paris Koutris, Jeffrey Naughton

Abstract: Interactive time responses are a crucial requirement for users analyzing large amounts of data. Such analytical queries are typically run in a distributed setting, with data being sharded across thousands of nodes for high throughput. However, providing real-time analytics is still a very big challenge; with data distributed across thousands of nodes, the probability that some of the required node… ▽ More Interactive time responses are a crucial requirement for users analyzing large amounts of data. Such analytical queries are typically run in a distributed setting, with data being sharded across thousands of nodes for high throughput. However, providing real-time analytics is still a very big challenge; with data distributed across thousands of nodes, the probability that some of the required nodes are unavailable or very slow during query execution is very high and unavailability may result in slow execution or even failures. The sheer magnitude of data and users increase resource contention and this exacerbates the phenomenon of stragglers and node failures during execution. In this paper, we propose a novel solution to alleviate the straggler/failure problem that exploits existing efficient partitioning properties of the data, particularly, co-hash partitioned data, and provides approximate answers along with confidence bounds to queries affected by failed/straggler nodes. We consider aggregate queries that involve joins, group bys, having clauses and a subclass of nested subqueries. Finally, we validate our approach through extensive experiments on the TPC-H dataset. △ Less

Submitted 4 February, 2020; originally announced February 2020.

arXiv:1909.00845 [pdf, other]

Revenue Maximization for Query Pricing

Authors: Shuchi Chawla, Shaleen Deep, Paraschos Koutris, Yifeng Teng

Abstract: Buying and selling of data online has increased substantially over the last few years. Several frameworks have already been proposed that study query pricing in theory and practice. The key guiding principle in these works is the notion of {\em arbitrage-freeness} where the broker can set different prices for different queries made to the dataset, but must ensure that the pricing function does not… ▽ More Buying and selling of data online has increased substantially over the last few years. Several frameworks have already been proposed that study query pricing in theory and practice. The key guiding principle in these works is the notion of {\em arbitrage-freeness} where the broker can set different prices for different queries made to the dataset, but must ensure that the pricing function does not provide the buyers with opportunities for arbitrage. However, little is known about revenue maximization aspect of query pricing. In this paper, we study the problem faced by a broker selling access to data with the goal of maximizing her revenue. We show that this problem can be formulated as a revenue maximization problem with single-minded buyers and unlimited supply, for which several approximation algorithms are known. We perform an extensive empirical evaluation of the performance of several pricing algorithms for the query pricing problem on real-world instances. In addition to previously known approximation algorithms, we propose several new heuristics and analyze them both theoretically and experimentally. Our experiments show that algorithms with the best theoretical bounds are not necessarily the best empirically. We identify algorithms and heuristics that are both fast and also provide consistently good performance when valuations are drawn from a variety of distributions. △ Less

Submitted 9 September, 2019; v1 submitted 2 September, 2019; originally announced September 2019.

Comments: To appear in PVLDB; version 2 with some cosmetic changes

arXiv:1902.02698 [pdf, ps, other]

Ranked Enumeration of Conjunctive Query Results

Authors: Shaleen Deep, Paraschos Koutris

Abstract: We investigate the enumeration of top-k answers for conjunctive queries against relational databases according to a given ranking function. The task is to design data structures and algorithms that allow for efficient enumeration after a preprocessing phase. Our main contribution is a novel priority queue based algorithm with near-optimal delay and non-trivial space guarantees that are output sens… ▽ More We investigate the enumeration of top-k answers for conjunctive queries against relational databases according to a given ranking function. The task is to design data structures and algorithms that allow for efficient enumeration after a preprocessing phase. Our main contribution is a novel priority queue based algorithm with near-optimal delay and non-trivial space guarantees that are output sensitive and depend on structure of the query. In particular, we exploit certain desirable properties of ranking functions that frequently occur in practice and degree information in the database instance, allowing for efficient enumeration. We introduce the notion of {\em decomposable} and {\em compatible} ranking functions in conjunction with query decomposition, a property that allows for partial aggregation of tuple scores in order to efficiently enumerate the ranked output. We complement the algorithmic results with lower bounds justifying why certain assumptions about properties of ranking functions are necessary and discuss popular conjectures providing evidence for optimality of enumeration delay guarantees. Our results extend and improve upon a long line of work that has studied ranked enumeration from both theoretical and practical perspective. △ Less

Submitted 31 October, 2021; v1 submitted 7 February, 2019; originally announced February 2019.

Comments: LMCS journal submission

arXiv:1812.03975 [pdf, other]

Scaling-Up In-Memory Datalog Processing: Observations and Techniques

Authors: Zhiwei Fan, Jianqiao Zhu, Zuyu Zhang, Aws Albarghouthi, Paraschos Koutris, Jignesh Patel

Abstract: Recursive query processing has experienced a recent resurgence, as a result of its use in many modern application domains, including data integration, graph analytics, security, program analysis, networking and decision making. Due to the large volumes of data being processed, several research efforts, across multiple communities, have explored how to scale up recursive queries, typically expresse… ▽ More Recursive query processing has experienced a recent resurgence, as a result of its use in many modern application domains, including data integration, graph analytics, security, program analysis, networking and decision making. Due to the large volumes of data being processed, several research efforts, across multiple communities, have explored how to scale up recursive queries, typically expressed in Datalog. Our experience with these tools indicated that their performance does not translate across domains (e.g., a tool design for large-scale graph analytics does not exhibit the same performance on program-analysis tasks, and vice versa). As a result, we designed and implemented a general-purpose Datalog engine, called RecStep, on top of a parallel single-node relational system. In this paper, we outline the different techniques we use in RecStep, and the contribution of each technique to overall performance. We also present results from a detailed set of experiments comparing RecStep with a number of other Datalog systems using both graph analytics and program-analysis tasks, summarizing pros and cons of existing techniques based on the analysis of our observations. We show that RecStep generally outperforms the state-of-the-art parallel Datalog engines on complex and large-scale Datalog program evaluation, by a 4-6X margin. An additional insight from our work is that we show that it is possible to build a high-performance Datalog system on top of a relational engine, an idea that has been dismissed in past work in this area. △ Less

Submitted 10 December, 2018; originally announced December 2018.

arXiv:1810.03386 [pdf, other]

Consistent Query Answering for Primary Keys in Logspace

Authors: Paraschos Koutris, Jef Wijsen

Abstract: We study the complexity of consistent query answering on databases that may violate primary key constraints. A repair of such a database is any consistent database that can be obtained by deleting a minimal set of tuples. For every Boolean query q, CERTAINTY(q) is the problem that takes a database as input and asks whether q evaluates to true on every repair. In [KW17], the authors show that for e… ▽ More We study the complexity of consistent query answering on databases that may violate primary key constraints. A repair of such a database is any consistent database that can be obtained by deleting a minimal set of tuples. For every Boolean query q, CERTAINTY(q) is the problem that takes a database as input and asks whether q evaluates to true on every repair. In [KW17], the authors show that for every self-join-free Boolean conjunctive query q, the problem CERTAINTY(q) is either in P or coNP-complete, and it is decidable which of the two cases applies. In this paper, we sharpen this result by showing that for every self-join-free Boolean conjunctive query q, the problem CERTAINTY(q) is either expressible in symmetric stratified Datalog or coNP-complete. Since symmetric stratified Datalog is in L, we thus obtain a complexity-theoretic dichotomy between L and coNP-complete. Another new finding of practical importance is that CERTAINTY(q) is on the logspace side of the dichotomy for queries q where all join conditions express foreign-to-primary key matches, which is undoubtedly the most common type of join condition. △ Less

Submitted 8 October, 2018; originally announced October 2018.

Comments: 51 pages

arXiv:1809.08738 [pdf, other]

Scalable inference of topic evolution via models for latent geometric structures

Authors: Mikhail Yurochkin, Zhiwei Fan, Aritra Guha, Paraschos Koutris, XuanLong Nguyen

Abstract: We develop new models and algorithms for learning the temporal dynamics of the topic polytopes and related geometric objects that arise in topic model based inference. Our model is nonparametric Bayesian and the corresponding inference algorithm is able to discover new topics as the time progresses. By exploiting the connection between the modeling of topic polytope evolution, Beta-Bernoulli proce… ▽ More We develop new models and algorithms for learning the temporal dynamics of the topic polytopes and related geometric objects that arise in topic model based inference. Our model is nonparametric Bayesian and the corresponding inference algorithm is able to discover new topics as the time progresses. By exploiting the connection between the modeling of topic polytope evolution, Beta-Bernoulli process and the Hungarian matching algorithm, our method is shown to be several orders of magnitude faster than existing topic modeling approaches, as demonstrated by experiments working with several million documents in under two dozens of minutes. △ Less

Submitted 1 November, 2019; v1 submitted 23 September, 2018; originally announced September 2018.

Comments: NeurIPS 2019

arXiv:1806.03791 [pdf, other]

The Effect of Network Width on the Performance of Large-batch Training

Authors: Lingjiao Chen, Hongyi Wang, **man Zhao, Dimitris Papailiopoulos, Paraschos Koutris

Abstract: Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, large batches can affect the convergence properties and generalization performance of SGD. In this work, we take a first step towards a… ▽ More Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, large batches can affect the convergence properties and generalization performance of SGD. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that--for a fixed number of parameters--wider networks are more amenable to fast large-batch training compared to deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants. △ Less

Submitted 10 June, 2018; originally announced June 2018.

arXiv:1805.11450 [pdf, other]

Model-based Pricing for Machine Learning in a Data Marketplace

Authors: Lingjiao Chen, Paraschos Koutris, Arun Kumar

Abstract: Data analytics using machine learning (ML) has become ubiquitous in science, business intelligence, journalism and many other domains. While a lot of work focuses on reducing the training cost, inference runtime and storage cost of ML models, little work studies how to reduce the cost of data acquisition, which potentially leads to a loss of sellers' revenue and buyers' affordability and efficienc… ▽ More Data analytics using machine learning (ML) has become ubiquitous in science, business intelligence, journalism and many other domains. While a lot of work focuses on reducing the training cost, inference runtime and storage cost of ML models, little work studies how to reduce the cost of data acquisition, which potentially leads to a loss of sellers' revenue and buyers' affordability and efficiency. In this paper, we propose a model-based pricing (MBP) framework, which instead of pricing the data, directly prices ML model instances. We first formally describe the desired properties of the MBP framework, with a focus on avoiding arbitrage. Next, we show a concrete realization of the MBP framework via a noise injection approach, which provably satisfies the desired formal properties. Based on the proposed framework, we then provide algorithmic solutions on how the seller can assign prices to models under different market scenarios (such as to maximize revenue). Finally, we conduct extensive experiments, which validate that the MBP framework can provide high revenue to the seller, high affordability to the buyer, and also operate on low runtime cost. △ Less

Submitted 26 May, 2018; originally announced May 2018.

arXiv:1709.06186 [pdf, ps, other]

Compressed Representations of Conjunctive Query Results

Authors: Shaleen Deep, Paraschos Koutris

Abstract: Relational queries, and in particular join queries, often generate large output results when executed over a huge dataset. In such cases, it is often infeasible to store the whole materialized output if we plan to reuse it further down a data processing pipeline. Motivated by this problem, we study the construction of space-efficient compressed representations of the output of conjunctive queries,… ▽ More Relational queries, and in particular join queries, often generate large output results when executed over a huge dataset. In such cases, it is often infeasible to store the whole materialized output if we plan to reuse it further down a data processing pipeline. Motivated by this problem, we study the construction of space-efficient compressed representations of the output of conjunctive queries, with the goal of supporting the efficient access of the intermediate compressed result for a given access pattern. In particular, we initiate the study of an important tradeoff: minimizing the space necessary to store the compressed result, versus minimizing the answer time and delay for an access request over the result. Our main contribution is a novel parameterized data structure, which can be tuned to trade off space for answer time. The tradeoff allows us to control the space requirement of the data structure precisely, and depends both on the structure of the query and the access pattern. We show how we can use the data structure in conjunction with query decomposition techniques, in order to efficiently represent the outputs for several classes of conjunctive queries. △ Less

Submitted 27 March, 2018; v1 submitted 18 September, 2017; originally announced September 2017.

Comments: To appear in PODS'18; 35 pages; comments welcome

arXiv:1606.09376 [pdf, ps, other]

The Design of Arbitrage-Free Data Pricing Schemes

Authors: Shaleen Deep, Paraschos Koutris

Abstract: Motivated by a growing market that involves buying and selling data over the web, we study pricing schemes that assign value to queries issued over a database. Previous work studied pricing mechanisms that compute the price of a query by extending a data seller's explicit prices on certain queries, or investigated the properties that a pricing function should exhibit without detailing a generic co… ▽ More Motivated by a growing market that involves buying and selling data over the web, we study pricing schemes that assign value to queries issued over a database. Previous work studied pricing mechanisms that compute the price of a query by extending a data seller's explicit prices on certain queries, or investigated the properties that a pricing function should exhibit without detailing a generic construction. In this work, we present a formal framework for pricing queries over data that allows the construction of general families of pricing functions, with the main goal of avoiding arbitrage. We consider two types of pricing schemes: instance-independent schemes, where the price depends only on the structure of the query, and answer-dependent schemes, where the price also depends on the query output. Our main result is a complete characterization of the structure of pricing functions in both settings, by relating it to properties of a function over a lattice. We use our characterization, together with information-theoretic methods, to construct a variety of arbitrage-free pricing functions. Finally, we discuss various tradeoffs in the design space and present techniques for efficient computation of the proposed pricing functions. △ Less

Submitted 30 June, 2016; originally announced June 2016.

Comments: full paper

arXiv:1604.01848 [pdf, other]

Worst-Case Optimal Algorithms for Parallel Query Processing

Authors: Paul Beame, Paraschos Koutris, Dan Suciu

Abstract: In this paper, we study the communication complexity for the problem of computing a conjunctive query on a large database in a parallel setting with $p$ servers. In contrast to previous work, where upper and lower bounds on the communication were specified for particular structures of data (either data without skew, or data with specific types of skew), in this work we focus on worst-case analysis… ▽ More In this paper, we study the communication complexity for the problem of computing a conjunctive query on a large database in a parallel setting with $p$ servers. In contrast to previous work, where upper and lower bounds on the communication were specified for particular structures of data (either data without skew, or data with specific types of skew), in this work we focus on worst-case analysis of the communication cost. The goal is to find worst-case optimal parallel algorithms, similar to the work of [18] for sequential algorithms. We first show that for a single round we can obtain an optimal worst-case algorithm. The optimal load for a conjunctive query $q$ when all relations have size equal to $M$ is $O(M/p^{1/ψ^*})$, where $ψ^*$ is a new query-related quantity called the edge quasi-packing number, which is different from both the edge packing number and edge cover number of the query hypergraph. For multiple rounds, we present algorithms that are optimal for several classes of queries. Finally, we show a surprising connection to the external memory model, which allows us to translate parallel algorithms to external memory algorithms. This technique allows us to recover (within a polylogarithmic factor) several recent results on the I/O complexity for computing join queries, and also obtain optimal algorithms for other classes of queries. △ Less

Submitted 6 April, 2016; originally announced April 2016.

arXiv:1603.08393 [pdf, other]

$k$-shot Broadcasting in Ad Hoc Radio Networks

Authors: Sushanta Karmakar, Paraschos Koutris, Aris Pagourtzis, Dimitris Sakavalas

Abstract: We study distributed broadcasting protocols with few transmissions (`shots') in radio networks where the topology is unknown. In particular, we examine the case in which a bound $k$ is given and a node may transmit at most $k$ times during the broadcasting protocol. Initially, we focus on oblivious algorithms for $k$-shot broadcasting, that is, algorithms where each node decides whether to transmi… ▽ More We study distributed broadcasting protocols with few transmissions (`shots') in radio networks where the topology is unknown. In particular, we examine the case in which a bound $k$ is given and a node may transmit at most $k$ times during the broadcasting protocol. Initially, we focus on oblivious algorithms for $k$-shot broadcasting, that is, algorithms where each node decides whether to transmit or not with no consideration of the transmission history. Our main contributions are (a) a lower bound of $Ω(n^2/k)$ on the broadcasting time of any oblivious $k$-shot broadcasting algorithm and (b) an oblivious broadcasting protocol that achieves a matching upper bound, namely $O(n^2/k)$, for every $k \le \sqrt{n}$ and an upper bound of $O(n^{3/2})$ for every $k > \sqrt{n}$. We also study the general case of adaptive broadcasting protocols where nodes decide whether to transmit based on all the available information, namely the transmission history known by each. We prove a lower bound of $Ω\left(n^{\frac{1+k}{k}}\right)$ on the broadcasting time of any protocol by introducing the \emph{transmission tree} construction which generalizes previous approaches. △ Less

Submitted 28 March, 2016; originally announced March 2016.

Comments: 21 pages, 2 figures, preliminary version presented in CATS 2011: 161-168

arXiv:1602.06236 [pdf, other]

Communication Cost in Parallel Query Processing

Authors: Paul Beame, Paraschos Koutris, Dan Suciu

Abstract: We study the problem of computing conjunctive queries over large databases on parallel architectures without shared storage. Using the structure of such a query $q$ and the skew in the data, we study tradeoffs between the number of processors, the number of rounds of communication, and the per-processor load -- the number of bits each processor can send or can receive in a single round -- that are… ▽ More We study the problem of computing conjunctive queries over large databases on parallel architectures without shared storage. Using the structure of such a query $q$ and the skew in the data, we study tradeoffs between the number of processors, the number of rounds of communication, and the per-processor load -- the number of bits each processor can send or can receive in a single round -- that are required to compute $q$. When the data is free of skew, we obtain essentially tight upper and lower bounds for one round algorithms and we show how the bounds degrade when there is skew in the data. In the case of skewed data, we show how to improve the algorithms when approximate degrees of the heavy-hitter elements are available, obtaining essentially optimal algorithms for queries such as simple joins and triangle join queries. For queries that we identify as tree-like, we also prove nearly matching upper and lower bounds for multi-round algorithms for a natural class of skew-free databases. One consequence of these latter lower bounds is that for any $\varepsilon>0$, using $p$ processors to compute the connected components of a graph, or to output the path, if any, between a specified pair of vertices of a graph with $m$ edges and per-processor load that is $O(m/p^{1-\varepsilon})$ requires $Ω(\log p)$ rounds of communication. Our upper bounds are given by simple structured algorithms using MapReduce. Our one-round lower bounds are proved in a very general model, which we call the Massively Parallel Communication (MPC) model, that allows processors to communicate arbitrary bits. Our multi-round lower bounds apply in a restricted version of the MPC model in which processors in subsequent rounds after the first communication round are only allowed to send tuples. △ Less

Submitted 19 February, 2016; originally announced February 2016.

Comments: arXiv admin note: substantial text overlap with arXiv:1306.5972

arXiv:1501.07864 [pdf, other]

A Trichotomy in the Data Complexity of Certain Query Answering for Conjunctive Queries

Authors: Paraschos Koutris, Jef Wijsen

Abstract: A relational database is said to be uncertain if primary key constraints can possibly be violated. A repair (or possible world) of an uncertain database is obtained by selecting a maximal number of tuples without ever selecting two distinct tuples with the same primary key value. For any Boolean query q, CERTAINTY(q) is the problem that takes an uncertain database db on input, and asks whether q i… ▽ More A relational database is said to be uncertain if primary key constraints can possibly be violated. A repair (or possible world) of an uncertain database is obtained by selecting a maximal number of tuples without ever selecting two distinct tuples with the same primary key value. For any Boolean query q, CERTAINTY(q) is the problem that takes an uncertain database db on input, and asks whether q is true in every repair of db. The complexity of this problem has been particularly studied for q ranging over the class of self-join-free Boolean conjunctive queries. A research challenge is to determine, given q, whether CERTAINTY(q) belongs to complexity classes FO, P, or coNP-complete. In this paper, we combine existing techniques for studying the above complexity classification task. We show that for any self-join-free Boolean conjunctive query q, it can be decided whether or not CERTAINTY(q) is in FO. Further, for any self-join-free Boolean conjunctive query q, CERTAINTY(q) is either in P or coNP-complete, and the complexity dichotomy is effective. This settles a research question that has been open for ten years, since [9]. △ Less

Submitted 30 January, 2015; originally announced January 2015.

arXiv:1412.3869 [pdf, other]

Answering Conjunctive Queries with Inequalities

Authors: Paraschos Koutris, Tova Milo, Sudeepa Roy, Dan Suciu

Abstract: In this paper, we study the complexity of answering conjunctive queries (CQ) with inequalities). In particular, we are interested in comparing the complexity of the query with and without inequalities. The main contribution of our work is a novel combinatorial technique that enables us to use any Select-Project-Join query plan for a given CQ without inequalities in answering the CQ with inequaliti… ▽ More In this paper, we study the complexity of answering conjunctive queries (CQ) with inequalities). In particular, we are interested in comparing the complexity of the query with and without inequalities. The main contribution of our work is a novel combinatorial technique that enables us to use any Select-Project-Join query plan for a given CQ without inequalities in answering the CQ with inequalities, with an additional factor in running time that only depends on the query. The key idea is to define a new projection operator, which keeps a small representation (independent of the size of the database) of the set of input tuples that map to each tuple in the output of the projection; this representation is used to evaluate all the inequalities in the query. Second, we generalize a result by Papadimitriou-Yannakakis [17] and give an alternative algorithm based on the color-coding technique [4] to evaluate a CQ with inequalities by using an algorithm for the CQ without inequalities. Third, we investigate the structure of the query graph, inequality graph, and the augmented query graph with inequalities, and show that even if the query and the inequality graphs have bounded treewidth, the augmented graph not only can have an unbounded treewidth but can also be NP-hard to evaluate. Further, we illustrate classes of queries and inequalities where the augmented graphs have unbounded treewidth, but the CQ with inequalities can be evaluated in poly-time. Finally, we give necessary properties and sufficient properties that allow a class of CQs to have poly-time combined complexity with respect to any inequality pattern. We also illustrate classes of queries where our query-plan-based technique outperforms the alternative approaches discussed in the paper. △ Less

Submitted 11 December, 2014; originally announced December 2014.

arXiv:1401.1872 [pdf, other]

Skew in Parallel Query Processing

Authors: Paul Beame, Paraschos Koutris, Dan Suciu

Abstract: We study the problem of computing a conjunctive query q in parallel, using p of servers, on a large database. We consider algorithms with one round of communication, and study the complexity of the communication. We are especially interested in the case where the data is skewed, which is a major challenge for scalable parallel query processing. We establish a tight connection between the fractiona… ▽ More We study the problem of computing a conjunctive query q in parallel, using p of servers, on a large database. We consider algorithms with one round of communication, and study the complexity of the communication. We are especially interested in the case where the data is skewed, which is a major challenge for scalable parallel query processing. We establish a tight connection between the fractional edge packings of the query and the amount of communication, in two cases. First, in the case when the only statistics on the database are the cardinalities of the input relations, and the data is skew-free, we provide matching upper and lower bounds (up to a poly log p factor) expressed in terms of fractional edge packings of the query q. Second, in the case when the relations are skewed and the heavy hitters and their frequencies are known, we provide upper and lower bounds (up to a poly log p factor) expressed in terms of packings of residual queries obtained by specializing the query to a heavy hitter. All our lower bounds are expressed in the strongest form, as number of bits needed to be communicated between processors with unlimited computational power. Our results generalizes some prior results on uniform databases (where each relation is a matching) [4], and other lower bounds for the MapReduce model [1]. △ Less

Submitted 8 January, 2014; originally announced January 2014.

arXiv:1306.5972 [pdf, other]

Communication Steps for Parallel Query Processing

Authors: Paul Beame, Paraschos Koutris, Dan Suciu

Abstract: We consider the problem of computing a relational query $q$ on a large input database of size $n$, using a large number $p$ of servers. The computation is performed in rounds, and each server can receive only $O(n/p^{1-\varepsilon})$ bits of data, where $\varepsilon \in [0,1]$ is a parameter that controls replication. We examine how many global communication steps are needed to compute $q$. We est… ▽ More We consider the problem of computing a relational query $q$ on a large input database of size $n$, using a large number $p$ of servers. The computation is performed in rounds, and each server can receive only $O(n/p^{1-\varepsilon})$ bits of data, where $\varepsilon \in [0,1]$ is a parameter that controls replication. We examine how many global communication steps are needed to compute $q$. We establish both lower and upper bounds, in two settings. For a single round of communication, we give lower bounds in the strongest possible model, where arbitrary bits may be exchanged; we show that any algorithm requires $\varepsilon \geq 1-1/τ^*$, where $τ^*$ is the fractional vertex cover of the hypergraph of $q$. We also give an algorithm that matches the lower bound for a specific class of databases. For multiple rounds of communication, we present lower bounds in a model where routing decisions for a tuple are tuple-based. We show that for the class of tree-like queries there exists a tradeoff between the number of rounds and the space exponent $\varepsilon$. The lower bounds for multiple rounds are the first of their kind. Our results also imply that transitive closure cannot be computed in O(1) rounds of communication. △ Less

Submitted 25 June, 2013; originally announced June 2013.

ACM Class: H.2.4

arXiv:1212.6636 [pdf, other]

A Dichotomy on the Complexity of Consistent Query Answering for Atoms with Simple Keys

Authors: Paraschos Koutris, Dan Suciu

Abstract: We study the problem of consistent query answering under primary key violations. In this setting, the relations in a database violate the key constraints and we are interested in maximal subsets of the database that satisfy the constraints, which we call repairs. For a boolean query Q, the problem CERTAINTY(Q) asks whether every such repair satisfies the query or not; the problem is known to be al… ▽ More We study the problem of consistent query answering under primary key violations. In this setting, the relations in a database violate the key constraints and we are interested in maximal subsets of the database that satisfy the constraints, which we call repairs. For a boolean query Q, the problem CERTAINTY(Q) asks whether every such repair satisfies the query or not; the problem is known to be always in coNP for conjunctive queries. However, there are queries for which it can be solved in polynomial time. It has been conjectured that there exists a dichotomy on the complexity of CERTAINTY(Q) for conjunctive queries: it is either in PTIME or coNP-complete. In this paper, we prove that the conjecture is indeed true for the case of conjunctive queries without self-joins, where each atom has as a key either a single attribute (simple key) or all attributes of the atom. △ Less

Submitted 15 January, 2014; v1 submitted 29 December, 2012; originally announced December 2012.

arXiv:1109.5325 [pdf, other]

Online Sum-Radii Clustering

Authors: Dimitris Fotakis, Paraschos Koutris

Abstract: In Online Sum-Radii Clustering, n demand points arrive online and must be irrevocably assigned to a cluster upon arrival. The cost of each cluster is the sum of a fixed opening cost and its radius, and the objective is to minimize the total cost of the clusters opened by the algorithm. We show that the deterministic competitive ratio of Online Sum-Radii Clustering for general metric spaces is Θ(\l… ▽ More In Online Sum-Radii Clustering, n demand points arrive online and must be irrevocably assigned to a cluster upon arrival. The cost of each cluster is the sum of a fixed opening cost and its radius, and the objective is to minimize the total cost of the clusters opened by the algorithm. We show that the deterministic competitive ratio of Online Sum-Radii Clustering for general metric spaces is Θ(\log n), where the upper bound follows from a primal-dual algorithm and holds for general metric spaces, and the lower bound is valid for ternary Hierarchically Well-Separated Trees (HSTs) and for the Euclidean plane. Combined with the results of (Csirik et al., MFCS 2010), this result demonstrates that the deterministic competitive ratio of Online Sum-Radii Clustering changes abruptly, from constant to logarithmic, when we move from the line to the plane. We also show that Online Sum-Radii Clustering in metric spaces induced by HSTs is closely related to the Parking Permit problem introduced by (Meyerson, FOCS 2005). Exploiting the relation to Parking Permit, we obtain a lower bound of Ω(\log\log n) on the randomized competitive ratio of Online Sum-Radii Clustering in tree metrics. Moreover, we present a simple randomized O(\log n)-competitive algorithm, and a deterministic O(\log\log n)-competitive algorithm for the fractional version of the problem. △ Less

Submitted 18 February, 2013; v1 submitted 25 September, 2011; originally announced September 2011.

Comments: Supported by the project AlgoNow, co-financed by the European Union (European Social Fund - ESF) and Greek national funds, through the Operational Program "Education and Lifelong Learning", under the research funding program THALES. An extended abstract of this work appeared in the Proc. of MFCS 2012, B. Rovan, V. Sassone, and P. Widmayer (Editors), LNCS 7464, pp. 395-406, 2012

Showing 1–38 of 38 results for author: Koutris, P