Search | arXiv e-print repository

A Comprehensive Tutorial on over 100 Years of Diagrammatic Representations of Logical Statements and Relational Queries

Abstract: Query formulation is increasingly performed by systems that need to guess a user's intent (e.g. via spoken word interfaces). But how can a user know that the computational agent is returning answers to the "right" query? More generally, given that relational queries can become pretty complicated, how can we help users understand relational queries, whether human-generated or automatically generate… ▽ More Query formulation is increasingly performed by systems that need to guess a user's intent (e.g. via spoken word interfaces). But how can a user know that the computational agent is returning answers to the "right" query? More generally, given that relational queries can become pretty complicated, how can we help users understand relational queries, whether human-generated or automatically generated? Now seems the right moment to revisit a topic that predates the birth of the relational model: develo** visual metaphors that help users understand relational queries. This lecture-style tutorial surveys the key visual metaphors developed for diagrammatic representations of logical statements and relational expressions, across both the relational database and the much older diagrammatic reasoning communities. We survey the history and state-of-the-art of relationally-complete diagrammatic representations of relational queries, discuss the key visual metaphors developed in over a century of investigations into diagrammatic languages, and organize the landscape by map** the visual alphabets of diagrammatic representation systems to the syntax and semantics of Relational Algebra (RA) and Relational Calculus (RC). Tutorial website: https://northeastern-datalab.github.io/diagrammatic-representation-tutorial/ △ Less

Submitted 4 March, 2024; originally announced April 2024.

Comments: 6 pages, 2 figures, preprint of ICDE 2024 tutorial. arXiv admin note: substantial text overlap with arXiv:2308.10319

arXiv:2401.04758 [pdf, other]

doi 10.1145/3639316

On The Reasonable Effectiveness of Relational Diagrams: Explaining Relational Query Patterns and the Pattern Expressiveness of Relational Languages

Authors: Wolfgang Gatterbauer, Cody Dunne

Abstract: Comparing relational languages by their logical expressiveness is well understood. Less well understood is how to compare relational languages by their ability to represent relational query patterns. Indeed, what are query patterns other than "a certain way of writing a query"? And how can query patterns be defined across procedural and declarative languages, irrespective of their syntax? To the b… ▽ More Comparing relational languages by their logical expressiveness is well understood. Less well understood is how to compare relational languages by their ability to represent relational query patterns. Indeed, what are query patterns other than "a certain way of writing a query"? And how can query patterns be defined across procedural and declarative languages, irrespective of their syntax? To the best of our knowledge, we provide the first semantic definition of relational query patterns by using a variant of structure-preserving map**s between the relational tables of queries. This formalism allows us to analyze the relative pattern expressiveness of relational language fragments and create a hierarchy of languages with equal logical expressiveness yet different pattern expressiveness. Notably, for the non-disjunctive language fragment, we show that relational calculus can express a larger class of patterns than the basic operators of relational algebra. Our language-independent definition of query patterns opens novel paths for assisting database users. For example, these patterns could be leveraged to create visual query representations that faithfully represent query patterns, speed up interpretation, and provide visual feedback during query editing. As a concrete example, we propose Relational Diagrams, a complete and sound diagrammatic representation of safe relational calculus that is provably (i) unambiguous, (ii) relationally complete, and (iii) able to represent all query patterns for unions of non-disjunctive queries. Among all diagrammatic representations for relational queries that we are aware of, ours is the only one with these three properties. Furthermore, our anonymously preregistered user study shows that Relational Diagrams allow users to recognize patterns meaningfully faster and more accurately than SQL. △ Less

Submitted 9 January, 2024; originally announced January 2024.

Comments: 71 pages, 49 figures, full version of SIGMOD 2024 paper of same title: https://doi.org/10.1145/3639316. arXiv admin note: text overlap with arXiv:2203.07284

arXiv:2401.00013 [pdf, other]

HITSnDIFFs: From Truth Discovery to Ability Discovery by Recovering Matrices with the Consecutive Ones Property

Authors: Zixuan Chen, Subhodeep Mitra, R Ravi, Wolfgang Gatterbauer

Abstract: We analyze a general problem in a crowd-sourced setting where one user asks a question (also called item) and other users return answers (also called labels) for this question. Different from existing crowd sourcing work which focuses on finding the most appropriate label for the question (the "truth"), our problem is to determine a ranking of the users based on their ability to answer questions.… ▽ More We analyze a general problem in a crowd-sourced setting where one user asks a question (also called item) and other users return answers (also called labels) for this question. Different from existing crowd sourcing work which focuses on finding the most appropriate label for the question (the "truth"), our problem is to determine a ranking of the users based on their ability to answer questions. We call this problem "ability discovery" to emphasize the connection to and duality with the more well-studied problem of "truth discovery". To model items and their labels in a principled way, we draw upon Item Response Theory (IRT) which is the widely accepted theory behind standardized tests such as SAT and GRE. We start from an idealized setting where the relative performance of users is consistent across items and better users choose better fitting labels for each item. We posit that a principled algorithmic solution to our more general problem should solve this ideal setting correctly and observe that the response matrices in this setting obey the Consecutive Ones Property (C1P). While C1P is well understood algorithmically with various discrete algorithms, we devise a novel variant of the HITS algorithm which we call "HITSNDIFFS" (or HND), and prove that it can recover the ideal C1P-permutation in case it exists. Unlike fast combinatorial algorithms for finding the consecutive ones permutation (if it exists), HND also returns an ordering when such a permutation does not exist. Thus it provides a principled heuristic for our problem that is guaranteed to return the correct answer in the ideal setting. Our experiments show that HND produces user rankings with robustly high accuracy compared to state-of-the-art truth discovery methods. We also show that our novel variant of HITS scales better in the number of users than ABH, the only prior spectral C1P reconstruction algorithm. △ Less

Submitted 21 December, 2023; originally announced January 2024.

Comments: 22 pages, 14 figures, long version of of ICDE 2024 conference paper

arXiv:2308.10319 [pdf, other]

doi 10.14778/3611540.3611578

A Tutorial on Visual Representations of Relational Queries

Authors: Wolfgang Gatterbauer

Abstract: Query formulation is increasingly performed by systems that need to guess a user's intent (e.g. via spoken word interfaces). But how can a user know that the computational agent is returning answers to the "right" query? More generally, given that relational queries can become pretty complicated, how can we help users understand existing relational queries, whether human-generated or automatically… ▽ More Query formulation is increasingly performed by systems that need to guess a user's intent (e.g. via spoken word interfaces). But how can a user know that the computational agent is returning answers to the "right" query? More generally, given that relational queries can become pretty complicated, how can we help users understand existing relational queries, whether human-generated or automatically generated? Now seems the right moment to revisit a topic that predates the birth of the relational model: develo** visual metaphors that help users understand relational queries. This lecture-style tutorial surveys the key visual metaphors developed for visual representations of relational expressions. We will survey the history and state-of-the art of relationally-complete diagrammatic representations of relational queries, discuss the key visual metaphors developed in over a century of investigating diagrammatic languages, and organize the landscape by map** their used visual alphabets to the syntax and semantics of Relational Algebra (RA) and Relational Calculus (RC). △ Less

Submitted 20 August, 2023; originally announced August 2023.

Comments: 4 page tutorial paper at VLDB 2023, tutorial web page with slides to be posted in time: https://northeastern-datalab.github.io/visual-query-representation-tutorial/. arXiv admin note: text overlap with arXiv:2208.01613

arXiv:2307.00465 [pdf, other]

Towards Unbiased Exploration in Partial Label Learning

Authors: Zsolt Zombori, Agapi Rissaki, Kristóf Szabó, Wolfgang Gatterbauer, Michael Benedikt

Abstract: We consider learning a probabilistic classifier from partially-labelled supervision (inputs denoted with multiple possibilities) using standard neural architectures with a softmax as the final layer. We identify a bias phenomenon that can arise from the softmax layer in even simple architectures that prevents proper exploration of alternative options, making the dynamics of gradient descent overly… ▽ More We consider learning a probabilistic classifier from partially-labelled supervision (inputs denoted with multiple possibilities) using standard neural architectures with a softmax as the final layer. We identify a bias phenomenon that can arise from the softmax layer in even simple architectures that prevents proper exploration of alternative options, making the dynamics of gradient descent overly sensitive to initialisation. We introduce a novel loss function that allows for unbiased exploration within the space of alternative outputs. We give a theoretical justification for our loss function, and provide an extensive evaluation of its impact on synthetic data, on standard partially labelled benchmarks and on a contributed novel benchmark related to an existing rule learning challenge. △ Less

Submitted 1 July, 2023; originally announced July 2023.

arXiv:2305.16525 [pdf, other]

doi 10.1145/3584372.3588670

Efficient Computation of Quantiles over Joins

Authors: Nikolaos Tziavelis, Nofar Carmeli, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald

Abstract: We present efficient algorithms for Quantile Join Queries, abbreviated as %JQ. A %JQ asks for the answer at a specified relative position (e.g., 50% for the median) under some ordering over the answers to a Join Query (JQ). Our goal is to avoid materializing the set of all join answers, and to achieve quasilinear time in the size of the database, regardless of the total number of answers. A recent… ▽ More We present efficient algorithms for Quantile Join Queries, abbreviated as %JQ. A %JQ asks for the answer at a specified relative position (e.g., 50% for the median) under some ordering over the answers to a Join Query (JQ). Our goal is to avoid materializing the set of all join answers, and to achieve quasilinear time in the size of the database, regardless of the total number of answers. A recent dichotomy result rules out the existence of such an algorithm for a general family of queries and orders. Specifically, for acyclic JQs without self-joins, the problem becomes intractable for ordering by sum whenever we join more than two relations (and these joins are not trivial intersections). Moreover, even for basic ranking functions beyond sum, such as min or max over different attributes, so far it is not known whether there is any nontrivial tractable %JQ. In this work, we develop a new approach to solving %JQ. Our solution uses two subroutines: The first one needs to select what we call a "pivot answer". The second subroutine partitions the space of query answers according to this pivot, and continues searching in one partition that is represented as new %JQ over a new database. For pivot selection, we develop an algorithm that works for a large class of ranking functions that are appropriately monotone. The second subroutine requires a customized construction for the specific ranking function at hand. We show the benefit and generality of our approach by using it to establish several new complexity results. First, we prove the tractability of min and max for all acyclic JQs, thereby resolving the above question. Second, we extend the previous %JQ dichotomy for sum to all partial sums. Third, we handle the intractable cases of sum by devising a deterministic approximation scheme that applies to every acyclic JQ. △ Less

Submitted 25 May, 2023; originally announced May 2023.

arXiv:2212.08898 [pdf, other]

doi 10.1145/3626715

A Unified Approach for Resilience and Causal Responsibility with Integer Linear Programming (ILP) and LP Relaxations

Authors: Neha Makhija, Wolfgang Gatterbauer

Abstract: Resilience is one of the key algorithmic problems underlying various forms of reverse data management (such as view maintenance, deletion propagation, and various interventions for fairness): What is the minimal number of tuples to delete from a database in order to remove all answers from a query? A long-open question is determining those conjunctive queries (CQs) for which this problem can be so… ▽ More Resilience is one of the key algorithmic problems underlying various forms of reverse data management (such as view maintenance, deletion propagation, and various interventions for fairness): What is the minimal number of tuples to delete from a database in order to remove all answers from a query? A long-open question is determining those conjunctive queries (CQs) for which this problem can be solved in guaranteed PTIME. We shed new light on this and the related problem of causal responsibility by proposing a unified Integer Linear Programming (ILP) formulation. It is unified in that it can solve both prior studied restrictions (e.g., self-join-free CQs under set semantics that allow a PTIME solution) and new cases (e.g., all CQs under set or bag semantics It is also unified in that all queries and all instances are treated with the same approach, and the algorithm is guaranteed to terminate in PTIME for the easy cases. We prove that, for all easy self-join-free CQs, the Linear Programming (LP) relaxation of our encoding is identical to the ILP solution and thus standard ILP solvers are guaranteed to return the solution in PTIME. Our approach opens up the door to new variants and new fine-grained analysis: 1) It also works under bag semantics and we give the first dichotomy result for bags semantics in the problem space. 2) We give a more fine-grained analysis of the complexity of causal responsibility. 3) We recover easy instances for generally hard queries, such as instances with read-once provenance and instances that become easy because of Functional Dependencies in the data. 4) We solve an open conjecture from PODS 2020. 5) Experiments confirm that our results indeed predict the asymptotic running times, and that our universal ILP encoding is at times even faster to solve for the PTIME cases than a prior proposed dedicated flow algorithm. △ Less

Submitted 20 October, 2023; v1 submitted 17 December, 2022; originally announced December 2022.

Comments: 43 pages, 15 figures

Journal ref: Proc. ACM Manag. Data 1, 4 (SIGMOD), Article 228 (December 2023), 43 pages

arXiv:2209.13589 [pdf, other]

SANTOS: Relationship-based Semantic Table Union Search

Authors: Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

Abstract: Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new n… ▽ More Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationship between pairs of columns. The first uses an existing knowledge base (KB), the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating a synthesized KBs from data lakes with limited KB coverage and using them for union search. △ Less

Submitted 27 September, 2022; originally announced September 2022.

Comments: 15 pages, 10 figures, to appear at SIGMOD 2023

arXiv:2208.01613 [pdf, other]

Principles of Query Visualization

Authors: Wolfgang Gatterbauer, Cody Dunne, H. V. Jagadish, Mirek Riedewald

Abstract: Query Visualization (QV) is the problem of transforming a given query into a graphical representation that helps humans understand its meaning. This task is notably different from designing a Visual Query Language (VQL) that helps a user compose a query. This article discusses the principles of relational query visualization and its potential for simplifying user interactions with relational data. Query Visualization (QV) is the problem of transforming a given query into a graphical representation that helps humans understand its meaning. This task is notably different from designing a Visual Query Language (VQL) that helps a user compose a query. This article discusses the principles of relational query visualization and its potential for simplifying user interactions with relational data. △ Less

Submitted 2 August, 2022; originally announced August 2022.

Comments: 20 pages, 12 figures, preprint for IEEE Data Engineering Bulletin

arXiv:2205.05649 [pdf, other]

Any-k Algorithms for Enumerating Ranked Answers to Conjunctive Queries

Authors: Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald

Abstract: We study ranked enumeration for Conjunctive Queries (CQs) where the answers are ordered by a given ranking function (e.g., an ORDER BY clause in SQL). We develop "any-k" algorithms, which, without knowing the number k of desired answers, push down the ranking into joins by carefully ordering the computation of intermediate tuples and avoiding materialization of join answers until they are needed.… ▽ More We study ranked enumeration for Conjunctive Queries (CQs) where the answers are ordered by a given ranking function (e.g., an ORDER BY clause in SQL). We develop "any-k" algorithms, which, without knowing the number k of desired answers, push down the ranking into joins by carefully ordering the computation of intermediate tuples and avoiding materialization of join answers until they are needed. For this to be possible, the ranking function needs to obey a particular type of monotonicity. Supported ranking functions include the common sum-of-weights, where answers are compared by the sum of input-tuple weights, as well as any commutative selective dioid. Our results extend a well-known unranked-enumeration dichotomy, which states that only free-connex CQs are tractable (under certain hardness hypotheses and for CQs without self-joins). For this class of queries and with n denoting the size of the input, the data complexity of our ranked enumeration approach for the time to the kth CQ answer is O(n+klogk), which is only a logarithmic factor slower than the O(n+k) unranked-enumeration guarantee. A core insight of our work is that ranked enumeration for CQs is closely related to Dynamic Programming and the fundamental task of path enumeration in a weighted DAG. We uncover a previously unknown tradeoff: one any-k algorithm has lower complexity when the number of returned answers is small, the other when their number is large. This tradeoff is eliminated under a stricter monotonicity property that we define and exploit for a novel algorithm that asymptotically dominates all previously known alternatives, including Eppstein's algorithm for sum-of-weights path enumeration. We empirically demonstrate the findings of our theoretical analysis in an experimental study that highlights the superiority of our approach over the join-then-rank approach that existing database systems follow. △ Less

Submitted 12 October, 2023; v1 submitted 11 May, 2022; originally announced May 2022.

arXiv:2203.07284 [pdf, other]

Relational Diagrams: a pattern-preserving diagrammatic representation of non-disjunctive Relational Queries

Authors: Wolfgang Gatterbauer, Cody Dunne, Mirek Riedewald

Abstract: Analyzing relational languages by their logical expressiveness is well understood. Something not well understood or even formalized is the vague concept of relational query patterns. What are query patterns? And how can we reason about query patterns across different relational languages, irrespective of their syntax and their procedural or declarative nature? In this paper, we formalize the conce… ▽ More Analyzing relational languages by their logical expressiveness is well understood. Something not well understood or even formalized is the vague concept of relational query patterns. What are query patterns? And how can we reason about query patterns across different relational languages, irrespective of their syntax and their procedural or declarative nature? In this paper, we formalize the concept of query patterns with a variant of pattern-preserving map**s between the relational atoms of queries. This formalism allows us to analyze the relative pattern expressiveness of relational query languages and to create a hierarchy of languages with equal logical expressiveness yet different pattern expressiveness. In this analysis, relational calculus can expressive more patterns than the basic operators of relational algebra. We additionally contribute an intuitive, complete, and sound diagrammatic representation of safe relational calculus that is not only relationally complete, but can also express all logical patterns for the large and useful fragment of non-disjunctive relational calculus. Among all diagrammatic representations for relational queries that we are aware of, this is the only one that is relationally complete and that can represent all logical patterns in the non-disjunctive fragment. △ Less

Submitted 14 March, 2022; originally announced March 2022.

Comments: 23 pages, 29 figures

arXiv:2105.14307 [pdf, other]

Minimally Factorizing the Provenance of Self-join Free Conjunctive Queries

Authors: Neha Makhija, Wolfgang Gatterbauer

Abstract: We consider the problem of finding the minimal-size factorization of the provenance of self-join-free conjunctive queries, i.e., we want to find a formula that minimizes the number of variable repetitions. This problem is equivalent to solving the fundamental Boolean formula factorization problem for the restricted setting of the provenance formulas of self-join free queries. While general Boolean… ▽ More We consider the problem of finding the minimal-size factorization of the provenance of self-join-free conjunctive queries, i.e., we want to find a formula that minimizes the number of variable repetitions. This problem is equivalent to solving the fundamental Boolean formula factorization problem for the restricted setting of the provenance formulas of self-join free queries. While general Boolean formula minimization is $Σ^p_2$-complete, we show that the problem is NP-C in our case. Additionally, we identify a large category of queries that can be solved in PTIME, expanding beyond the previously known tractable cases of read-once formulas and hierarchical queries. We describe connections between factorizations, Variable Elimination Orders (VEOs), and minimal query plans. We leverage these insights to create an Integer Linear Program (ILP) that can solve the minimal factorization problem exactly. We also propose a Max-Flow Min-Cut (MFMC) based algorithm that gives an efficient approximate solution. Importantly, we show that both the Linear Programming (LP) relaxation of our ILP, and our MFMC-based algorithm are always correct for all currently known PTIME cases. Thus, we present two unified algorithms (ILP and MFMC) that can both recover all known PTIME cases in PTIME, yet also solve NP-complete cases either exactly (ILP) or approximately (MFMC), as desired. △ Less

Submitted 14 May, 2024; v1 submitted 29 May, 2021; originally announced May 2021.

Comments: 57 pages, 38 figures

arXiv:2103.09940 [pdf, other]

DomainNet: Homograph Detection for Data Lake Disambiguation

Authors: Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

Abstract: Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we sh… ▽ More Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% versus 69% with our method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top-200 precision is 89%. △ Less

Submitted 22 March, 2021; v1 submitted 17 March, 2021; originally announced March 2021.

Comments: Full version of paper appearing in EDBT 2021

arXiv:2101.12158 [pdf, other]

doi 10.14778/3476249.3476306

Beyond Equi-joins: Ranking, Enumeration and Factorization

Authors: Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald

Abstract: We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with $n$ denoting the number of tuples in the database, we guarantee for acyclic full joi… ▽ More We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with $n$ denoting the number of tuples in the database, we guarantee for acyclic full join queries with inequality conditions that for every value of $k$, the $k$ top-ranked answers are returned in $\mathcal{O}(n \operatorname{polylog} n + k \log k)$ time. This is within a polylogarithmic factor of $\mathcal{O}(n + k \log k)$, i.e., the best known complexity for equi-joins, and even of $\mathcal{O}(n+k)$, i.e., the time it takes to look at the input and return $k$ answers in any order. Our guarantees extend to join queries with selections and many types of projections (namely those called "free-connex" queries and those that use bag semantics). Remarkably, they hold even when the number of join results is $n^\ell$ for a join of $\ell$ relations. The key ingredient is a novel $\mathcal{O}(n \operatorname{polylog} n)$-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. In addition to providing the first non-trivial theoretical guarantees beyond equi-joins, we show in an experimental study that our ranked-enumeration approach is also memory-efficient and fast in practice, beating the running time of state-of-the-art database systems by orders of magnitude. △ Less

Submitted 30 August, 2021; v1 submitted 28 January, 2021; originally announced January 2021.

Comments: 21 pages

Journal ref: PVLDB, 14(11):2599-2612, 2021

arXiv:2012.11965 [pdf, other]

Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries

Authors: Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald

Abstract: We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in database size. Specifically, we embark on the challenge of identifying the tractable answer orderings, that is… ▽ More We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in database size. Specifically, we embark on the challenge of identifying the tractable answer orderings, that is, those orders that allow for such complexity guarantees. To better understand the computational challenge at hand, we also investigate the more modest task of providing access to only a single answer (i.e., finding the answer at a given position), a task that we refer to as the selection problem, and ask when it can be performed in quasilinear time. We also explore the question of when selection is indeed easier than ranked direct access. We begin with lexicographic orders. For each of the two problems, we give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orders for every CQ without self-joins. We then continue to the more general orders by the sum of attribute weights and establish the corresponding decidable characterizations, for each of the two problems, of the tractable CQs without self-joins. Finally, we explore the question of when the satisfaction of Functional Dependencies (FDs) can be utilized for tractability, and establish the corresponding generalizations of our characterizations for every set of unary FDs. △ Less

Submitted 28 November, 2022; v1 submitted 22 December, 2020; originally announced December 2020.

Comments: 44 pages

arXiv:2005.00448 [pdf, other]

doi 10.1145/3318464.3383132

Optimal Join Algorithms Meet Top-k

Authors: Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald

Abstract: Top-k queries have been studied intensively in the database community and they are an important means to reduce query cost when only the "best" or "most interesting" results are needed instead of the full output. While some optimality results exist, e.g., the famous Threshold Algorithm, they hold only in a fairly limited model of computation that does not account for the cost incurred by large int… ▽ More Top-k queries have been studied intensively in the database community and they are an important means to reduce query cost when only the "best" or "most interesting" results are needed instead of the full output. While some optimality results exist, e.g., the famous Threshold Algorithm, they hold only in a fairly limited model of computation that does not account for the cost incurred by large intermediate results and hence is not aligned with typical database-optimizer cost models. On the other hand, the idea of avoiding large intermediate results is arguably the main goal of recent work on optimal join algorithms, which uses the standard RAM model of computation to determine algorithm complexity. This research has created a lot of excitement due to its promise of reducing the time complexity of join queries with cycles, but it has mostly focused on full-output computation. We argue that the two areas can and should be studied from a unified point of view in order to achieve optimality in the common model of computation for a very general class of top-k-style join queries. This tutorial has two main objectives. First, we will explore and contrast the main assumptions, concepts, and algorithmic achievements of the two research areas. Second, we will cover recent, as well as some older, approaches that emerged at the intersection to support efficient ranked enumeration of join-query results. These are related to classic work on k-shortest path algorithms and more general optimization problems, some of which dates back to the 1950s. We demonstrate that this line of research warrants renewed attention in the challenging context of ranked enumeration for general join queries. △ Less

Submitted 1 May, 2020; originally announced May 2020.

Comments: To be published in Proceedings ofthe 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD'20), June 14-19, 2020, Portland, OR, USA, 7 pages

arXiv:2004.11375 [pdf]

doi 10.1145/3318464.3389767

QueryVis: Logic-based diagrams help users understand complicated SQL queries faster

Authors: Aristotelis Leventidis, Jiahui Zhang, Cody Dunne, Wolfgang Gatterbauer, H. V. Jagadish, Mirek Riedewald

Abstract: Understanding the meaning of existing SQL queries is critical for code maintenance and reuse. Yet SQL can be hard to read, even for expert users or the original creator of a query. We conjecture that it is possible to capture the logical intent of queries in \emph{automatically-generated visual diagrams} that can help users understand the meaning of queries faster and more accurately than SQL text… ▽ More Understanding the meaning of existing SQL queries is critical for code maintenance and reuse. Yet SQL can be hard to read, even for expert users or the original creator of a query. We conjecture that it is possible to capture the logical intent of queries in \emph{automatically-generated visual diagrams} that can help users understand the meaning of queries faster and more accurately than SQL text alone. We present initial steps in that direction with visual diagrams that are based on the first-order logic foundation of SQL and can capture the meaning of deeply nested queries. Our diagrams build upon a rich history of diagrammatic reasoning systems in logic and were designed using a large body of human-computer interaction best practices: they are \emph{minimal} in that no visual element is superfluous; they are \emph{unambiguous} in that no two queries with different semantics map to the same visualization; and they \emph{extend} previously existing visual representations of relational schemata and conjunctive queries in a natural way. An experimental evaluation involving 42 users on Amazon Mechanical Turk shows that with only a 2--3 minute static tutorial, participants could interpret queries meaningfully faster with our diagrams than when reading SQL alone. Moreover, we have evidence that our visual diagrams result in participants making fewer errors than with SQL. We believe that more regular exposure to diagrammatic representations of SQL can give rise to a \emph{pattern-based} and thus more intuitive use and re-use of SQL. All details on the experimental study, the evaluation stimuli, raw data, and analyses, and source code are available at https://osf.io/mycr2 △ Less

Submitted 23 April, 2020; originally announced April 2020.

Comments: Full version of paper appearing in SIGMOD 2020

arXiv:2004.06101 [pdf, other]

doi 10.1145/3318464.3389750

Near-Optimal Distributed Band-Joins through Recursive Partitioning

Authors: Rundong Li, Wolfgang Gatterbauer, Mirek Riedewald

Abstract: We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings… ▽ More We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance). Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is not only effective for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work. △ Less

Submitted 13 April, 2020; originally announced April 2020.

arXiv:2003.02829 [pdf, other]

doi 10.1145/3318464.3380577

Factorized Graph Representations for Semi-Supervised Learning from Sparse Data

Authors: Krishna Kumar P., Paul Langton, Wolfgang Gatterbauer

Abstract: Node classification is an important problem in graph data management. It is commonly solved by various label propagation methods that work iteratively starting from a few labeled seed nodes. For graphs with arbitrary compatibilities between classes, these methods crucially depend on knowing the compatibility matrix that must be provided by either domain experts or heuristics. Can we instead direct… ▽ More Node classification is an important problem in graph data management. It is commonly solved by various label propagation methods that work iteratively starting from a few labeled seed nodes. For graphs with arbitrary compatibilities between classes, these methods crucially depend on knowing the compatibility matrix that must be provided by either domain experts or heuristics. Can we instead directly estimate the correct compatibilities from a sparsely labeled graph in a principled and scalable way? We answer this question affirmatively and suggest a method called distant compatibility estimation that works even on extremely sparsely labeled graphs (e.g., 1 in 10,000 nodes is labeled) in a fraction of the time it later takes to label the remaining nodes. Our approach first creates multiple factorized graph representations (with size independent of the graph) and then performs estimation on these smaller graph sketches. We define algebraic amplification as the more general idea of leveraging algebraic properties of an algorithm's update equations to amplify sparse signals. We show that our estimator is by orders of magnitude faster than an alternative approach and that the end-to-end classification accuracy is comparable to using gold standard compatibilities. This makes it a cheap preprocessing step for any existing label propagation method and removes the current dependence on heuristics. △ Less

Submitted 5 March, 2020; originally announced March 2020.

Comments: SIGMOD 2020 (Extended version)

arXiv:1911.05582 [pdf, other]

Optimal Algorithms for Ranked Enumeration of Answers to Full Conjunctive Queries

Authors: Nikolaos Tziavelis, Deepak Ajwani, Wolfgang Gatterbauer, Mirek Riedewald, Xiaofeng Yang

Abstract: We study ranked enumeration of join-query results according to very general orders defined by selective dioids. Our main contribution is a framework for ranked enumeration over a class of dynamic programming problems that generalizes seemingly different problems that had been studied in isolation. To this end, we extend classic algorithms that find the k-shortest paths in a weighted graph. For ful… ▽ More We study ranked enumeration of join-query results according to very general orders defined by selective dioids. Our main contribution is a framework for ranked enumeration over a class of dynamic programming problems that generalizes seemingly different problems that had been studied in isolation. To this end, we extend classic algorithms that find the k-shortest paths in a weighted graph. For full conjunctive queries, including cyclic ones, our approach is optimal in terms of the time to return the top result and the delay between results. These optimality properties are derived for the widely used notion of data complexity, which treats query size as a constant. By performing a careful cost analysis, we are able to uncover a previously unknown tradeoff between two incomparable enumeration approaches: one has lower complexity when the number of returned results is small, the other when the number is very large. We theoretically and empirically demonstrate the superiority of our techniques over batch algorithms, which produce the full result and then sort it. Our technique is not only faster for returning the first few results, but on some inputs beats the batch algorithm even when all results are produced. △ Less

Submitted 11 September, 2020; v1 submitted 13 November, 2019; originally announced November 2019.

Comments: 50 pages, 19 figures

arXiv:1907.01129 [pdf, other]

New Results for the Complexity of Resilience for Binary Conjunctive Queries with Self-Joins

Authors: Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou

Abstract: The resilience of a Boolean query is the minimum number of tuples that need to be deleted from the input tables in order to make the query false. A solution to this problem immediately translates into a solution for the more widely known problem of deletion propagation with source-side effects. In this paper, we give several novel results on the hardness of the resilience problem for… ▽ More The resilience of a Boolean query is the minimum number of tuples that need to be deleted from the input tables in order to make the query false. A solution to this problem immediately translates into a solution for the more widely known problem of deletion propagation with source-side effects. In this paper, we give several novel results on the hardness of the resilience problem for $\textit{binary conjunctive queries with self-joins}$ (i.e. conjunctive queries with relations of maximal arity 2) with one repeated relation. Unlike in the self-join free case, the concept of triad is not enough to fully characterize the complexity of resilience. We identify new structural properties, namely chains, confluences and permutations, which lead to various $NP$-hardness results. We also give novel involved reductions to network flow to show certain cases are in $P$. Overall, we give a dichotomy result for the restricted setting when one relation is repeated at most 2 times, and we cover many of the cases for 3. Although restricted, our results provide important insights into the problem of self-joins that we hope can help solve the general case of all conjunctive queries with self-joins in the future. △ Less

Submitted 15 June, 2020; v1 submitted 1 July, 2019; originally announced July 2019.

Comments: 23 pages, 19 figures, included a new section

arXiv:1806.10078 [pdf, other]

A General Framework for Anytime Approximation in Probabilistic Databases

Authors: Maarten Van den Heuvel, Floris Geerts, Wolfgang Gatterbauer, Martin Theobald

Abstract: Anytime approximation algorithms that compute the probabilities of queries over probabilistic databases can be of great use to statistical learning tasks. Those approaches have been based so far on either (i) sampling or (ii) branch-and-bound with model-based bounds. We present here a more general branch-and-bound framework that extends the possible bounds by using 'dissociation', which yields tig… ▽ More Anytime approximation algorithms that compute the probabilities of queries over probabilistic databases can be of great use to statistical learning tasks. Those approaches have been based so far on either (i) sampling or (ii) branch-and-bound with model-based bounds. We present here a more general branch-and-bound framework that extends the possible bounds by using 'dissociation', which yields tighter bounds. △ Less

Submitted 3 July, 2018; v1 submitted 26 June, 2018; originally announced June 2018.

Comments: 3 pages, 2 figures, submitted to StarAI 2018 Workshop

arXiv:1802.06060 [pdf, other]

doi 10.1145/3178876.3186115

Any-k: Anytime Top-k Tree Pattern Retrieval in Labeled Graphs

Authors: Xiaofeng Yang, Deepak Ajwani, Wolfgang Gatterbauer, Patrick K. Nicholson, Mirek Riedewald, Alessandra Sala

Abstract: Many problems in areas as diverse as recommendation systems, social network analysis, semantic search, and distributed root cause analysis can be modeled as pattern search on labeled graphs (also called "heterogeneous information networks" or HINs). Given a large graph and a query pattern with node and edge label constraints, a fundamental challenge is to nd the top-k matches ac- cording to a rank… ▽ More Many problems in areas as diverse as recommendation systems, social network analysis, semantic search, and distributed root cause analysis can be modeled as pattern search on labeled graphs (also called "heterogeneous information networks" or HINs). Given a large graph and a query pattern with node and edge label constraints, a fundamental challenge is to nd the top-k matches ac- cording to a ranking function over edge and node weights. For users, it is di cult to select value k . We therefore propose the novel notion of an any-k ranking algorithm: for a given time budget, re- turn as many of the top-ranked results as possible. Then, given additional time, produce the next lower-ranked results quickly as well. It can be stopped anytime, but may have to continues until all results are returned. This paper focuses on acyclic patterns over arbitrary labeled graphs. We are interested in practical algorithms that effectively exploit (1) properties of heterogeneous networks, in particular selective constraints on labels, and (2) that the users often explore only a fraction of the top-ranked results. Our solution, KARPET, carefully integrates aggressive pruning that leverages the acyclic nature of the query, and incremental guided search. It enables us to prove strong non-trivial time and space guarantees, which is generally considered very hard for this type of graph search problem. Through experimental studies we show that KARPET achieves running times in the order of milliseconds for tree patterns on large networks with millions of nodes and edges. △ Less

Submitted 10 April, 2018; v1 submitted 16 February, 2018; originally announced February 2018.

Comments: To appear in WWW 2018

arXiv:1612.04794 [pdf, other]

Algorithms for Automatic Ranking of Participants and Tasks in an Anonymized Contest

Authors: Yang Jiao, R. Ravi, Wolfgang Gatterbauer

Abstract: We introduce a new set of problems based on the Chain Editing problem. In our version of Chain Editing, we are given a set of anonymous participants and a set of undisclosed tasks that every participant attempts. For each participant-task pair, we know whether the participant has succeeded at the task or not. We assume that participants vary in their ability to solve tasks, and that tasks vary in… ▽ More We introduce a new set of problems based on the Chain Editing problem. In our version of Chain Editing, we are given a set of anonymous participants and a set of undisclosed tasks that every participant attempts. For each participant-task pair, we know whether the participant has succeeded at the task or not. We assume that participants vary in their ability to solve tasks, and that tasks vary in their difficulty to be solved. In an ideal world, stronger participants should succeed at a superset of tasks that weaker participants succeed at. Similarly, easier tasks should be completed successfully by a superset of participants who succeed at harder tasks. In reality, it can happen that a stronger participant fails at a task that a weaker participants succeeds at. Our goal is to find a perfect nesting of the participant-task relations by flip** a minimum number of participant-task relations, implying such a "nearest perfect ordering" to be the one that is closest to the truth of participant strengths and task difficulties. Many variants of the problem are known to be NP-hard. We propose six natural $k$-near versions of the Chain Editing problem and classify their complexity. The input to a $k$-near Chain Editing problem includes an initial ordering of the participants (or tasks) that we are required to respect by moving each participant (or task) at most $k$ positions from the initial ordering. We obtain surprising results on the complexity of the six $k$-near problems: Five of the problems are polynomial-time solvable using dynamic programming, but one of them is NP-hard. △ Less

Submitted 20 December, 2016; v1 submitted 14 December, 2016; originally announced December 2016.

Comments: 21 pages, 5 figures, preprint, full version of paper to appear in WALCOM 2017

arXiv:1512.00537 [pdf, other]

Fault-Tolerant Entity Resolution with the Crowd

Authors: Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, Donald Kossmann

Abstract: In recent years, crowdsourcing is increasingly applied as a means to enhance data quality. Although the crowd generates insightful information especially for complex problems such as entity resolution (ER), the output quality of crowd workers is often noisy. That is, workers may unintentionally generate false or contradicting data even for simple tasks. The challenge that we address in this paper… ▽ More In recent years, crowdsourcing is increasingly applied as a means to enhance data quality. Although the crowd generates insightful information especially for complex problems such as entity resolution (ER), the output quality of crowd workers is often noisy. That is, workers may unintentionally generate false or contradicting data even for simple tasks. The challenge that we address in this paper is how to minimize the cost for task requesters while maximizing ER result quality under the assumption of unreliable input from the crowd. For that purpose, we first establish how to deduce a consistent ER solution from noisy worker answers as part of the data interpretation problem. We then focus on the next-crowdsource problem which is to find the next task that maximizes the information gain of the ER result for the minimal additional cost. We compare our robust data interpretation strategies to alternative state-of-the-art approaches that do not incorporate the notion of fault-tolerance, i.e., the robustness to noise. In our experimental evaluation we show that our approaches yield a quality improvement of at least 20% for two real-world datasets. Furthermore, we examine task-to-worker assignment strategies as well as task parallelization techniques in terms of their cost and quality trade-offs in this paper. Based on both synthetic and crowdsourced datasets, we then draw conclusions on how to minimize cost while maintaining high quality ER results. △ Less

Submitted 1 December, 2015; originally announced December 2015.

arXiv:1507.00674 [pdf, other]

A Characterization of the Complexity of Resilience and Responsibility for Self-join-free Conjunctive Queries

Authors: Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou

Abstract: Several research thrusts in the area of data management have focused on understanding how changes in the data affect the output of a view or standing query. Example applications are explaining query results, propagating updates through views, and anonymizing datasets. These applications usually rely on understanding how interventions in a database impact the output of a query. An important aspect… ▽ More Several research thrusts in the area of data management have focused on understanding how changes in the data affect the output of a view or standing query. Example applications are explaining query results, propagating updates through views, and anonymizing datasets. These applications usually rely on understanding how interventions in a database impact the output of a query. An important aspect of this analysis is the problem of deleting a minimum number of tuples from the input tables to make a given Boolean query false. We refer to this problem as "the resilience of a query" and show its connections to the well-studied problems of deletion propagation and causal responsibility. In this paper, we study the complexity of resilience for self-join-free conjunctive queries, and also make several contributions to previous known results for the problems of deletion propagation with source side-effects and causal responsibility: (1) We define the notion of resilience and provide a complete dichotomy for the class of self-join-free conjunctive queries with arbitrary functional dependencies; this dichotomy also extends and generalizes previous tractability results on deletion propagation with source side-effects. (2) We formalize the connection between resilience and causal responsibility, and show that resilience has a larger class of tractable queries than responsibility. (3) We identify a mistake in a previous dichotomy for the problem of causal responsibility and offer a revised characterization based on new, simpler, and more intuitive notions. (4) Finally, we extend the dichotomy for causal responsibility in two ways: (a) we treat cases where the input tables contain functional dependencies, and (b) we compute responsibility for a set of tuples specified via wildcards. △ Less

Submitted 2 July, 2015; originally announced July 2015.

Comments: 36 pages, 13 figures

arXiv:1502.04956 [pdf, other]

The Linearization of Belief Propagation on Pairwise Markov Networks

Authors: Wolfgang Gatterbauer

Abstract: Belief Propagation (BP) is a widely used approximation for exact probabilistic inference in graphical models, such as Markov Random Fields (MRFs). In graphs with cycles, however, no exact convergence guarantees for BP are known, in general. For the case when all edges in the MRF carry the same symmetric, doubly stochastic potential, recent works have proposed to approximate BP by linearizing the u… ▽ More Belief Propagation (BP) is a widely used approximation for exact probabilistic inference in graphical models, such as Markov Random Fields (MRFs). In graphs with cycles, however, no exact convergence guarantees for BP are known, in general. For the case when all edges in the MRF carry the same symmetric, doubly stochastic potential, recent works have proposed to approximate BP by linearizing the update equations around default values, which was shown to work well for the problem of node classification. The present paper generalizes all prior work and derives an approach that approximates loopy BP on any pairwise MRF with the problem of solving a linear equation system. This approach combines exact convergence guarantees and a fast matrix implementation with the ability to model heterogenous networks. Experiments on synthetic graphs with planted edge potentials show that the linearization has comparable labeling accuracy as BP for graphs with weak potentials, while speeding-up inference by orders of magnitude. △ Less

Submitted 27 December, 2016; v1 submitted 17 February, 2015; originally announced February 2015.

Comments: Full version of AAAI 2017 paper with same title (23 pages, 9 figures)

arXiv:1412.3100 [pdf, other]

Semi-Supervised Learning with Heterophily

Authors: Wolfgang Gatterbauer

Abstract: We derive a family of linear inference algorithms that generalize existing graph-based label propagation algorithms by allowing them to propagate generalized assumptions about "attraction" or "compatibility" between classes of neighboring nodes (in particular those that involve heterophily between nodes where "opposites attract"). We thus call this formulation Semi-Supervised Learning with Heterop… ▽ More We derive a family of linear inference algorithms that generalize existing graph-based label propagation algorithms by allowing them to propagate generalized assumptions about "attraction" or "compatibility" between classes of neighboring nodes (in particular those that involve heterophily between nodes where "opposites attract"). We thus call this formulation Semi-Supervised Learning with Heterophily (SSLH) and show how it generalizes and improves upon a recently proposed approach called Linearized Belief Propagation (LinBP). Importantly, our framework allows us to reduce the problem of estimating the relative compatibility between nodes from partially labeled graph to a simple optimization problem. The result is a very fast algorithm that -- despite its simplicity -- is surprisingly effective: we can classify unlabeled nodes within the same graph in the same time as LinBP but with a superior accuracy and despite our algorithm not knowing the compatibilities. △ Less

Submitted 27 December, 2016; v1 submitted 9 December, 2014; originally announced December 2014.

Comments: 17 pages, 13 figures

arXiv:1412.1069 [pdf, other]

Approximate Lifted Inference with Probabilistic Databases

Authors: Wolfgang Gatterbauer, Dan Suciu

Abstract: This paper proposes a new approach for approximate evaluation of #P-hard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerat… ▽ More This paper proposes a new approach for approximate evaluation of #P-hard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known results of PTIME self-join-free conjunctive queries: A query is safe if and only if our algorithm returns one single plan. We also apply three relational query optimization techniques to evaluate all minimal safe plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers. △ Less

Submitted 2 December, 2014; originally announced December 2014.

Comments: 12 pages, 5 figures, pre-print for a paper appearing in VLDB 2015. arXiv admin note: text overlap with arXiv:1310.6257

arXiv:1409.6052 [pdf, other]

doi 10.1145/2532641

Oblivious Bounds on the Probability of Boolean Functions

Authors: Wolfgang Gatterbauer, Dan Suciu

Abstract: This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e. when the new probabilities are chosen independent of the probabilities of all other variables. Our mot… ▽ More This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e. when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute, into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time. △ Less

Submitted 21 September, 2014; originally announced September 2014.

Comments: 34 pages, 14 figures, supersedes: http://arxiv.longhoe.net/abs/1105.2813

Journal ref: Pre-print for ACM Transactions on Database Systems, January 2014, Vol 39, No 1, Article 5

arXiv:1406.7288 [pdf, other]

Linearized and Single-Pass Belief Propagation

Authors: Wolfgang Gatterbauer, Stephan Günnemann, Danai Koutra, Christos Faloutsos

Abstract: How can we tell when accounts are fake or real in a social network? And how can we tell which accounts belong to liberal, conservative or centrist users? Often, we can answer such questions and label nodes in a network based on the labels of their neighbors and appropriate assumptions of homophily ("birds of a feather flock together") or heterophily ("opposites attract"). One of the most widely us… ▽ More How can we tell when accounts are fake or real in a social network? And how can we tell which accounts belong to liberal, conservative or centrist users? Often, we can answer such questions and label nodes in a network based on the labels of their neighbors and appropriate assumptions of homophily ("birds of a feather flock together") or heterophily ("opposites attract"). One of the most widely used methods for this kind of inference is Belief Propagation (BP) which iteratively propagates the information from a few nodes with explicit labels throughout a network until convergence. One main problem with BP, however, is that there are no known exact guarantees of convergence in graphs with loops. This paper introduces Linearized Belief Propagation (LinBP), a linearization of BP that allows a closed-form solution via intuitive matrix equations and, thus, comes with convergence guarantees. It handles homophily, heterophily, and more general cases that arise in multi-class settings. Plus, it allows a compact implementation in SQL. The paper also introduces Single-pass Belief Propagation (SBP), a "localized" version of LinBP that propagates information across every edge at most once and for which the final class assignments depend only on the nearest labeled neighbors. In addition, SBP allows fast incremental updates in dynamic networks. Our runtime experiments show that LinBP and SBP are orders of magnitude faster than standard △ Less

Submitted 16 October, 2014; v1 submitted 27 June, 2014; originally announced June 2014.

Comments: 17 pages, 11 figures, 4 algorithms. Includes following major changes since v1: renaming of "turbo BP" to "single-pass BP", convergence criteria now give sufficient *and* necessary conditions, more detailed experiments, more detailed comparison with prior BP convergence results, overall improved exposition

arXiv:1310.6257 [pdf, other]

Dissociation and Propagation for Approximate Lifted Inference with Standard Relational Database Management Systems

Authors: Wolfgang Gatterbauer, Dan Suciu

Abstract: Probabilistic inference over large data sets is a challenging data management problem since exact inference is generally #P-hard and is most often solved approximately with sampling-based methods today. This paper proposes an alternative approach for approximate evaluation of conjunctive queries with standard relational databases: In our approach, every query is evaluated entirely in the database… ▽ More Probabilistic inference over large data sets is a challenging data management problem since exact inference is generally #P-hard and is most often solved approximately with sampling-based methods today. This paper proposes an alternative approach for approximate evaluation of conjunctive queries with standard relational databases: In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known PTIME self-join-free conjunctive queries: A query is in PTIME if and only if our algorithm returns one single plan. Furthermore, our approach is a generalization of a family of efficient ranking methods from graphs to hypergraphs. We also adapt three relational query optimization techniques to evaluate all necessary plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases. △ Less

Submitted 14 June, 2016; v1 submitted 23 October, 2013; originally announced October 2013.

Comments: 33 pages, 27 figures, pre-print for VLDBJ full version of arXiv:1412.1069 [PVLDB 8(5):629-640, 2015: "Approximate lifted inference with probabilistic databases", http://www.vldb.org/pvldb/vol8/p629-gatterbauer.pdf ]. Former working title: "Dissociation and Propagation for Efficient Query Evaluation over Probabilistic Databases"

arXiv:1105.4395 [pdf, other]

Default-all is dangerous!

Authors: Wolfgang Gatterbauer, Alexandra Meliou, Dan Suciu

Abstract: We show that the default-all propagation scheme for database annotations is dangerous. Dangerous here means that it can propagate annotations to the query output which are semantically irrelevant to the query the user asked. This is the result of considering all relationally equivalent queries and returning the union of their where-provenance in an attempt to define a propagation scheme that is in… ▽ More We show that the default-all propagation scheme for database annotations is dangerous. Dangerous here means that it can propagate annotations to the query output which are semantically irrelevant to the query the user asked. This is the result of considering all relationally equivalent queries and returning the union of their where-provenance in an attempt to define a propagation scheme that is insensitive to query rewriting. We propose an alternative query-rewrite-insensitive (QRI) where-provenance called minimum propagation. It is analogous to the minimum witness basis for why-provenance, straight-forward to evaluate, and returns all relevant and only relevant annotations. △ Less

Submitted 22 May, 2011; originally announced May 2011.

Comments: 4 pages, 6 figures, preprint of paper appearing in Proceedings of TaPP '11 (3rd USENIX Workshop on the Theory and Practice of Provenance); for details see the project page: http://db.cs.washington.edu/causality/

arXiv:1105.2813 [pdf, other]

Optimal Upper and Lower Bounds for Boolean Expressions by Dissociation

Authors: Wolfgang Gatterbauer, Dan Suciu

Abstract: This paper develops upper and lower bounds for the probability of Boolean expressions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. Our technique generalizes and extends the underlying idea of a number of recent approaches which are varyingly called node splitting, variable renaming, variable splitting, or dissociation for probabilist… ▽ More This paper develops upper and lower bounds for the probability of Boolean expressions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. Our technique generalizes and extends the underlying idea of a number of recent approaches which are varyingly called node splitting, variable renaming, variable splitting, or dissociation for probabilistic databases. We prove that the probabilities we assign to new variables are the best possible in some sense. △ Less

Submitted 13 May, 2011; originally announced May 2011.

Comments: 7 pages, 2 figures; for details see the project page: http://LaPushDB.com/

arXiv:1012.3502 [pdf, other]

Rules of Thumb for Information Acquisition from Large and Redundant Data

Authors: Wolfgang Gatterbauer

Abstract: We develop an abstract model of information acquisition from redundant data. We assume a random sampling process from data which provide information with bias and are interested in the fraction of information we expect to learn as function of (i) the sampled fraction (recall) and (ii) varying bias of information (redundancy distributions). We develop two rules of thumb with varying robustness. We… ▽ More We develop an abstract model of information acquisition from redundant data. We assume a random sampling process from data which provide information with bias and are interested in the fraction of information we expect to learn as function of (i) the sampled fraction (recall) and (ii) varying bias of information (redundancy distributions). We develop two rules of thumb with varying robustness. We first show that, when information bias follows a Zipf distribution, the 80-20 rule or Pareto principle does surprisingly not hold, and we rather expect to learn less than 40% of the information when randomly sampling 20% of the overall data. We then analytically prove that for large data sets, randomized sampling from power-law distributions leads to "truncated distributions" with the same power-law exponent. This second rule is very robust and also holds for distributions that deviate substantially from a strict power law. We further give one particular family of powerlaw functions that remain completely invariant under sampling. Finally, we validate our model with two large Web data sets: link distributions to domains and tag distributions on delicious.com. △ Less

Submitted 15 December, 2010; originally announced December 2010.

Comments: 40 pages, 17 figures; for details see the project page: http://uniquerecall.com

Journal ref: Full version of upcoming ECIR 2011 conference paper

arXiv:1012.3320 [pdf, other]

doi 10.1145/1807167.1807193

Data Conflict Resolution Using Trust Map**s

Authors: Wolfgang Gatterbauer, Dan Suciu

Abstract: In massively collaborative projects such as scientific or community databases, users often need to agree or disagree on the content of individual data items. On the other hand, trust relationships often exist between users, allowing them to accept or reject other users' beliefs by default. As those trust relationships become complex, however, it becomes difficult to define and compute a consistent… ▽ More In massively collaborative projects such as scientific or community databases, users often need to agree or disagree on the content of individual data items. On the other hand, trust relationships often exist between users, allowing them to accept or reject other users' beliefs by default. As those trust relationships become complex, however, it becomes difficult to define and compute a consistent snapshot of the conflicting information. Previous solutions to a related problem, the update reconciliation problem, are dependent on the order in which the updates are processed and, therefore, do not guarantee a globally consistent snapshot. This paper proposes the first principled solution to the automatic conflict resolution problem in a community database. Our semantics is based on the certain tuples of all stable models of a logic program. While evaluating stable models in general is well known to be hard, even for very simple logic programs, we show that the conflict resolution problem admits a PTIME solution. To the best of our knowledge, ours is the first PTIME algorithm that allows conflict resolution in a principled way. We further discuss extensions to negative beliefs and prove that some of these extensions are hard. This work is done in the context of the BeliefDB project at the University of Washington, which focuses on the efficient management of conflicts in community databases. △ Less

Submitted 15 December, 2010; originally announced December 2010.

Comments: 20 pages, 19 figures

Report number: University of Washington CSE Technical Report 09-11-01

Journal ref: Full version of SIGMOD 2010 conference paper, pp. 219-230

arXiv:1009.2021 [pdf, other]

The Complexity of Causality and Responsibility for Query Answers and non-Answers

Authors: Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu

Abstract: An answer to a query has a well-defined lineage expression (alternatively called how-provenance) that explains how the answer was derived. Recent work has also shown how to compute the lineage of a non-answer to a query. However, the cause of an answer or non-answer is a more subtle notion and consists, in general, of only a fragment of the lineage. In this paper, we adapt Halpern, Pearl, and Choc… ▽ More An answer to a query has a well-defined lineage expression (alternatively called how-provenance) that explains how the answer was derived. Recent work has also shown how to compute the lineage of a non-answer to a query. However, the cause of an answer or non-answer is a more subtle notion and consists, in general, of only a fragment of the lineage. In this paper, we adapt Halpern, Pearl, and Chockler's recent definitions of causality and responsibility to define the causes of answers and non-answers to queries, and their degree of responsibility. Responsibility captures the notion of degree of causality and serves to rank potentially many causes by their relative contributions to the effect. Then, we study the complexity of computing causes and responsibilities for conjunctive queries. It is known that computing causes is NP-complete in general. Our first main result shows that all causes to conjunctive queries can be computed by a relational query which may involve negation. Thus, causality can be computed in PTIME, and very efficiently so. Next, we study computing responsibility. Here, we prove that the complexity depends on the conjunctive query and demonstrate a dichotomy between PTIME and NP-complete cases. For the PTIME cases, we give a non-trivial algorithm, consisting of a reduction to the max-flow computation problem. Finally, we prove that, even when it is in PTIME, responsibility is complete for LOGSPACE, implying that, unlike causality, it cannot be computed by a relational query. △ Less

Submitted 29 September, 2011; v1 submitted 10 September, 2010; originally announced September 2010.

Comments: 15 pages, 12 figures, PVLDB 2011

arXiv:0912.5340 [pdf, other]

Why so? or Why no? Functional Causality for Explaining Query Answers

Authors: Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu

Abstract: In this paper, we propose causality as a unified framework to explain query answers and non-answers, thus generalizing and extending several previously proposed approaches of provenance and missing query result explanations. We develop our framework starting from the well-studied definition of actual causes by Halpern and Pearl. After identifying some undesirable characteristics of the origina… ▽ More In this paper, we propose causality as a unified framework to explain query answers and non-answers, thus generalizing and extending several previously proposed approaches of provenance and missing query result explanations. We develop our framework starting from the well-studied definition of actual causes by Halpern and Pearl. After identifying some undesirable characteristics of the original definition, we propose functional causes as a refined definition of causality with several desirable properties. These properties allow us to apply our notion of causality in a database context and apply it uniformly to define the causes of query results and their individual contributions in several ways: (i) we can model both provenance as well as non-answers, (ii) we can define explanations as either data in the input relations or relational operations in a query plan, and (iii) we can give graded degrees of responsibility to individual causes, thus allowing us to rank causes. In particular, our approach allows us to explain contributions to relational aggregate functions and to rank causes according to their respective responsibilities. We give complexity results and describe polynomial algorithms for evaluating causality in tractable cases. Throughout the paper, we illustrate the applicability of our framework with several examples. Overall, we develop in this paper the theoretical foundations of causality theory in a database context. △ Less

Submitted 29 December, 2009; originally announced December 2009.

Comments: 18 pages, 15 figures

Report number: University of Washington CSE Technical Report 09-12-01 ACM Class: H.2.1

arXiv:0912.5241 [pdf, other]

Believe It or Not: Adding Belief Annotations to Databases

Authors: Wolfgang Gatterbauer, Magdalena Balazinska, Nodira Khoussainova, Dan Suciu

Abstract: We propose a database model that allows users to annotate data with belief statements. Our motivation comes from scientific database applications where a community of users is working together to assemble, revise, and curate a shared data repository. As the community accumulates knowledge and the database content evolves over time, it may contain conflicting information and members can disagree… ▽ More We propose a database model that allows users to annotate data with belief statements. Our motivation comes from scientific database applications where a community of users is working together to assemble, revise, and curate a shared data repository. As the community accumulates knowledge and the database content evolves over time, it may contain conflicting information and members can disagree on the information it should store. For example, Alice may believe that a tuple should be in the database, whereas Bob disagrees. He may also insert the reason why he thinks Alice believes the tuple should be in the database, and explain what he thinks the correct tuple should be instead. We propose a formal model for Belief Databases that interprets users' annotations as belief statements. These annotations can refer both to the base data and to other annotations. We give a formal semantics based on a fragment of multi-agent epistemic logic and define a query language over belief databases. We then prove a key technical result, stating that every belief database can be encoded as a canonical Kripke structure. We use this structure to describe a relational representation of belief databases, and give an algorithm for translating queries over the belief database into standard relational queries. Finally, we report early experimental results with our prototype implementation on synthetic data. △ Less

Submitted 30 December, 2009; originally announced December 2009.

Comments: 17 pages, 10 figures

Report number: University of Washington CSE Technical Report 08-12-01 ACM Class: H.2.1

Journal ref: Full version of: VLDB 2009 conference version; PVLDB 2(1):1-12 (2009)

arXiv:0909.1778 [pdf]

A Case for A Collaborative Query Management System

Authors: Nodira Khoussainova, Magda Balazinska, Wolfgang Gatterbauer, YongChul Kwon, Dan Suciu

Abstract: Over the past 40 years, database management systems (DBMSs) have evolved to provide a sophisticated variety of data management capabilities. At the same time, tools for managing queries over the data have remained relatively primitive. One reason for this is that queries are typically issued through applications. They are thus debugged once and re-used repeatedly. This mode of interaction, howev… ▽ More Over the past 40 years, database management systems (DBMSs) have evolved to provide a sophisticated variety of data management capabilities. At the same time, tools for managing queries over the data have remained relatively primitive. One reason for this is that queries are typically issued through applications. They are thus debugged once and re-used repeatedly. This mode of interaction, however, is changing. As scientists (and others) store and share increasingly large volumes of data in data centers, they need the ability to analyze the data by issuing exploratory queries. In this paper, we argue that, in these new settings, data management systems must provide powerful query management capabilities, from query browsing to automatic query recommendations. We first discuss the requirements for a collaborative query management system. We outline an early system architecture and discuss the many research challenges associated with building such an engine. △ Less

Submitted 9 September, 2009; originally announced September 2009.

Comments: CIDR 2009

Showing 1–40 of 40 results for author: Gatterbauer, W