-
Qr-Hint: Actionable Hints Towards Correcting Wrong SQL Queries
Authors:
Yihao Hu,
Amir Gilad,
Kristin Stephens-Martinez,
Sudeepa Roy,
Jun Yang
Abstract:
We describe a system called Qr-Hint that, given a (correct) target query Q* and a (wrong) working query Q, both expressed in SQL, provides actionable hints for the user to fix the working query so that it becomes semantically equivalent to the target. It is particularly useful in an educational setting, where novices can receive help from Qr-Hint without requiring extensive personal tutoring. Sinc…
▽ More
We describe a system called Qr-Hint that, given a (correct) target query Q* and a (wrong) working query Q, both expressed in SQL, provides actionable hints for the user to fix the working query so that it becomes semantically equivalent to the target. It is particularly useful in an educational setting, where novices can receive help from Qr-Hint without requiring extensive personal tutoring. Since there are many different ways to write a correct query, we do not want to base our hints completely on how Q* is written; instead, starting with the user's own working query, Qr-Hint purposefully guides the user through a sequence of steps that provably lead to a correct query, which will be equivalent to Q* but may still "look" quite different from it. Ideally, we would like Qr-Hint's hints to lead to the "smallest" possible corrections to Q. However, optimality is not always achievable in this case due to some foundational hurdles such as the undecidability of SQL query equivalence and the complexity of logic minimization. Nonetheless, by carefully decomposing and formulating the problems and develo** principled solutions, we are able to provide provably correct and locally optimal hints through Qr-Hint. We show the effectiveness of Qr-Hint through quality and performance experiments as well as a user study in an educational setting.
△ Less
Submitted 5 April, 2024;
originally announced April 2024.
-
Stability Analysis of Non-Linear Classifiers using Gene Regulatory Neural Network for Biological AI
Authors:
Adrian Ratwatte,
Samitha Somathilaka,
Sasitharan Balasubramaniam,
Assaf A. Gilad
Abstract:
The Gene Regulatory Network (GRN) of biological cells governs a number of key functionalities that enables them to adapt and survive through different environmental conditions. Close observation of the GRN shows that the structure and operational principles resembles an Artificial Neural Network (ANN), which can pave the way for the development of Biological Artificial Intelligence. In particular,…
▽ More
The Gene Regulatory Network (GRN) of biological cells governs a number of key functionalities that enables them to adapt and survive through different environmental conditions. Close observation of the GRN shows that the structure and operational principles resembles an Artificial Neural Network (ANN), which can pave the way for the development of Biological Artificial Intelligence. In particular, a gene's transcription and translation process resembles a sigmoidal-like property based on transcription factor inputs. In this paper, we develop a mathematical model of gene-perceptron using a dual-layered transcription-translation chemical reaction model, enabling us to transform a GRN into a Gene Regulatory Neural Network (GRNN). We perform stability analysis for each gene-perceptron within the fully-connected GRNN sub network to determine temporal as well as stable concentration outputs that will result in reliable computing performance. We focus on a non-linear classifier application for the GRNN, where we analyzed generic multi-layer GRNNs as well as E.Coli GRNN that is derived from trans-omic experimental data. Our analysis found that varying the parameters of the chemical reactions can allow us shift the boundaries of the classification region, laying the platform for programmable GRNNs that suit diverse application requirements.
△ Less
Submitted 14 September, 2023;
originally announced October 2023.
-
DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms
Authors:
Shweta Patwa,
Danyu Sun,
Amir Gilad,
Ashwin Machanavajjhala,
Sudeepa Roy
Abstract:
Synthetic data generation methods, and in particular, private synthetic data generation methods, are gaining popularity as a means to make copies of sensitive databases that can be shared widely for research and data analysis. Some of the fundamental operations in data analysis include analyzing aggregated statistics, e.g., count, sum, or median, on a subset of data satisfying some conditions. Whe…
▽ More
Synthetic data generation methods, and in particular, private synthetic data generation methods, are gaining popularity as a means to make copies of sensitive databases that can be shared widely for research and data analysis. Some of the fundamental operations in data analysis include analyzing aggregated statistics, e.g., count, sum, or median, on a subset of data satisfying some conditions. When synthetic data is generated, users may be interested in knowing if their aggregated queries generating such statistics can be reliably answered on the synthetic data, for instance, to decide if the synthetic data is suitable for specific tasks. However, the standard data generation systems do not provide "per-query" quality guarantees on the synthetic data, and the users have no way of knowing how much the aggregated statistics on the synthetic data can be trusted. To address this problem, we present a novel framework named DP-PQD (differentially-private per-query decider) to detect if the query answers on the private and synthetic datasets are within a user-specified threshold of each other while guaranteeing differential privacy. We give a suite of private algorithms for per-query deciders for count, sum, and median queries, analyze their properties, and evaluate them experimentally.
△ Less
Submitted 15 September, 2023;
originally announced September 2023.
-
The Consistency of Probabilistic Databases with Independent Cells
Authors:
Amir Gilad,
Aviram Imber,
Benny Kimelfeld
Abstract:
A probabilistic database with attribute-level uncertainty consists of relations where cells of some attributes may hold probability distributions rather than deterministic content. Such databases arise, implicitly or explicitly, in the context of noisy operations such as missing data imputation, where we automatically fill in missing values, column prediction, where we predict unknown attributes,…
▽ More
A probabilistic database with attribute-level uncertainty consists of relations where cells of some attributes may hold probability distributions rather than deterministic content. Such databases arise, implicitly or explicitly, in the context of noisy operations such as missing data imputation, where we automatically fill in missing values, column prediction, where we predict unknown attributes, and database cleaning (and repairing), where we replace the original values due to detected errors or violation of integrity constraints. We study the computational complexity of problems that regard the selection of cell values in the presence of integrity constraints. More precisely, we focus on functional dependencies and study three problems: (1) deciding whether the constraints can be satisfied by any choice of values, (2) finding a most probable such choice, and (3) calculating the probability of satisfying the constraints. The data complexity of these problems is determined by the combination of the set of functional dependencies and the collection of uncertain attributes. We give full classifications into tractable and intractable complexities for several classes of constraints, including a single dependency, matching constraints, and unary functional dependencies.
△ Less
Submitted 22 December, 2022;
originally announced December 2022.
-
PreFair: Privately Generating Justifiably Fair Synthetic Data
Authors:
David Pujol,
Amir Gilad,
Ashwin Machanavajjhala
Abstract:
When a database is protected by Differential Privacy (DP), its usability is limited in scope. In this scenario, generating a synthetic version of the data that mimics the properties of the private data allows users to perform any operation on the synthetic data, while maintaining the privacy of the original data. Therefore, multiple works have been devoted to devising systems for DP synthetic data…
▽ More
When a database is protected by Differential Privacy (DP), its usability is limited in scope. In this scenario, generating a synthetic version of the data that mimics the properties of the private data allows users to perform any operation on the synthetic data, while maintaining the privacy of the original data. Therefore, multiple works have been devoted to devising systems for DP synthetic data generation. However, such systems may preserve or even magnify properties of the data that make it unfair, endering the synthetic data unfit for use. In this work, we present PreFair, a system that allows for DP fair synthetic data generation. PreFair extends the state-of-the-art DP data generation mechanisms by incorporating a causal fairness criterion that ensures fair synthetic data. We adapt the notion of justifiable fairness to fit the synthetic data generation scenario. We further study the problem of generating DP fair synthetic data, showing its intractability and designing algorithms that are optimal under certain assumptions. We also provide an extensive experimental evaluation, showing that PreFair generates synthetic data that is significantly fairer than the data generated by leading DP data generation mechanisms, while remaining faithful to the private data.
△ Less
Submitted 27 March, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
FEDEX: An Explainability Framework for Data Exploration Steps
Authors:
Daniel Deutch,
Amir Gilad,
Tova Milo,
Amit Mualem,
Amit Somech
Abstract:
When exploring a new dataset, Data Scientists often apply analysis queries, look for insights in the resulting dataframe, and repeat to apply further queries. We propose in this paper a novel solution that assists data scientists in this laborious process. In a nutshell, our solution pinpoints the most interesting (sets of) rows in each obtained dataframe. Uniquely, our definition of interest is b…
▽ More
When exploring a new dataset, Data Scientists often apply analysis queries, look for insights in the resulting dataframe, and repeat to apply further queries. We propose in this paper a novel solution that assists data scientists in this laborious process. In a nutshell, our solution pinpoints the most interesting (sets of) rows in each obtained dataframe. Uniquely, our definition of interest is based on the contribution of each row to the interestingness of different columns of the entire dataframe, which, in turn, is defined using standard measures such as diversity and exceptionality. Intuitively, interesting rows are ones that explain why (some column of) the analysis query result is interesting as a whole. Rows are correlated in their contribution and so the interesting score for a set of rows may not be directly computed based on that of individual rows. We address the resulting computational challenge by restricting attention to semantically-related sets, based on multiple notions of semantic relatedness; these sets serve as more informative explanations. Our experimental study across multiple real-world datasets shows the usefulness of our system in various scenarios.
△ Less
Submitted 13 September, 2022;
originally announced September 2022.
-
DPXPlain: Privately Explaining Aggregate Query Answers
Authors:
Yuchao Tao,
Amir Gilad,
Ashwin Machanavajjhala,
Sudeepa Roy
Abstract:
Differential privacy (DP) is the state-of-the-art and rigorous notion of privacy for answering aggregate database queries while preserving the privacy of sensitive information in the data. In today's era of data analysis, however, it poses new challenges for users to understand the trends and anomalies observed in the query results: Is the unexpected answer due to the data itself, or is it due to…
▽ More
Differential privacy (DP) is the state-of-the-art and rigorous notion of privacy for answering aggregate database queries while preserving the privacy of sensitive information in the data. In today's era of data analysis, however, it poses new challenges for users to understand the trends and anomalies observed in the query results: Is the unexpected answer due to the data itself, or is it due to the extra noise that must be added to preserve DP? In the second case, even the observation made by the users on query results may be wrong. In the first case, can we still mine interesting explanations from the sensitive data while protecting its privacy? To address these challenges, we present a three-phase framework DPXPlain, which is the first system to the best of our knowledge for explaining group-by aggregate query answers with DP. In its three phases, DPXPlain (a) answers a group-by aggregate query with DP, (b) allows users to compare aggregate values of two groups and with high probability assesses whether this comparison holds or is flipped by the DP noise, and (c) eventually provides an explanation table containing the approximately `top-k' explanation predicates along with their relative influences and ranks in the form of confidence intervals, while guaranteeing DP in all steps. We perform an extensive experimental analysis of DPXPlain with multiple use-cases on real and synthetic data showing that DPXPlain efficiently provides insightful explanations with good accuracy and utility.
△ Less
Submitted 2 September, 2022;
originally announced September 2022.
-
HypeR: Hypothetical Reasoning With What-If and How-To Queries Using a Probabilistic Causal Approach
Authors:
Sainyam Galhotra,
Amir Gilad,
Sudeepa Roy,
Babak Salimi
Abstract:
What-if (provisioning for an update to a database) and how-to (how to modify the database to achieve a goal) analyses provide insights to users who wish to examine hypothetical scenarios without making actual changes to a database and thereby help plan strategies in their fields. Typically, such analyses are done by testing the effect of an update in the existing database on a specific view create…
▽ More
What-if (provisioning for an update to a database) and how-to (how to modify the database to achieve a goal) analyses provide insights to users who wish to examine hypothetical scenarios without making actual changes to a database and thereby help plan strategies in their fields. Typically, such analyses are done by testing the effect of an update in the existing database on a specific view created by a query of interest. In real-world scenarios, however, an update to a particular part of the database may affect tuples and attributes in a completely different part due to implicit semantic dependencies. To allow for hypothetical reasoning while accommodating such dependencies, we develop HypeR, a framework that supports what-if and how-to queries accounting for probabilistic dependencies among attributes captured by a probabilistic causal model. We extend the SQL syntax to include the necessary operators for expressing these hypothetical queries, define their semantics, devise efficient algorithms and optimizations to compute their results using concepts from causality and probabilistic databases, and evaluate the effectiveness of our approach experimentally.
△ Less
Submitted 28 March, 2022;
originally announced March 2022.
-
Understanding Queries by Conditional Instances
Authors:
Amir Gilad,
Zhengjie Miao,
Sudeepa Roy,
Jun Yang
Abstract:
A powerful way to understand a complex query is by observing how it operates on data instances. However, specific database instances are not ideal for such observations: they often include large amounts of superfluous details that are not only irrelevant to understanding the query but also cause cognitive overload; and one specific database may not be enough. Given a relational query, is it possib…
▽ More
A powerful way to understand a complex query is by observing how it operates on data instances. However, specific database instances are not ideal for such observations: they often include large amounts of superfluous details that are not only irrelevant to understanding the query but also cause cognitive overload; and one specific database may not be enough. Given a relational query, is it possible to provide a simple and generic "representative" instance that (1) illustrates how the query can be satisfied, (2) summarizes all specific instances that would satisfy the query in the same way by abstracting away unnecessary details? Furthermore, is it possible to find a collection of such representative instances that together completely characterize all possible ways in which the query can be satisfied? This paper takes initial steps towards answering these questions. We design what these representative instances look like, define what they stand for, and formalize what it means for them to satisfy a query in "all possible ways." We argue that this problem is undecidable for general domain relational calculus queries, and develop practical algorithms for computing a minimum collection of such instances subject to other constraints. We evaluate the efficiency of our approach experimentally, and show its effectiveness in hel** users debug relational queries through a user study.
△ Less
Submitted 22 February, 2022;
originally announced February 2022.
-
Using Genetic Programming to Predict and Optimize Protein Function
Authors:
Iliya Miralavy,
Alexander Bricco,
Assaf Gilad,
Wolfgang Banzhaf
Abstract:
Protein engineers conventionally use tools such as Directed Evolution to find new proteins with better functionalities and traits. More recently, computational techniques and especially machine learning approaches have been recruited to assist Directed Evolution, showing promising results. In this paper, we propose POET, a computational Genetic Programming tool based on evolutionary computation me…
▽ More
Protein engineers conventionally use tools such as Directed Evolution to find new proteins with better functionalities and traits. More recently, computational techniques and especially machine learning approaches have been recruited to assist Directed Evolution, showing promising results. In this paper, we propose POET, a computational Genetic Programming tool based on evolutionary computation methods to enhance screening and mutagenesis in Directed Evolution and help protein engineers to find proteins that have better functionality. As a proof-of-concept we use peptides that generate MRI contrast detected by the Chemical Exchange Saturation Transfer contrast mechanism. The evolutionary methods used in POET are described, and the performance of POET in different epochs of our experiments with Chemical Exchange Saturation Transfer contrast are studied. Our results indicate that a computational modelling tool like POET can help to find peptides with 400% better functionality than used before.
△ Less
Submitted 22 February, 2022; v1 submitted 8 February, 2022;
originally announced February 2022.
-
Heterogeneous Treatment Effects in Social Networks
Authors:
Amir Gilad,
Harsh Parikh,
Sudeepa Roy,
Babak Salimi
Abstract:
We study treatment effect modifiers for causal analysis in a social network, where neighbors' characteristics or network structure may affect the outcome of a unit, and the goal is to identify sub-populations with varying treatment effects using such network properties. We propose a novel framework for this purpose that facilitates data-driven decision making by testing hypotheses about complex ef…
▽ More
We study treatment effect modifiers for causal analysis in a social network, where neighbors' characteristics or network structure may affect the outcome of a unit, and the goal is to identify sub-populations with varying treatment effects using such network properties. We propose a novel framework for this purpose that facilitates data-driven decision making by testing hypotheses about complex effect modifiers in terms of network features or network patterns (e.g., characteristics of neighbors of a unit or belonging to a triangle), and by identifying sub-populations for which a treatment is likely to be effective or harmful. We describe a hypothesis testing approach that accounts for a unit's covariates, their neighbors' covariates, and patterns in the social network, and devise an algorithm incorporating ideas from causal inference, hypothesis testing, and graph theory to verify a hypothesized effect modifier. In addition, we develop a novel algorithm for the discovery of network patterns that are potential effect modifiers. We perform extensive experimental evaluations with a real development economics dataset about the treatment effect of belonging to a financial support network called self-help groups on risk tolerance, and also with a synthetic dataset with known ground truths simulating a vaccine efficacy trial, to evaluate our framework and algorithms.
△ Less
Submitted 7 November, 2021; v1 submitted 21 May, 2021;
originally announced May 2021.
-
Synthesizing Linked Data Under Cardinality and Integrity Constraints
Authors:
Amir Gilad,
Shweta Patwa,
Ashwin Machanavajjhala
Abstract:
The generation of synthetic data is useful in multiple aspects, from testing applications to benchmarking to privacy preservation. Generating the links between relations, subject to cardinality constraints (CCs) and integrity constraints (ICs) is an important aspect of this problem. Given instances of two relations, where one has a foreign key dependence on the other and is missing its foreign key…
▽ More
The generation of synthetic data is useful in multiple aspects, from testing applications to benchmarking to privacy preservation. Generating the links between relations, subject to cardinality constraints (CCs) and integrity constraints (ICs) is an important aspect of this problem. Given instances of two relations, where one has a foreign key dependence on the other and is missing its foreign key ($FK$) values, and two types of constraints: (1) CCs that apply to the join view and (2) ICs that apply to the table with missing $FK$ values, our goal is to impute the missing $FK$ values such that the constraints are satisfied. We provide a novel framework for the problem based on declarative CCs and ICs. We further show that the problem is NP-hard and propose a novel two-phase solution that guarantees the satisfaction of the ICs. Phase I yields an intermediate solution accounting for the CCs alone, and relies on a hybrid approach based on CC types. For one type, the problem is modeled as an Integer Linear Program. For the others, we describe an efficient and accurate solution. We then combine the two solutions. Phase II augments this solution by incorporating the ICs and uses a coloring of the conflict hypergraph to infer the values of the $FK$ column. Our extensive experimental study shows that our solution scales well when the data and number of constraints increases. We further show that our solution maintains low error rates for the CCs.
△ Less
Submitted 26 March, 2021;
originally announced March 2021.
-
On Optimizing the Trade-off between Privacy and Utility in Data Provenance
Authors:
Daniel Deutch,
Ariel Frankenthal,
Amir Gilad,
Yuval Moskovitch
Abstract:
Organizations that collect and analyze data may wish or be mandated by regulation to justify and explain their analysis results. At the same time, the logic that they have followed to analyze the data, i.e., their queries, may be proprietary and confidential. Data provenance, a record of the transformations that data underwent, was extensively studied as means of explanations. In contrast, only a…
▽ More
Organizations that collect and analyze data may wish or be mandated by regulation to justify and explain their analysis results. At the same time, the logic that they have followed to analyze the data, i.e., their queries, may be proprietary and confidential. Data provenance, a record of the transformations that data underwent, was extensively studied as means of explanations. In contrast, only a few works have studied the tension between disclosing provenance and hiding the underlying query.
This tension is the focus of the present paper, where we formalize and explore for the first time the tradeoff between the utility of presenting provenance information and the breach of privacy it poses with respect to the underlying query. Intuitively, our formalization is based on the notion of provenance abstraction, where the representation of some tuples in the provenance expressions is abstracted in a way that makes multiple tuples indistinguishable. The privacy of a chosen abstraction is then measured based on how many queries match the obfuscated provenance, in the same vein as k-anonymity. The utility is measured based on the entropy of the abstraction, intuitively how much information is lost with respect to the actual tuples participating in the provenance. Our formalization yields a novel optimization problem of choosing the best abstraction in terms of this tradeoff. We show that the problem is intractable in general, but design greedy heuristics that exploit the provenance structure towards a practically efficient exploration of the search space. We experimentally prove the effectiveness of our solution using the TPC-H benchmark and the IMDB dataset.
△ Less
Submitted 27 February, 2021;
originally announced March 2021.
-
Towards Inferring Queries from Simple and Partial Provenance Examples
Authors:
Amir Gilad,
Yuval Moskovitch
Abstract:
The field of query-by-example aims at inferring queries from output examples given by non-expert users, by finding the underlying logic that binds the examples. However, for a very small set of examples, it is difficult to correctly infer such logic. To bridge this gap, previous work suggested attaching explanations to each output example, modeled as provenance, allowing users to explain the reaso…
▽ More
The field of query-by-example aims at inferring queries from output examples given by non-expert users, by finding the underlying logic that binds the examples. However, for a very small set of examples, it is difficult to correctly infer such logic. To bridge this gap, previous work suggested attaching explanations to each output example, modeled as provenance, allowing users to explain the reason behind their choice of example. In this paper, we explore the problem of inferring queries from a few output examples and intuitive explanations. We propose a two step framework: (1) convert the explanations into (partial) provenance and (2) infer a query that generates the output examples using a novel algorithm that employs a graph based approach. This framework is suitable for non-experts as it does not require the specification of the provenance in its entirety or an understanding of its structure. We show promising initial experimental results of our approach.
△ Less
Submitted 20 August, 2020;
originally announced August 2020.
-
Explaining Natural Language Query Results
Authors:
Daniel Deutch,
Nave Frost,
Amir Gilad
Abstract:
Multiple lines of research have developed Natural Language (NL) interfaces for formulating database queries. We build upon this work, but focus on presenting a highly detailed form of the answers in NL. The answers that we present are importantly based on the provenance of tuples in the query result, detailing not only the results but also their explanations. We develop a novel method for transfor…
▽ More
Multiple lines of research have developed Natural Language (NL) interfaces for formulating database queries. We build upon this work, but focus on presenting a highly detailed form of the answers in NL. The answers that we present are importantly based on the provenance of tuples in the query result, detailing not only the results but also their explanations. We develop a novel method for transforming provenance information to NL, by leveraging the original NL query structure. Furthermore, since provenance information is typically large and complex, we present two solutions for its effective presentation as NL text: one that is based on provenance factorization, with novel desiderata relevant to the NL case, and one that is based on summarization. We have implemented our solution in an end-to-end system supporting questions, answers and provenance, all expressed in NL. Our experiments, including a user study, indicate the quality of our solution and its scalability.
△ Less
Submitted 8 July, 2020;
originally announced July 2020.
-
T-REx: Table Repair Explanations
Authors:
Daniel Deutch,
Nave Frost,
Amir Gilad,
Oren Sheffer
Abstract:
Data repair is a common and crucial step in many frameworks today, as applications may use data from different sources and of different levels of credibility. Thus, this step has been the focus of many works, proposing diverse approaches. To assist users in understanding the output of such data repair algorithms, we propose T-REx, a system for providing data repair explanations through Shapley val…
▽ More
Data repair is a common and crucial step in many frameworks today, as applications may use data from different sources and of different levels of credibility. Thus, this step has been the focus of many works, proposing diverse approaches. To assist users in understanding the output of such data repair algorithms, we propose T-REx, a system for providing data repair explanations through Shapley values. The system is generic and not specific to a given repair algorithm or approach: it treats the algorithm as a black box. Given a specific table cell selected by the user, T-REx employs Shapley values to explain the significance of each constraint and each table cell in the repair of the cell of interest. T-REx then ranks the constraints and table cells according to their importance in the repair of this cell. This explanation allows users to understand the repair process, as well as to act based on this knowledge, to modify the most influencing constraints or the original database.
△ Less
Submitted 8 July, 2020;
originally announced July 2020.
-
On Multiple Semantics for Declarative Database Repairs
Authors:
Amir Gilad,
Daniel Deutch,
Sudeepa Roy
Abstract:
We study the problem of database repairs through a rule-based framework that we refer to as Delta Rules. Delta Rules are highly expressive and allow specifying complex, cross-relations repair logic associated with Denial Constraints, Causal Rules, and allowing to capture Database Triggers of interest. We show that there are no one-size-fits-all semantics for repairs in this inclusive setting, and…
▽ More
We study the problem of database repairs through a rule-based framework that we refer to as Delta Rules. Delta Rules are highly expressive and allow specifying complex, cross-relations repair logic associated with Denial Constraints, Causal Rules, and allowing to capture Database Triggers of interest. We show that there are no one-size-fits-all semantics for repairs in this inclusive setting, and we consequently introduce multiple alternative semantics, presenting the case for using each of them. We then study the relationships between the semantics in terms of their output and the complexity of computation. Our results formally establish the tradeoff between the permissiveness of the semantics and its computational complexity. We demonstrate the usefulness of the framework in capturing multiple data repair scenarios for an Academic Search database and the TPC-H databases, showing how using different semantics affects the repair in terms of size and runtime, and examining the relationships between the repairs. We also compare our approach with SQL triggers and a state-of-the-art data repair system.
△ Less
Submitted 12 April, 2020; v1 submitted 10 April, 2020;
originally announced April 2020.
-
Query By Provenance
Authors:
Daniel Deutch,
Amir Gilad
Abstract:
To assist non-specialists in formulating database queries, multiple frameworks that automatically infer queries from a set of examples have been proposed. While highly useful, a shortcoming of the approach is that if users can only provide a small set of examples, many inherently different queries may qualify, and only some of these actually match the user intentions. Our main observation is that…
▽ More
To assist non-specialists in formulating database queries, multiple frameworks that automatically infer queries from a set of examples have been proposed. While highly useful, a shortcoming of the approach is that if users can only provide a small set of examples, many inherently different queries may qualify, and only some of these actually match the user intentions. Our main observation is that if users further explain their examples, the set of qualifying queries may be significantly more focused. We develop a novel framework where users explain example tuples by choosing input tuples that are intuitively the "cause" for their examples. Their explanations are automatically "compiled" into a formal model for explanations, based on previously developed models of data provenance. Then, our novel algorithms infer conjunctive queries from the examples and their explanations. We prove the computational efficiency of the algorithms and favorable properties of inferred queries. We have further implemented our solution in a system prototype with an interface that assists users in formulating explanations in an intuitive way. Our experimental results, including a user study as well as experiments using the TPC-H benchmark, indicate the effectiveness of our solution.
△ Less
Submitted 16 May, 2016; v1 submitted 11 February, 2016;
originally announced February 2016.