-
The Distributional Uncertainty of the SHAP score in Explainable Machine Learning
Authors:
Santiago Cifuentes,
Leopoldo Bertossi,
Nina Pardal,
Sergio Abriola,
Maria Vanina Martinez,
Miguel Romero
Abstract:
Attribution scores reflect how important the feature values in an input entity are for the output of a machine learning model. One of the most popular attribution scores is the SHAP score, which is an instantiation of the general Shapley value used in coalition game theory. The definition of this score relies on a probability distribution on the entity population. Since the exact distribution is g…
▽ More
Attribution scores reflect how important the feature values in an input entity are for the output of a machine learning model. One of the most popular attribution scores is the SHAP score, which is an instantiation of the general Shapley value used in coalition game theory. The definition of this score relies on a probability distribution on the entity population. Since the exact distribution is generally unknown, it needs to be assigned subjectively or be estimated from data, which may lead to misleading feature scores. In this paper, we propose a principled framework for reasoning on SHAP scores under unknown entity population distributions. In our framework, we consider an uncertainty region that contains the potential distributions, and the SHAP score of a feature becomes a function defined over this region. We study the basic problems of finding maxima and minima of this function, which allows us to determine tight ranges for the SHAP scores of all features. In particular, we pinpoint the complexity of these problems, and other related ones, showing them to be NP-complete. Finally, we present experiments on a real-world dataset, showing that our framework may contribute to a more robust feature scoring.
△ Less
Submitted 23 January, 2024;
originally announced January 2024.
-
The Shapley Value in Database Management
Authors:
Leopoldo Bertossi,
Benny Kimelfeld,
Ester Livshits,
Mikaël Monet
Abstract:
Attribution scores can be applied in data management to quantify the contribution of individual items to conclusions from the data, as part of the explanation of what led to these conclusions. In Artificial Intelligence, Machine Learning, and Data Management, some of the common scores are deployments of the Shapley value, a formula for profit sharing in cooperative game theory. Since its invention…
▽ More
Attribution scores can be applied in data management to quantify the contribution of individual items to conclusions from the data, as part of the explanation of what led to these conclusions. In Artificial Intelligence, Machine Learning, and Data Management, some of the common scores are deployments of the Shapley value, a formula for profit sharing in cooperative game theory. Since its invention in the 1950s, the Shapley value has been used for contribution measurement in many fields, from economics to law, with its latest researched applications in modern machine learning. Recent studies investigated the application of the Shapley value to database management. This article gives an overview of recent results on the computational complexity of the Shapley value for measuring the contribution of tuples to query answers and to the extent of inconsistency with respect to integrity constraints. More specifically, the article highlights lower and upper bounds on the complexity of calculating the Shapley value, either exactly or approximately, as well as solutions for realizing the calculation in practice.
△ Less
Submitted 11 January, 2024;
originally announced January 2024.
-
Attribution-Scores in Data Management and Explainable Machine Learning
Authors:
Leopoldo Bertossi
Abstract:
We describe recent research on the use of actual causality in the definition of responsibility scores as explanations for query answers in databases, and for outcomes from classification models in machine learning. In the case of databases, useful connections with database repairs are illustrated and exploited. Repairs are also used to give a quantitative measure of the consistency of a database.…
▽ More
We describe recent research on the use of actual causality in the definition of responsibility scores as explanations for query answers in databases, and for outcomes from classification models in machine learning. In the case of databases, useful connections with database repairs are illustrated and exploited. Repairs are also used to give a quantitative measure of the consistency of a database. For classification models, the responsibility score is properly extended and illustrated. The efficient computation of Shap-score is also analyzed and discussed. The emphasis is placed on work done by the author and collaborators.
△ Less
Submitted 31 July, 2023;
originally announced August 2023.
-
From Database Repairs to Causality in Databases and Beyond
Authors:
Leopoldo Bertossi
Abstract:
We describe some recent approaches to score-based explanations for query answers in databases. The focus is on work done by the author and collaborators. Special emphasis is placed on the use of counterfactual reasoning for score specification and computation. Several examples that illustrate the flexibility of these methods are shown.
We describe some recent approaches to score-based explanations for query answers in databases. The focus is on work done by the author and collaborators. Special emphasis is placed on the use of counterfactual reasoning for score specification and computation. Several examples that illustrate the flexibility of these methods are shown.
△ Less
Submitted 15 June, 2023;
originally announced June 2023.
-
Efficient Computation of Shap Explanation Scores for Neural Network Classifiers via Knowledge Compilation
Authors:
Leopoldo Bertossi,
Jorge E. Leon
Abstract:
The use of Shap scores has become widespread in Explainable AI. However, their computation is in general intractable, in particular when done with a black-box classifier, such as neural network. Recent research has unveiled classes of open-box Boolean Circuit classifiers for which Shap can be computed efficiently. We show how to transform binary neural networks into those circuits for efficient Sh…
▽ More
The use of Shap scores has become widespread in Explainable AI. However, their computation is in general intractable, in particular when done with a black-box classifier, such as neural network. Recent research has unveiled classes of open-box Boolean Circuit classifiers for which Shap can be computed efficiently. We show how to transform binary neural networks into those circuits for efficient Shap computation.We use logic-based knowledge compilation techniques. The performance gain is huge, as we show in the light of our experiments.
△ Less
Submitted 22 July, 2023; v1 submitted 11 March, 2023;
originally announced March 2023.
-
Attribution-Scores and Causal Counterfactuals as Explanations in Artificial Intelligence
Authors:
Leopoldo Bertossi
Abstract:
In this expository article we highlight the relevance of explanations for artificial intelligence, in general, and for the newer developments in {\em explainable AI}, referring to origins and connections of and among different approaches. We describe in simple terms, explanations in data management and machine learning that are based on attribution-scores, and counterfactuals as found in the area…
▽ More
In this expository article we highlight the relevance of explanations for artificial intelligence, in general, and for the newer developments in {\em explainable AI}, referring to origins and connections of and among different approaches. We describe in simple terms, explanations in data management and machine learning that are based on attribution-scores, and counterfactuals as found in the area of causality. We elaborate on the importance of logical reasoning when dealing with counterfactuals, and their use for score computation.
△ Less
Submitted 22 March, 2023; v1 submitted 5 March, 2023;
originally announced March 2023.
-
Answer-Set Programs for Repair Updates and Counterfactual Interventions
Authors:
Leopoldo Bertossi
Abstract:
We briefly describe -- mainly through very simple examples -- different kinds of answer-set programs with annotations that have been proposed for specifying: database repairs and consistent query answering; secrecy view and query evaluation with them; counterfactual interventions for causality in databases; and counterfactual-based explanations in machine learning.
We briefly describe -- mainly through very simple examples -- different kinds of answer-set programs with annotations that have been proposed for specifying: database repairs and consistent query answering; secrecy view and query evaluation with them; counterfactual interventions for causality in databases; and counterfactual-based explanations in machine learning.
△ Less
Submitted 24 September, 2022;
originally announced September 2022.
-
Reasoning about Counterfactuals and Explanations: Problems, Results and Directions
Authors:
Leopoldo Bertossi
Abstract:
There are some recent approaches and results about the use of answer-set programming for specifying counterfactual interventions on entities under classification, and reasoning about them. These approaches are flexible and modular in that they allow the seamless addition of domain knowledge. Reasoning is enabled by query answering from the answer-set program. The programs can be used to specify an…
▽ More
There are some recent approaches and results about the use of answer-set programming for specifying counterfactual interventions on entities under classification, and reasoning about them. These approaches are flexible and modular in that they allow the seamless addition of domain knowledge. Reasoning is enabled by query answering from the answer-set program. The programs can be used to specify and compute responsibility-based numerical scores as attributive explanations for classification results.
△ Less
Submitted 24 August, 2021;
originally announced August 2021.
-
Second-Order Specifications and Quantifier Elimination for Consistent Query Answering in Databases
Authors:
Leopoldo Bertossi
Abstract:
Consistent answers to a query from a possibly inconsistent database are answers that are simultaneously retrieved from every possible repair of the database. Repairs are consistent instances that minimally differ from the original inconsistent instance. It has been shown before that database repairs can be specified as the stable models of a disjunctive logic program. In this paper we show how to…
▽ More
Consistent answers to a query from a possibly inconsistent database are answers that are simultaneously retrieved from every possible repair of the database. Repairs are consistent instances that minimally differ from the original inconsistent instance. It has been shown before that database repairs can be specified as the stable models of a disjunctive logic program. In this paper we show how to use the repair programs to transform the problem of consistent query answering into a problem of reasoning w.r.t. a theory written in second-order predicate logic. It also investigated how a first-order theory can be obtained instead by applying second-order quantifier elimination techniques.
△ Less
Submitted 18 October, 2021; v1 submitted 18 August, 2021;
originally announced August 2021.
-
Extending Sticky-Datalog+/- via Finite-Position Selection Functions: Tractability, Algorithms, and Optimization
Authors:
Leopoldo Bertossi,
Mostafa Milani
Abstract:
Weakly-Sticky(WS) Datalog+/- is an expressive member of the family of Datalog+/- program classes that is defined on the basis of the conditions of stickiness and weak-acyclicity. Conjunctive query answering (QA) over the WS programs has been investigated, and its tractability in data complexity has been established. However, the design and implementation of practical QA algorithms and their optimi…
▽ More
Weakly-Sticky(WS) Datalog+/- is an expressive member of the family of Datalog+/- program classes that is defined on the basis of the conditions of stickiness and weak-acyclicity. Conjunctive query answering (QA) over the WS programs has been investigated, and its tractability in data complexity has been established. However, the design and implementation of practical QA algorithms and their optimizations have been open. In order to fill this gap, we first study Sticky and WS programs from the point of view of the behavior of the chase procedure. We extend the stickiness property of the chase to that of generalized stickiness of the chase (GSCh) modulo an oracle that selects (and provides) the predicate positions where finitely values appear during the chase. Stickiness modulo a selection function S that provides only a subset of those positions defines sch(S), a semantic subclass of GSCh. Program classes with selection functions include Sticky and WS, and another syntactic class that we introduce and characterize, namely JWS, of jointly-weakly-sticky programs, which contains WS. The selection functions for these last three classes are computable, and no external, possibly non-computable oracle is needed. We propose a bottom-up QA algorithm for programs in the class sch(S), for a general selection function S. As a particular case, we obtain a polynomial-time QA algorithm for JWS and weakly-sticky programs. Unlike WS, JWS turns out to be closed under magic-sets query optimization. As a consequence, both the generic polynomial-time QA algorithm and its magic-set optimization can be particularized and applied to WS.
△ Less
Submitted 2 August, 2021; v1 submitted 2 August, 2021;
originally announced August 2021.
-
Answer-Set Programs for Reasoning about Counterfactual Interventions and Responsibility Scores for Classification
Authors:
Leopoldo Bertossi,
Gabriela Reyes
Abstract:
We describe how answer-set programs can be used to declaratively specify counterfactual interventions on entities under classification, and reason about them. In particular, they can be used to define and compute responsibility scores as attribution-based explanations for outcomes from classification models. The approach allows for the inclusion of domain knowledge and supports query answering. A…
▽ More
We describe how answer-set programs can be used to declaratively specify counterfactual interventions on entities under classification, and reason about them. In particular, they can be used to define and compute responsibility scores as attribution-based explanations for outcomes from classification models. The approach allows for the inclusion of domain knowledge and supports query answering. A detailed example with a naive-Bayes classifier is presented.
△ Less
Submitted 1 September, 2021; v1 submitted 21 July, 2021;
originally announced July 2021.
-
Score-Based Explanations in Data Management and Machine Learning: An Answer-Set Programming Approach to Counterfactual Analysis
Authors:
Leopoldo Bertossi
Abstract:
We describe some recent approaches to score-based explanations for query answers in databases and outcomes from classification models in machine learning. The focus is on work done by the author and collaborators. Special emphasis is placed on declarative approaches based on answer-set programming to the use of counterfactual reasoning for score specification and computation. Several examples that…
▽ More
We describe some recent approaches to score-based explanations for query answers in databases and outcomes from classification models in machine learning. The focus is on work done by the author and collaborators. Special emphasis is placed on declarative approaches based on answer-set programming to the use of counterfactual reasoning for score specification and computation. Several examples that illustrate the flexibility of these methods are shown.
△ Less
Submitted 19 September, 2021; v1 submitted 19 June, 2021;
originally announced June 2021.
-
On the Complexity of SHAP-Score-Based Explanations: Tractability via Knowledge Compilation and Non-Approximability Results
Authors:
Marcelo Arenas,
Pablo Barceló,
Leopoldo Bertossi,
Mikaël Monet
Abstract:
In Machine Learning, the $\mathsf{SHAP}$-score is a version of the Shapley value that is used to explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is an intractable problem, we prove a strong positive result stating that the $\mathsf{SHAP}$-score can be computed in polynomial time over deterministic and decom…
▽ More
In Machine Learning, the $\mathsf{SHAP}$-score is a version of the Shapley value that is used to explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is an intractable problem, we prove a strong positive result stating that the $\mathsf{SHAP}$-score can be computed in polynomial time over deterministic and decomposable Boolean circuits. Such circuits are studied in the field of Knowledge Compilation and generalize a wide range of Boolean circuits and binary decision diagrams classes, including binary decision trees and Ordered Binary Decision Diagrams (OBDDs).
We also establish the computational limits of the SHAP-score by observing that computing it over a class of Boolean models is always polynomially as hard as the model counting problem for that class. This implies that both determinism and decomposability are essential properties for the circuits that we consider. It also implies that computing $\mathsf{SHAP}$-scores is intractable as well over the class of propositional formulas in DNF. Based on this negative result, we look for the existence of fully-polynomial randomized approximation schemes (FPRAS) for computing $\mathsf{SHAP}$-scores over such class. In contrast to the model counting problem for DNF formulas, which admits an FPRAS, we prove that no such FPRAS exists for the computation of $\mathsf{SHAP}$-scores. Surprisingly, this negative result holds even for the class of monotone formulas in DNF. These techniques can be further extended to prove another strong negative result: Under widely believed complexity assumptions, there is no polynomial-time algorithm that checks, given a monotone DNF formula $\varphi$ and features $x,y$, whether the $\mathsf{SHAP}$-score of $x$ in $\varphi$ is smaller than the $\mathsf{SHAP}$-score of $y$ in $\varphi$.
△ Less
Submitted 30 March, 2023; v1 submitted 16 April, 2021;
originally announced April 2021.
-
Declarative Approaches to Counterfactual Explanations for Classification
Authors:
Leopoldo Bertossi
Abstract:
We propose answer-set programs that specify and compute counterfactual interventions on entities that are input on a classification model. In relation to the outcome of the model, the resulting counterfactual entities serve as a basis for the definition and computation of causality-based explanation scores for the feature values in the entity under classification, namely "responsibility scores". T…
▽ More
We propose answer-set programs that specify and compute counterfactual interventions on entities that are input on a classification model. In relation to the outcome of the model, the resulting counterfactual entities serve as a basis for the definition and computation of causality-based explanation scores for the feature values in the entity under classification, namely "responsibility scores". The approach and the programs can be applied with black-box models, and also with models that can be specified as logic programs, such as rule-based classifiers. The main focus of this work is on the specification and computation of "best" counterfactual entities, i.e. those that lead to maximum responsibility scores. From them one can read off the explanations as maximum responsibility feature values in the original entity. We also extend the programs to bring into the picture semantic or domain knowledge. We show how the approach could be extended by means of probabilistic methods, and how the underlying probability distributions could be modified through the use of constraints. Several examples of programs written in the syntax of the DLV ASP-solver, and run with it, are shown.
△ Less
Submitted 7 December, 2021; v1 submitted 14 November, 2020;
originally announced November 2020.
-
The Tractability of SHAP-Score-Based Explanations over Deterministic and Decomposable Boolean Circuits
Authors:
Marcelo Arenas,
Pablo Barceló Leopoldo Bertossi,
Mikaël Monet
Abstract:
Scores based on Shapley values are widely used for providing explanations to classification results over machine learning models. A prime example of this is the influential SHAP-score, a version of the Shapley value that can help explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is a computationally intractab…
▽ More
Scores based on Shapley values are widely used for providing explanations to classification results over machine learning models. A prime example of this is the influential SHAP-score, a version of the Shapley value that can help explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is a computationally intractable problem, it has recently been claimed that the SHAP-score can be computed in polynomial time over the class of decision trees. In this paper, we provide a proof of a stronger result over Boolean models: the SHAP-score can be computed in polynomial time over deterministic and decomposable Boolean circuits. Such circuits, also known as tractable Boolean circuits, generalize a wide range of Boolean circuits and binary decision diagrams classes, including binary decision trees, Ordered Binary Decision Diagrams (OBDDs) and Free Binary Decision Diagrams (FBDDs). We also establish the computational limits of the notion of SHAP-score by observing that, under a mild condition, computing it over a class of Boolean models is always polynomially as hard as the model counting problem for that class. This implies that both determinism and decomposability are essential properties for the circuits that we consider, as removing one or the other renders the problem of computing the SHAP-score intractable (namely, #P-hard).
△ Less
Submitted 3 April, 2021; v1 submitted 28 July, 2020;
originally announced July 2020.
-
Score-Based Explanations in Data Management and Machine Learning
Authors:
Leopoldo Bertossi
Abstract:
We describe some approaches to explanations for observed outcomes in data management and machine learning. They are based on the assignment of numerical scores to predefined and potentially relevant inputs. More specifically, we consider explanations for query answers in databases, and for results from classification models. The described approaches are mostly of a causal and counterfactual nature…
▽ More
We describe some approaches to explanations for observed outcomes in data management and machine learning. They are based on the assignment of numerical scores to predefined and potentially relevant inputs. More specifically, we consider explanations for query answers in databases, and for results from classification models. The described approaches are mostly of a causal and counterfactual nature. We argue for the need to bring domain and semantic knowledge into score computations; and suggest some ways to do this.
△ Less
Submitted 18 August, 2020; v1 submitted 24 July, 2020;
originally announced July 2020.
-
An ASP-Based Approach to Counterfactual Explanations for Classification
Authors:
Leopoldo Bertossi
Abstract:
We propose answer-set programs that specify and compute counterfactual interventions as a basis for causality-based explanations to decisions produced by classification models. They can be applied with black-box models and models that can be specified as logic programs, such as rule-based classifiers. The main focus in on the specification and computation of maximum responsibility causal explanati…
▽ More
We propose answer-set programs that specify and compute counterfactual interventions as a basis for causality-based explanations to decisions produced by classification models. They can be applied with black-box models and models that can be specified as logic programs, such as rule-based classifiers. The main focus in on the specification and computation of maximum responsibility causal explanations. The use of additional semantic knowledge is investigated.
△ Less
Submitted 15 June, 2020; v1 submitted 27 April, 2020;
originally announced April 2020.
-
Causality-based Explanation of Classification Outcomes
Authors:
Leopoldo Bertossi,
Jordan Li,
Maximilian Schleich,
Dan Suciu,
Zografoula Vagena
Abstract:
We propose a simple definition of an explanation for the outcome of a classifier based on concepts from causality. We compare it with previously proposed notions of explanation, and study their complexity. We conduct an experimental evaluation with two real datasets from the financial domain.
We propose a simple definition of an explanation for the outcome of a classifier based on concepts from causality. We compare it with previously proposed notions of explanation, and study their complexity. We conduct an experimental evaluation with two real datasets from the financial domain.
△ Less
Submitted 25 May, 2020; v1 submitted 15 March, 2020;
originally announced March 2020.
-
The Shapley Value of Tuples in Query Answering
Authors:
Ester Livshits,
Leopoldo Bertossi,
Benny Kimelfeld,
Moshe Sebag
Abstract:
We investigate the application of the Shapley value to quantifying the contribution of a tuple to a query answer. The Shapley value is a widely known numerical measure in cooperative game theory and in many applications of game theory for assessing the contribution of a player to a coalition game. It has been established already in the 1950s, and is theoretically justified by being the very single…
▽ More
We investigate the application of the Shapley value to quantifying the contribution of a tuple to a query answer. The Shapley value is a widely known numerical measure in cooperative game theory and in many applications of game theory for assessing the contribution of a player to a coalition game. It has been established already in the 1950s, and is theoretically justified by being the very single wealth-distribution measure that satisfies some natural axioms. While this value has been investigated in several areas, it received little attention in data management. We study this measure in the context of conjunctive and aggregate queries by defining corresponding coalition games. We provide algorithmic and complexity-theoretic results on the computation of Shapley-based contributions to query answers; and for the hard cases we present approximation algorithms.
△ Less
Submitted 1 September, 2021; v1 submitted 18 April, 2019;
originally announced April 2019.
-
Repair-Based Degrees of Database Inconsistency: Computation and Complexity
Authors:
Leopoldo Bertossi
Abstract:
We propose a generic numerical measure of the inconsistency of a database with respect to a set of integrity constraints. It is based on an abstract repair semantics. In particular, an inconsistency measure associated to cardinality-repairs is investigated in detail. More specifically, it is shown that it can be computed via answer-set programs, but sometimes its computation can be intractable in…
▽ More
We propose a generic numerical measure of the inconsistency of a database with respect to a set of integrity constraints. It is based on an abstract repair semantics. In particular, an inconsistency measure associated to cardinality-repairs is investigated in detail. More specifically, it is shown that it can be computed via answer-set programs, but sometimes its computation can be intractable in data complexity. However, polynomial-time deterministic and randomized approximations are exhibited. The behavior of this measure under small updates is analyzed, obtaining fixed-parameter tractability results. Furthermore, alternative inconsistency measures are proposed and discussed.
△ Less
Submitted 22 January, 2019; v1 submitted 26 September, 2018;
originally announced September 2018.
-
Measuring and Computing Database Inconsistency via Repairs
Authors:
Leopoldo Bertossi
Abstract:
We propose a generic numerical measure of inconsistency of a database with respect to a set of integrity constraints. It is based on an abstract repair semantics. A particular inconsistency measure associated to cardinality-repairs is investigated; and we show that it can be computed via answer-set programs.
Keywords: Integrity constraints in databases, inconsistent databases, database repairs,…
▽ More
We propose a generic numerical measure of inconsistency of a database with respect to a set of integrity constraints. It is based on an abstract repair semantics. A particular inconsistency measure associated to cardinality-repairs is investigated; and we show that it can be computed via answer-set programs.
Keywords: Integrity constraints in databases, inconsistent databases, database repairs, inconsistency measure.
△ Less
Submitted 12 July, 2018; v1 submitted 24 April, 2018;
originally announced April 2018.
-
Datalog: Bag Semantics via Set Semantics
Authors:
Leopoldo Bertossi,
Georg Gottlob,
Reinhard Pichler
Abstract:
Duplicates in data management are common and problematic. In this work, we present a translation of Datalog under bag semantics into a well-behaved extension of Datalog, the so-called {\em warded Datalog}$^\pm$, under set semantics. From a theoretical point of view, this allows us to reason on bag semantics by making use of the well-established theoretical foundations of set semantics. From a prac…
▽ More
Duplicates in data management are common and problematic. In this work, we present a translation of Datalog under bag semantics into a well-behaved extension of Datalog, the so-called {\em warded Datalog}$^\pm$, under set semantics. From a theoretical point of view, this allows us to reason on bag semantics by making use of the well-established theoretical foundations of set semantics. From a practical point of view, this allows us to handle the bag semantics of Datalog by powerful, existing query engines for the required extension of Datalog. This use of Datalog$^\pm$ is extended to give a set semantics to duplicates in Datalog$^\pm$ itself. We investigate the properties of the resulting Datalog$^\pm$ programs, the problem of deciding multiplicities, and expressibility of some bag operations. Moreover, the proposed translation has the potential for interesting applications such as to Multiset Relational Algebra and the semantic web query language SPARQL with bag semantics.
△ Less
Submitted 12 February, 2019; v1 submitted 16 March, 2018;
originally announced March 2018.
-
Specifying and Computing Causes for Query Answers in Databases via Database Repairs and Repair Programs
Authors:
Leopoldo Bertossi
Abstract:
A correspondence between database tuples as causes for query answers in databases and tuple-based repairs of inconsistent databases with respect to denial constraints has already been established. In this work, answer-set programs that specify repairs of databases are used as a basis for solving computational and reasoning problems about causes. Here, causes are also introduced at the attribute le…
▽ More
A correspondence between database tuples as causes for query answers in databases and tuple-based repairs of inconsistent databases with respect to denial constraints has already been established. In this work, answer-set programs that specify repairs of databases are used as a basis for solving computational and reasoning problems about causes. Here, causes are also introduced at the attribute level by appealing to a both null-based and attribute-based repair semantics. The corresponding repair programs are presented, and they are used as a basis for computation and reasoning about attribute-level causes. They are extended to deal with the case of causality under integrity constraints.
△ Less
Submitted 28 September, 2020; v1 submitted 4 December, 2017;
originally announced December 2017.
-
The Causality/Repair Connection in Databases: Causality-Programs
Authors:
Leopoldo Bertossi
Abstract:
In this work, answer-set programs that specify repairs of databases are used as a basis for solving computational and reasoning problems about causes for query answers from databases.
In this work, answer-set programs that specify repairs of databases are used as a basis for solving computational and reasoning problems about causes for query answers from databases.
△ Less
Submitted 26 June, 2017; v1 submitted 17 April, 2017;
originally announced April 2017.
-
Ontological Multidimensional Data Models and Contextual Data Qality
Authors:
Leopoldo Bertossi,
Mostafa Milani
Abstract:
Data quality assessment and data cleaning are context-dependent activities. Motivated by this observation, we propose the Ontological Multidimensional Data Model (OMD model), which can be used to model and represent contexts as logic-based ontologies. The data under assessment is mapped into the context, for additional analysis, processing, and quality data extraction. The resulting contexts allow…
▽ More
Data quality assessment and data cleaning are context-dependent activities. Motivated by this observation, we propose the Ontological Multidimensional Data Model (OMD model), which can be used to model and represent contexts as logic-based ontologies. The data under assessment is mapped into the context, for additional analysis, processing, and quality data extraction. The resulting contexts allow for the representation of dimensions, and multidimensional data quality assessment becomes possible. At the core of a multidimensional context we include a generalized multidimensional data model and a Datalog+/- ontology with provably good properties in terms of query answering. These main components are used to represent dimension hierarchies, dimensional constraints, dimensional rules, and define predicates for quality data specification. Query answering relies upon and triggers navigation through dimension hierarchies, and becomes the basic tool for the extraction of quality data. The OMD model is interesting per se, beyond applications to data quality. It allows for a logic-based, and computationally tractable representation of multidimensional data, extending previous multidimensional data models with additional expressive power and functionalities.
△ Less
Submitted 13 August, 2017; v1 submitted 31 March, 2017;
originally announced April 2017.
-
The Ontological Multidimensional Data Model
Authors:
Leopoldo Bertossi,
Mostafa Milani
Abstract:
In this extended abstract we describe, mainly by examples, the main elements of the Ontological Multidimensional Data Model, which considerably extends a relational reconstruction of the multidimensional data model proposed by Hurtado and Mendelzon by means of tuple-generating dependencies, equality-generating dependencies, and negative constraints as found in Datalog+-. We briefly mention some go…
▽ More
In this extended abstract we describe, mainly by examples, the main elements of the Ontological Multidimensional Data Model, which considerably extends a relational reconstruction of the multidimensional data model proposed by Hurtado and Mendelzon by means of tuple-generating dependencies, equality-generating dependencies, and negative constraints as found in Datalog+-. We briefly mention some good computational properties of the model.
△ Less
Submitted 3 May, 2017; v1 submitted 9 March, 2017;
originally announced March 2017.
-
Enforcing Relational Matching Dependencies with Datalog for Entity Resolution
Authors:
Zeinab Bahmani,
Leopoldo Bertossi
Abstract:
Entity resolution (ER) is about identifying and merging records in a database that represent the same real-world entity. Matching dependencies (MDs) have been introduced and investigated as declarative rules that specify ER policies. An ER process induced by MDs over a dirty instance leads to multiple clean instances, in general. General "answer sets programs" have been proposed to specify the MD-…
▽ More
Entity resolution (ER) is about identifying and merging records in a database that represent the same real-world entity. Matching dependencies (MDs) have been introduced and investigated as declarative rules that specify ER policies. An ER process induced by MDs over a dirty instance leads to multiple clean instances, in general. General "answer sets programs" have been proposed to specify the MD-based cleaning task and its results. In this work, we extend MDs to "relational MDs", which capture more application semantics, and identify classes of relational MDs for which the general ASP can be automatically rewritten into a stratified Datalog program, with the single clean instance as its standard model.
△ Less
Submitted 25 February, 2017; v1 submitted 21 November, 2016;
originally announced November 2016.
-
Causes for Query Answers from Databases: Datalog Abduction, View-Updates, and Integrity Constraints
Authors:
Leopoldo Bertossi,
Babak Salimi
Abstract:
Causality has been recently introduced in databases, to model, characterize, and possibly compute causes for query answers. Connections between QA-causality and consistency-based diagnosis and database repairs (wrt. integrity constraint violations) have already been established. In this work we establish precise connections between QA-causality and both abductive diagnosis and the view-update prob…
▽ More
Causality has been recently introduced in databases, to model, characterize, and possibly compute causes for query answers. Connections between QA-causality and consistency-based diagnosis and database repairs (wrt. integrity constraint violations) have already been established. In this work we establish precise connections between QA-causality and both abductive diagnosis and the view-update problem in databases, allowing us to obtain new algorithmic and complexity results for QA-causality. We also obtain new results on the complexity of view-conditioned causality, and investigate the notion of QA-causality in the presence of integrity constraints, obtaining complexity results from a connection with view-conditioned causality. The abduction connection under integrity constraints allows us to obtain algorithmic tools for QA-causality.
△ Less
Submitted 31 July, 2017; v1 submitted 5 November, 2016;
originally announced November 2016.
-
Contexts and Data Quality Assessment
Authors:
Leopoldo Bertossi,
Flavio Rizzolo
Abstract:
The quality of data is context dependent. Starting from this intuition and experience, we propose and develop a conceptual framework that captures in formal terms the notion of "context-dependent data quality". We start by proposing a generic and abstract notion of context, and also of its uses, in general and in data management in particular. On this basis, we investigate "data quality assessment…
▽ More
The quality of data is context dependent. Starting from this intuition and experience, we propose and develop a conceptual framework that captures in formal terms the notion of "context-dependent data quality". We start by proposing a generic and abstract notion of context, and also of its uses, in general and in data management in particular. On this basis, we investigate "data quality assessment" and "quality query answering" as context-dependent activities. A context for the assessment of a database D at hand is modeled as an external database schema, with possibly materialized or virtual data, and connections to external data sources. The database D is put in context via map**s to the contextual schema, which produces a collection C of alternative clean versions of D. The quality of D is measured in terms of its distance to C. The class C} is also used to define and do "quality query answering". The proposed model allows for natural extensions, like the use of data quality predicates, the optimization of the access by the context to external data sources, and also the representation of contexts by means of more expressive ontologies.
△ Less
Submitted 14 August, 2016;
originally announced August 2016.
-
Extending Weakly-Sticky Datalog+/-: Query-Answering Tractability and Optimizations
Authors:
Mostafa Milani,
Leopoldo Bertossi
Abstract:
Weakly-sticky (WS) Datalog+/- is an expressive member of the family of Datalog+/- programs that is based on the syntactic notions of stickiness and weak-acyclicity. Query answering over the WS programs has been investigated, but there is still much work to do on the design and implementation of practical query answering (QA) algorithms and their optimizations. Here, we study sticky and WS programs…
▽ More
Weakly-sticky (WS) Datalog+/- is an expressive member of the family of Datalog+/- programs that is based on the syntactic notions of stickiness and weak-acyclicity. Query answering over the WS programs has been investigated, but there is still much work to do on the design and implementation of practical query answering (QA) algorithms and their optimizations. Here, we study sticky and WS programs from the point of view of the behavior of the chase procedure, extending the stickiness property of the chase to that of generalized stickiness of the chase (gsch-property). With this property we specify the semantic class of GSCh programs, which includes sticky and WS programs, and other syntactic subclasses that we identify. In particular, we introduce joint-weakly-sticky (JWS) programs, that include WS programs. We also propose a bottom-up QA algorithm for a range of subclasses of GSCh. The algorithm runs in polynomial time (in data) for JWS programs. Unlike the WS class, JWS is closed under a general magic-sets rewriting procedure for the optimization of programs with existential rules. We apply the magic-sets rewriting in combination with the proposed QA algorithm for the optimization of QA over JWS programs.
△ Less
Submitted 9 July, 2016;
originally announced July 2016.
-
Consistency and Trust in Peer Data Exchange Systems
Authors:
Leopoldo Bertossi,
Loreto Bravo
Abstract:
We propose and investigate a semantics for "peer data exchange systems" where different peers are related by data exchange constraints and trust relationships. These two elements plus the data at the peers' sites and their local integrity constraints are made compatible via a semantics that characterizes sets of "solution instances" for the peers. They are the intended -possibly virtual- instances…
▽ More
We propose and investigate a semantics for "peer data exchange systems" where different peers are related by data exchange constraints and trust relationships. These two elements plus the data at the peers' sites and their local integrity constraints are made compatible via a semantics that characterizes sets of "solution instances" for the peers. They are the intended -possibly virtual- instances for a peer that are obtained through a data repair semantics that we introduce and investigate. The semantically correct answers from a peer to a query, the so-called "peer consistent answers", are defined as those answers that are invariant under all its different solution instances. We show that solution instances can be specified as the models of logic programs with a stable model semantics. The repair semantics is based on null values as used in SQL databases, and is also of independent interest for repairs of single databases with respect to integrity constraints.
△ Less
Submitted 6 June, 2016;
originally announced June 2016.
-
Complexity of Consistent Query Answering in Databases under Cardinality-Based and Incremental Repair Semantics (extended version)
Authors:
Andrei Lopatenko,
Leopoldo Bertossi
Abstract:
A database D may be inconsistent wrt a given set IC of integrity constraints. Consistent Query Answering (CQA) is the problem of computing from D the answers to a query that are consistent wrt IC . Consistent answers are invariant under all the repairs of D, i.e. the consistent instances that minimally depart from D. Three classes of repair have been considered in the literature: those that minimi…
▽ More
A database D may be inconsistent wrt a given set IC of integrity constraints. Consistent Query Answering (CQA) is the problem of computing from D the answers to a query that are consistent wrt IC . Consistent answers are invariant under all the repairs of D, i.e. the consistent instances that minimally depart from D. Three classes of repair have been considered in the literature: those that minimize set-theoretically the set of tuples in the symmetric difference; those that minimize the changes of attribute values, and those that minimize the cardinality of the set of tuples in the symmetric difference. The latter class has not been systematically investigated. In this paper we obtain algorithmic and complexity theoretic results for CQA under this cardinality-based repair semantics. We do this in the usual, static setting, but also in a dynamic framework where a consistent database is affected by a sequence of updates, which may make it inconsistent. We also establish comparative results with the other two kinds of repairs in the dynamic case.
△ Less
Submitted 23 May, 2016;
originally announced May 2016.
-
A Hybrid Approach to Query Answering under Expressive Datalog+/-
Authors:
Mostafa Milani,
Andrea Cali,
Leopoldo Bertossi
Abstract:
Datalog+/- is a family of ontology languages that combine good computational properties with high expressive power. Datalog+/- languages are provably able to capture the most relevant Semantic Web languages. In this paper we consider the class of weakly-sticky (WS) Datalog+/- programs, which allow for certain useful forms of joins in rule bodies as well as extending the well-known class of weakly-…
▽ More
Datalog+/- is a family of ontology languages that combine good computational properties with high expressive power. Datalog+/- languages are provably able to capture the most relevant Semantic Web languages. In this paper we consider the class of weakly-sticky (WS) Datalog+/- programs, which allow for certain useful forms of joins in rule bodies as well as extending the well-known class of weakly-acyclic TGDs. So far, only non-deterministic algorithms were known for answering queries on WS Datalog+/- programs. We present novel deterministic query answering algorithms under WS Datalog+/-. In particular, we propose: (1) a bottom-up grounding algorithm based on a query-driven chase, and (2) a hybrid approach based on transforming a WS program into a so-called sticky one, for which query rewriting techniques are known. We discuss how our algorithms can be optimized and effectively applied for query answering in real-world scenarios.
△ Less
Submitted 25 July, 2016; v1 submitted 22 April, 2016;
originally announced April 2016.
-
Quantifying Causal Effects on Query Answering in Databases
Authors:
Babak Salimi,
Leopoldo Bertossi,
Dan Suciu,
Guy Van den Broeck
Abstract:
The notion of actual causation, as formalized by Halpern and Pearl, has been recently applied to relational databases, to characterize and compute actual causes for possibly unexpected answers to monotone queries. Causes take the form of database tuples, and can be ranked according to their causal responsibility, a numerical measure of their relevance as a cause to the query answer. In this work w…
▽ More
The notion of actual causation, as formalized by Halpern and Pearl, has been recently applied to relational databases, to characterize and compute actual causes for possibly unexpected answers to monotone queries. Causes take the form of database tuples, and can be ranked according to their causal responsibility, a numerical measure of their relevance as a cause to the query answer. In this work we revisit this notion, introducing and making a case for an alternative measure of causal contribution, that of causal effect. The measure generalizes actual causes, and can be applied beyond monotone queries. We show that causal effect provides intuitive and intended results.
△ Less
Submitted 24 April, 2016; v1 submitted 8 March, 2016;
originally announced March 2016.
-
Causes for Query Answers from Databases, Datalog Abduction and View-Updates: The Presence of Integrity Constraints
Authors:
Babak Salimi,
Leopoldo Bertossi
Abstract:
Causality has been recently introduced in databases, to model, characterize and possibly compute causes for query results (answers). Connections between queryanswer causality, consistency-based diagnosis, database repairs (wrt. integrity constraint violations), abductive diagnosis and the view-update problem have been established. In this work we further investigate connections between query-answe…
▽ More
Causality has been recently introduced in databases, to model, characterize and possibly compute causes for query results (answers). Connections between queryanswer causality, consistency-based diagnosis, database repairs (wrt. integrity constraint violations), abductive diagnosis and the view-update problem have been established. In this work we further investigate connections between query-answer causality and abductive diagnosis and the view-update problem. In this context, we also define and investigate the notion of query-answer causality in the presence of integrity constraints.
△ Less
Submitted 20 February, 2016;
originally announced February 2016.
-
ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution
Authors:
Zeinab Bahmani,
Leopoldo Bertossi,
Nikolaos Vasiloglou
Abstract:
Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called "matching dependencies" (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this…
▽ More
Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called "matching dependencies" (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating four components of ER: (a) Building a classifier for duplicate/non-duplicate record pairs built using machine learning (ML) techniques; (b) Use of MDs for supporting the blocking phase of ML; (c) Record merging on the basis of the classifier results; and (d) The use of the declarative language "LogiQL" -an extended form of Datalog supported by the "LogicBlox" platform- for all activities related to data processing, and the specification and enforcement of MDs.
△ Less
Submitted 18 January, 2017; v1 submitted 6 February, 2016;
originally announced February 2016.
-
ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution
Authors:
Zeinab Bahmani,
Leopoldo Bertossi,
Nikolaos Vasiloglou
Abstract:
Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this wo…
▽ More
Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating three components of ER: (a) Classifiers for duplicate/non-duplicate record pairs built using machine learning (ML) techniques, (b) MDs for supporting both the blocking phase of ML and the merge itself; and (c) The use of the declarative language LogiQL -an extended form of Datalog supported by the LogicBlox platform- for data processing, and the specification and enforcement of MDs.
△ Less
Submitted 24 August, 2015;
originally announced August 2015.
-
From Causes for Database Queries to Repairs and Model-Based Diagnosis and Back
Authors:
Leopoldo Bertossi,
Babak Salimi
Abstract:
In this work we establish and investigate connections between causes for query answers in databases, database repairs wrt. denial constraints, and consistency-based diagnosis. The first two are relatively new research areas in databases, and the third one is an established subject in knowledge representation. We show how to obtain database repairs from causes, and the other way around. Causality p…
▽ More
In this work we establish and investigate connections between causes for query answers in databases, database repairs wrt. denial constraints, and consistency-based diagnosis. The first two are relatively new research areas in databases, and the third one is an established subject in knowledge representation. We show how to obtain database repairs from causes, and the other way around. Causality problems are formulated as diagnosis problems, and the diagnoses provide causes and their responsibilities. The vast body of research on database repairs can be applied to the newer problems of computing actual causes for query answers and their responsibilities. These connections, which are interesting per se, allow us, after a transition -inspired by consistency-based diagnosis- to computational problems on hitting sets and vertex covers in hypergraphs, to obtain several new algorithmic and complexity results for database causality.
△ Less
Submitted 23 October, 2016; v1 submitted 1 July, 2015;
originally announced July 2015.
-
Query-Answer Causality in Databases: Abductive Diagnosis and View-Updates
Authors:
Babak Salimi,
Leopoldo Bertossi
Abstract:
Causality has been recently introduced in databases, to model, characterize and possibly compute causes for query results (answers). Connections between query causality and consistency-based diagnosis and database repairs (wrt. integrity constrain violations) have been established in the literature. In this work we establish connections between query causality and abductive diagnosis and the view-…
▽ More
Causality has been recently introduced in databases, to model, characterize and possibly compute causes for query results (answers). Connections between query causality and consistency-based diagnosis and database repairs (wrt. integrity constrain violations) have been established in the literature. In this work we establish connections between query causality and abductive diagnosis and the view-update problem. The unveiled relationships allow us to obtain new complexity results for query causality -the main focus of our work- and also for the two other areas.
△ Less
Submitted 19 September, 2015; v1 submitted 13 June, 2015;
originally announced June 2015.
-
Tractable Query Answering and Optimization for Extensions of Weakly-Sticky Datalog+-
Authors:
Mostafa Milani,
Leopoldo Bertossi
Abstract:
We consider a semantic class, weakly-chase-sticky (WChS), and a syntactic subclass, jointly-weakly-sticky (JWS), of Datalog+- programs. Both extend that of weakly-sticky (WS) programs, which appear in our applications to data quality. For WChS programs we propose a practical, polynomial-time query answering algorithm (QAA). We establish that the two classes are closed under magic-sets rewritings.…
▽ More
We consider a semantic class, weakly-chase-sticky (WChS), and a syntactic subclass, jointly-weakly-sticky (JWS), of Datalog+- programs. Both extend that of weakly-sticky (WS) programs, which appear in our applications to data quality. For WChS programs we propose a practical, polynomial-time query answering algorithm (QAA). We establish that the two classes are closed under magic-sets rewritings. As a consequence, QAA can be applied to the optimized programs. QAA takes as inputs the program (including the query) and semantic information about the "finiteness" of predicate positions. For the syntactic subclasses JWS and WS of WChS, this additional information is computable.
△ Less
Submitted 13 April, 2015;
originally announced April 2015.
-
From Causes for Database Queries to Repairs and Model-Based Diagnosis and Back
Authors:
Babak Salimi,
Leopoldo Bertossi
Abstract:
In this work we establish and investigate connections between causality for query answers in databases, database repairs wrt. denial constraints, and consistency-based diagnosis. The first two are relatively new problems in databases, and the third one is an established subject in knowledge representation. We show how to obtain database repairs from causes and the other way around. Causality probl…
▽ More
In this work we establish and investigate connections between causality for query answers in databases, database repairs wrt. denial constraints, and consistency-based diagnosis. The first two are relatively new problems in databases, and the third one is an established subject in knowledge representation. We show how to obtain database repairs from causes and the other way around. Causality problems are formulated as diagnosis problems, and the diagnoses provide causes and their responsibilities. The vast body of research on database repairs can be applied to the newer problem of determining actual causes for query answers and their responsibilities. These connections, which are interesting per se, allow us, after a transition -inspired by consistency-based diagnosis- to computational problems on hitting sets and vertex covers in hypergraphs, to obtain several new algorithmic and complexity results for causality in databases.
△ Less
Submitted 13 December, 2014;
originally announced December 2014.
-
Unifying Causality, Diagnosis, Repairs and View-Updates in Databases
Authors:
Leopoldo Bertossi,
Babak Salimi
Abstract:
In this work we establish and point out connections between the notion of query-answer causality in databases and database repairs, model-based diagnosis in its consistency-based and abductive versions, and database updates through views. The mutual relationships among these areas of data management and knowledge representation shed light on each of them and help to share notions and results they…
▽ More
In this work we establish and point out connections between the notion of query-answer causality in databases and database repairs, model-based diagnosis in its consistency-based and abductive versions, and database updates through views. The mutual relationships among these areas of data management and knowledge representation shed light on each of them and help to share notions and results they have in common. In one way or another, these are all approaches to uncertainty management, which becomes even more relevant in the context of big data that have to be made sense of.
△ Less
Submitted 28 June, 2014; v1 submitted 16 May, 2014;
originally announced May 2014.
-
Causality in Databases: The Diagnosis and Repair Connections
Authors:
Babak Salimi,
Leopoldo Bertossi
Abstract:
In this work we establish and investigate the connections between causality for query answers in databases, database repairs wrt. denial constraints, and consistency-based diagnosis. The first two are relatively new problems in databases, and the third one is an established subject in knowledge representation. We show how to obtain database repairs from causes and the other way around. The vast bo…
▽ More
In this work we establish and investigate the connections between causality for query answers in databases, database repairs wrt. denial constraints, and consistency-based diagnosis. The first two are relatively new problems in databases, and the third one is an established subject in knowledge representation. We show how to obtain database repairs from causes and the other way around. The vast body of research on database repairs can be applied to the newer problem of determining actual causes for query answers. By formulating a causality problem as a diagnosis problem, we manage to characterize causes in terms of a system's diagnoses.
△ Less
Submitted 28 June, 2014; v1 submitted 27 April, 2014;
originally announced April 2014.
-
Extending Contexts with Ontologies for Multidimensional Data Quality Assessment
Authors:
Mostafa Milani,
Leopoldo Bertossi,
Sina Ariyan
Abstract:
Data quality and data cleaning are context dependent activities. Starting from this observation, in previous work a context model for the assessment of the quality of a database instance was proposed. In that framework, the context takes the form of a possibly virtual database or data integration system into which a database instance under quality assessment is mapped, for additional analysis and…
▽ More
Data quality and data cleaning are context dependent activities. Starting from this observation, in previous work a context model for the assessment of the quality of a database instance was proposed. In that framework, the context takes the form of a possibly virtual database or data integration system into which a database instance under quality assessment is mapped, for additional analysis and processing, enabling quality assessment. In this work we extend contexts with dimensions, and by doing so, we make possible a multidimensional assessment of data quality assessment. Multidimensional contexts are represented as ontologies written in Datalog+-. We use this language for representing dimensional constraints, and dimensional rules, and also for doing query answering based on dimensional navigation, which becomes an important auxiliary activity in the assessment of data. We show ideas and mechanisms by means of examples.
△ Less
Submitted 20 January, 2014; v1 submitted 27 December, 2013;
originally announced December 2013.
-
Tractable vs. Intractable Cases of Matching Dependencies for Query Answering under Entity Resolution
Authors:
Leopoldo Bertossi,
Jaffer Gardezi
Abstract:
Matching Dependencies (MDs) are a relatively recent proposal for declarative entity resolution. They are rules that specify, on the basis of similarities satisfied by values in a database, what values should be considered duplicates, and have to be matched. On the basis of a chase-like procedure for MD enforcement, we can obtain clean (duplicate-free) instances; actually possibly several of them.…
▽ More
Matching Dependencies (MDs) are a relatively recent proposal for declarative entity resolution. They are rules that specify, on the basis of similarities satisfied by values in a database, what values should be considered duplicates, and have to be matched. On the basis of a chase-like procedure for MD enforcement, we can obtain clean (duplicate-free) instances; actually possibly several of them. The resolved answers to queries are those that are invariant under the resulting class of resolved instances. Previous work identified certain classes of queries and sets of MDs for which resolved query answering is tractable. Special emphasis was placed on cyclic sets of MDs. In this work we further investigate the complexity of this problem, identifying intractable cases, and exploring the frontier between tractability and intractability. We concentrate mostly on acyclic sets of MDs. For a special case we obtain a dichotomy result relative to NP-hardness.
△ Less
Submitted 6 April, 2014; v1 submitted 7 September, 2013;
originally announced September 2013.
-
On the Complexity of Query Answering under Matching Dependencies for Entity Resolution
Authors:
Leopoldo Bertossi,
Jaffer Gardezi
Abstract:
Matching Dependencies (MDs) are a relatively recent proposal for declarative entity resolution. They are rules that specify, given the similarities satisfied by values in a database, what values should be considered duplicates, and have to be matched. On the basis of a chase-like procedure for MD enforcement, we can obtain clean (duplicate-free) instances; actually possibly several of them. The re…
▽ More
Matching Dependencies (MDs) are a relatively recent proposal for declarative entity resolution. They are rules that specify, given the similarities satisfied by values in a database, what values should be considered duplicates, and have to be matched. On the basis of a chase-like procedure for MD enforcement, we can obtain clean (duplicate-free) instances; actually possibly several of them. The resolved answers to queries are those that are invariant under the resulting class of resolved instances. In previous work we identified some tractable cases (i.e. for certain classes of queries and MDs) of resolved query answering. In this paper we further investigate the complexity of this problem, identifying some intractable cases. For a special case we obtain a dichotomy complexity result.
△ Less
Submitted 26 May, 2013; v1 submitted 30 April, 2013;
originally announced April 2013.
-
Query Answering under Matching Dependencies for Data Cleaning: Complexity and Algorithms
Authors:
Jaffer Gardezi,
Leopoldo Bertossi
Abstract:
Matching dependencies (MDs) have been recently introduced as declarative rules for entity resolution (ER), i.e. for identifying and resolving duplicates in relational instance $D$. A set of MDs can be used as the basis for a possibly non-deterministic mechanism that computes a duplicate-free instance from $D$. The possible results of this process are the clean, "minimally resolved instances" (MRIs…
▽ More
Matching dependencies (MDs) have been recently introduced as declarative rules for entity resolution (ER), i.e. for identifying and resolving duplicates in relational instance $D$. A set of MDs can be used as the basis for a possibly non-deterministic mechanism that computes a duplicate-free instance from $D$. The possible results of this process are the clean, "minimally resolved instances" (MRIs). There might be several MRIs for $D$, and the "resolved answers" to a query are those that are shared by all the MRIs. We investigate the problem of computing resolved answers. We look at various sets of MDs, develo** syntactic criteria for determining (in)tractability of the resolved answer problem, including a dichotomy result. For some tractable classes of MDs and conjunctive queries, we present a query rewriting methodology that can be used to retrieve the resolved answers. We also investigate connections with "consistent query answering", deriving further tractability results for MD-based ER.
△ Less
Submitted 26 December, 2011;
originally announced December 2011.
-
Consistent Query Answering under Spatial Semantic Constraints
Authors:
M. Andrea Rodríguez,
Leopoldo Bertossi,
Monica Caniupan
Abstract:
Consistent query answering is an inconsistency tolerant approach to obtaining semantically correct answers from a database that may be inconsistent with respect to its integrity constraints. In this work we formalize the notion of consistent query answer for spatial databases and spatial semantic integrity constraints. In order to do this, we first characterize conflicting spatial data, and next,…
▽ More
Consistent query answering is an inconsistency tolerant approach to obtaining semantically correct answers from a database that may be inconsistent with respect to its integrity constraints. In this work we formalize the notion of consistent query answer for spatial databases and spatial semantic integrity constraints. In order to do this, we first characterize conflicting spatial data, and next, we define admissible instances that restore consistency while staying close to the original instance. In this way we obtain a repair semantics, which is used as an instrumental concept to define and possibly derive consistent query answers. We then concentrate on a class of spatial denial constraints and spatial queries for which there exists an efficient strategy to compute consistent query answers. This study applies inconsistency tolerance in spatial databases, rising research issues that shift the goal from the consistency of a spatial database to the consistency of query answering.
△ Less
Submitted 7 June, 2011;
originally announced June 2011.
-
Achieving Data Privacy through Secrecy Views and Null-Based Virtual Updates
Authors:
Leopoldo Bertossi,
Lechen Li
Abstract:
There may be sensitive information in a relational database, and we might want to keep it hidden from a user or group thereof. In this work, sensitive data is characterized as the contents of a set of secrecy views. For a user without permission to access that sensitive data, the database instance he queries is updated to make the contents of the views empty or contain only tuples with null values…
▽ More
There may be sensitive information in a relational database, and we might want to keep it hidden from a user or group thereof. In this work, sensitive data is characterized as the contents of a set of secrecy views. For a user without permission to access that sensitive data, the database instance he queries is updated to make the contents of the views empty or contain only tuples with null values. In particular, if this user poses a query about any of these views, no meaningful information is returned. Since the database is not expected to be physically changed to produce this result, the updates are only virtual. And also minimal in a precise way. These minimal updates are reflected in the secrecy view contents, and also in the fact that query answers, while being privacy preserving, are also maximally informative. Virtual updates are based on the use of null values as used in the SQL standard. We provide the semantics of secrecy views and the virtual updates. The different ways in which the underlying database is virtually updated are specified as the models of a logic program with stable model semantics. The program becomes the basis for the computation of the "secret answers" to queries, i.e. those that do not reveal the sensitive information.
△ Less
Submitted 5 April, 2012; v1 submitted 6 May, 2011;
originally announced May 2011.
-
Matching Dependencies with Arbitrary Attribute Values: Semantics, Query Answering and Integrity Constraints
Authors:
Jaffer Gardezi,
Leopoldo Bertossi,
Iluju Kiringa
Abstract:
Matching dependencies (MDs) were introduced to specify the identification or matching of certain attribute values in pairs of database tuples when some similarity conditions are satisfied. Their enforcement can be seen as a natural generalization of entity resolution. In what we call the "pure case" of MDs, any value from the underlying data domain can be used for the value in common that does the…
▽ More
Matching dependencies (MDs) were introduced to specify the identification or matching of certain attribute values in pairs of database tuples when some similarity conditions are satisfied. Their enforcement can be seen as a natural generalization of entity resolution. In what we call the "pure case" of MDs, any value from the underlying data domain can be used for the value in common that does the matching. We investigate the semantics and properties of data cleaning through the enforcement of matching dependencies for the pure case. We characterize the intended clean instances and also the "clean answers" to queries as those that are invariant under the cleaning process. The complexity of computing clean instances and clean answers to queries is investigated. Tractable and intractable cases depending on the MDs and queries are identified. Finally, we establish connections with database "repairs" under integrity constraints.
△ Less
Submitted 26 August, 2010;
originally announced August 2010.