
The Distributional Uncertainty of the SHAP score in Explainable Machine Learning

Authors: Santiago Cifuentes, Leopoldo Bertossi, Nina Pardal, Sergio Abriola, Maria Vanina Martinez, Miguel Romero

Abstract: Attribution scores reflect how important the feature values in an input entity are for the output of a machine learning model. One of the most popular attribution scores is the SHAP score, which is an instantiation of the general Shapley value used in coalition game theory. The definition of this score relies on a probability distribution on the entity population. Since the exact distribution is generally unknown, it needs to be assigned subjectively or be estimated from data, which may lead to misleading feature scores. In this paper, we propose a principled framework for reasoning on SHAP scores under unknown entity population distributions. In our framework, we consider an uncertainty region that contains the potential distributions, and the SHAP score of a feature becomes a function defined over this region. We study the basic problems of finding maxima and minima of this function, which allows us to determine tight ranges for the SHAP scores of all features. In particular, we pinpoint the complexity of these problems, and other related ones, showing them to be NP-complete. Finally, we present experiments on a real-world dataset, showing that our framework may contribute to a more robust feature scoring.

Submitted 23 January, 2024; originally announced January 2024.

MSC Class: 68T37; 68T27

arXiv:2401.06234 [pdf, other]

doi 10.1145/3615952.3615954

The Shapley Value in Database Management

Authors: Leopoldo Bertossi, Benny Kimelfeld, Ester Livshits, Mikaël Monet

Abstract: Attribution scores can be applied in data management to quantify the contribution of individual items to conclusions from the data, as part of the explanation of what led to these conclusions. In Artificial Intelligence, Machine Learning, and Data Management, some of the common scores are deployments of the Shapley value, a formula for profit sharing in cooperative game theory. Since its invention in the 1950s, the Shapley value has been used for contribution measurement in many fields, from economics to law, with its latest researched applications in modern machine learning. Recent studies investigated the application of the Shapley value to database management. This article gives an overview of recent results on the computational complexity of the Shapley value for measuring the contribution of tuples to query answers and to the extent of inconsistency with respect to integrity constraints. More specifically, the article highlights lower and upper bounds on the complexity of calculating the Shapley value, either exactly or approximately, as well as solutions for realizing the calculation in practice.

Submitted 11 January, 2024; originally announced January 2024.

Comments: 12 pages, including references. This is the authors' version of the corresponding SIGMOD Record article

Journal ref: SIGMOD Rec. 52(2): 6-17 (2023)

arXiv:2308.00184 [pdf, other]

Attribution-Scores in Data Management and Explainable Machine Learning

Authors: Leopoldo Bertossi

Abstract: We describe recent research on the use of actual causality in the definition of responsibility scores as explanations for query answers in databases, and for outcomes from classification models in machine learning. In the case of databases, useful connections with database repairs are illustrated and exploited. Repairs are also used to give a quantitative measure of the consistency of a database. For classification models, the responsibility score is properly extended and illustrated. The efficient computation of Shap-score is also analyzed and discussed. The emphasis is placed on work done by the author and collaborators.

Submitted 31 July, 2023; originally announced August 2023.

Comments: Paper associated to ADBIS23 tutorial. To appear. arXiv admin note: substantial text overlap with arXiv:2303.02829, arXiv:2106.10562

arXiv:2306.09374 [pdf, other]

From Database Repairs to Causality in Databases and Beyond

Authors: Leopoldo Bertossi

Abstract: We describe some recent approaches to score-based explanations for query answers in databases. The focus is on work done by the author and collaborators. Special emphasis is placed on the use of counterfactual reasoning for score specification and computation. Several examples that illustrate the flexibility of these methods are shown.

Submitted 15 June, 2023; originally announced June 2023.

Comments: Contributed paper associated to keynote presentation at BDA 2022. To appear in special issue of Springer TLDKS. arXiv admin note: substantial text overlap with arXiv:2106.10562

arXiv:2303.06516 [pdf, other]

Efficient Computation of Shap Explanation Scores for Neural Network Classifiers via Knowledge Compilation

Authors: Leopoldo Bertossi, Jorge E. Leon

Abstract: The use of Shap scores has become widespread in Explainable AI. However, their computation is in general intractable, in particular when done with a black-box classifier, such as a neural network. Recent research has unveiled classes of open-box Boolean Circuit classifiers for which Shap can be computed efficiently. We show how to transform binary neural networks into those circuits for efficient Shap computation. We use logic-based knowledge compilation techniques. The performance gain is huge, as we show in the light of our experiments.

Submitted 22 July, 2023; v1 submitted 11 March, 2023; originally announced March 2023.

Comments: Substantial revision of previous version with the same title. To appear in conference proceedings. It replaces the previously uploaded paper "Opening Up the Neural Network Classifier for Shap Score Computation", by the same authors

arXiv:2303.02829 [pdf, other]

Attribution-Scores and Causal Counterfactuals as Explanations in Artificial Intelligence

Authors: Leopoldo Bertossi

Abstract: In this expository article we highlight the relevance of explanations for artificial intelligence, in general, and for the newer developments in {\em explainable AI}, referring to origins and connections of and among different approaches. We describe in simple terms, explanations in data management and machine learning that are based on attribution-scores, and counterfactuals as found in the area of causality. We elaborate on the importance of logical reasoning when dealing with counterfactuals, and their use for score computation.

Submitted 22 March, 2023; v1 submitted 5 March, 2023; originally announced March 2023.

Comments: Submitted as chapter contribution. In this version some additional comments were added, and some wrong equation references corrected

arXiv:2209.12110 [pdf, ps, other]

Answer-Set Programs for Repair Updates and Counterfactual Interventions

Authors: Leopoldo Bertossi

Abstract: We briefly describe -- mainly through very simple examples -- different kinds of answer-set programs with annotations that have been proposed for specifying: database repairs and consistent query answering; secrecy view and query evaluation with them; counterfactual interventions for causality in databases; and counterfactual-based explanations in machine learning.

Submitted 24 September, 2022; originally announced September 2022.

Comments: Submitted to Festschrift volume

arXiv:2108.11004 [pdf, ps, other]

Reasoning about Counterfactuals and Explanations: Problems, Results and Directions

Authors: Leopoldo Bertossi

Abstract: There are some recent approaches and results about the use of answer-set programming for specifying counterfactual interventions on entities under classification, and reasoning about them. These approaches are flexible and modular in that they allow the seamless addition of domain knowledge. Reasoning is enabled by query answering from the answer-set program. The programs can be used to specify and compute responsibility-based numerical scores as attributive explanations for classification results.

Submitted 24 August, 2021; originally announced August 2021.

Comments: To appear in informal proceedings of 2nd Workshop on Explainable Logic-Based Knowledge Representation (XLoKR 2021), co-located with KR 2021. arXiv admin note: substantial text overlap with arXiv:2107.10159

arXiv:2108.08423 [pdf, ps, other]

Second-Order Specifications and Quantifier Elimination for Consistent Query Answering in Databases

Authors: Leopoldo Bertossi

Abstract: Consistent answers to a query from a possibly inconsistent database are answers that are simultaneously retrieved from every possible repair of the database. Repairs are consistent instances that minimally differ from the original inconsistent instance. It has been shown before that database repairs can be specified as the stable models of a disjunctive logic program. In this paper we show how to use the repair programs to transform the problem of consistent query answering into a problem of reasoning w.r.t. a theory written in second-order predicate logic. It is also investigated how a first-order theory can be obtained instead by applying second-order quantifier elimination techniques.

Submitted 18 October, 2021; v1 submitted 18 August, 2021; originally announced August 2021.

Comments: A couple of minor mistakes corrected, and some explanations added

arXiv:2108.00903 [pdf, other]

Extending Sticky-Datalog+/- via Finite-Position Selection Functions: Tractability, Algorithms, and Optimization

Authors: Leopoldo Bertossi, Mostafa Milani

Abstract: Weakly-Sticky (WS) Datalog+/- is an expressive member of the family of Datalog+/- program classes that is defined on the basis of the conditions of stickiness and weak-acyclicity. Conjunctive query answering (QA) over the WS programs has been investigated, and its tractability in data complexity has been established. However, the design and implementation of practical QA algorithms and their optimizations have been open. In order to fill this gap, we first study Sticky and WS programs from the point of view of the behavior of the chase procedure. We extend the stickiness property of the chase to that of generalized stickiness of the chase (GSCh) modulo an oracle that selects (and provides) the predicate positions where finitely many values appear during the chase. Stickiness modulo a selection function S that provides only a subset of those positions defines sch(S), a semantic subclass of GSCh. Program classes with selection functions include Sticky and WS, and another syntactic class that we introduce and characterize, namely JWS, of jointly-weakly-sticky programs, which contains WS. The selection functions for these last three classes are computable, and no external, possibly non-computable oracle is needed. We propose a bottom-up QA algorithm for programs in the class sch(S), for a general selection function S. As a particular case, we obtain a polynomial-time QA algorithm for JWS and weakly-sticky programs. Unlike WS, JWS turns out to be closed under magic-sets query optimization.
As a consequence, both the generic polynomial-time QA algorithm and its magic-set optimization can be particularized and applied to WS.

Submitted 2 August, 2021; v1 submitted 2 August, 2021; originally announced August 2021.

Comments: Journal submission

arXiv:2107.10159 [pdf, other]

Answer-Set Programs for Reasoning about Counterfactual Interventions and Responsibility Scores for Classification

Authors: Leopoldo Bertossi, Gabriela Reyes

Abstract: We describe how answer-set programs can be used to declaratively specify counterfactual interventions on entities under classification, and reason about them. In particular, they can be used to define and compute responsibility scores as attribution-based explanations for outcomes from classification models. The approach allows for the inclusion of domain knowledge and supports query answering. A detailed example with a naive-Bayes classifier is presented.

Submitted 1 September, 2021; v1 submitted 21 July, 2021; originally announced July 2021.

Comments: Revised for camera ready. Extended version with appendices of paper to appear in IJCLR'21. arXiv admin note: text overlap with arXiv:2106.10562

arXiv:2106.10562 [pdf, other]

Score-Based Explanations in Data Management and Machine Learning: An Answer-Set Programming Approach to Counterfactual Analysis

Authors: Leopoldo Bertossi

Abstract: We describe some recent approaches to score-based explanations for query answers in databases and outcomes from classification models in machine learning. The focus is on work done by the author and collaborators. Special emphasis is placed on declarative approaches based on answer-set programming to the use of counterfactual reasoning for score specification and computation. Several examples that illustrate the flexibility of these methods are shown.

Submitted 19 September, 2021; v1 submitted 19 June, 2021; originally announced June 2021.

Comments: Revised version for camera ready. Typos corrected, new references, and a new section with background material added. Paper associated to forthcoming short course at Fall School. arXiv admin note: text overlap with arXiv:2007.12799

arXiv:2104.08015 [pdf, other]

On the Complexity of SHAP-Score-Based Explanations: Tractability via Knowledge Compilation and Non-Approximability Results

Authors: Marcelo Arenas, Pablo Barceló, Leopoldo Bertossi, Mikaël Monet

Abstract: In Machine Learning, the $\mathsf{SHAP}$-score is a version of the Shapley value that is used to explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is an intractable problem, we prove a strong positive result stating that the $\mathsf{SHAP}$-score can be computed in polynomial time over deterministic and decomposable Boolean circuits. Such circuits are studied in the field of Knowledge Compilation and generalize a wide range of Boolean circuits and binary decision diagrams classes, including binary decision trees and Ordered Binary Decision Diagrams (OBDDs). We also establish the computational limits of the SHAP-score by observing that computing it over a class of Boolean models is always polynomially as hard as the model counting problem for that class. This implies that both determinism and decomposability are essential properties for the circuits that we consider. It also implies that computing $\mathsf{SHAP}$-scores is intractable as well over the class of propositional formulas in DNF. Based on this negative result, we look for the existence of fully-polynomial randomized approximation schemes (FPRAS) for computing $\mathsf{SHAP}$-scores over such class. In contrast to the model counting problem for DNF formulas, which admits an FPRAS, we prove that no such FPRAS exists for the computation of $\mathsf{SHAP}$-scores. Surprisingly, this negative result holds even for the class of monotone formulas in DNF.
These techniques can be further extended to prove another strong negative result: Under widely believed complexity assumptions, there is no polynomial-time algorithm that checks, given a monotone DNF formula $\varphi$ and features $x,y$, whether the $\mathsf{SHAP}$-score of $x$ in $\varphi$ is smaller than the $\mathsf{SHAP}$-score of $y$ in $\varphi$.

Submitted 30 March, 2023; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: Up to the formatting, this is the exact content of the paper in Journal of Machine Learning Research (JMLR)

arXiv:2011.07423 [pdf, other]

Declarative Approaches to Counterfactual Explanations for Classification

Authors: Leopoldo Bertossi

Abstract: We propose answer-set programs that specify and compute counterfactual interventions on entities that are input on a classification model. In relation to the outcome of the model, the resulting counterfactual entities serve as a basis for the definition and computation of causality-based explanation scores for the feature values in the entity under classification, namely "responsibility scores". The approach and the programs can be applied with black-box models, and also with models that can be specified as logic programs, such as rule-based classifiers. The main focus of this work is on the specification and computation of "best" counterfactual entities, i.e. those that lead to maximum responsibility scores. From them one can read off the explanations as maximum responsibility feature values in the original entity. We also extend the programs to bring into the picture semantic or domain knowledge. We show how the approach could be extended by means of probabilistic methods, and how the underlying probability distributions could be modified through the use of constraints. Several examples of programs written in the syntax of the DLV ASP-solver, and run with it, are shown.

Submitted 7 December, 2021; v1 submitted 14 November, 2020; originally announced November 2020.

Comments: Camera-ready of journal version, with some final additions and revisions. Revised and considerably extended version of a RuleML-RR'20 paper [arXiv:2004.13237]. Submitted by invitation

arXiv:2007.14045 [pdf, ps, other]

The Tractability of SHAP-Score-Based Explanations over Deterministic and Decomposable Boolean Circuits

Authors: Marcelo Arenas, Pablo Barceló, Leopoldo Bertossi, Mikaël Monet

Abstract: Scores based on Shapley values are widely used for providing explanations to classification results over machine learning models. A prime example of this is the influential SHAP-score, a version of the Shapley value that can help explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is a computationally intractable problem, it has recently been claimed that the SHAP-score can be computed in polynomial time over the class of decision trees. In this paper, we provide a proof of a stronger result over Boolean models: the SHAP-score can be computed in polynomial time over deterministic and decomposable Boolean circuits. Such circuits, also known as tractable Boolean circuits, generalize a wide range of Boolean circuits and binary decision diagrams classes, including binary decision trees, Ordered Binary Decision Diagrams (OBDDs) and Free Binary Decision Diagrams (FBDDs). We also establish the computational limits of the notion of SHAP-score by observing that, under a mild condition, computing it over a class of Boolean models is always polynomially as hard as the model counting problem for that class. This implies that both determinism and decomposability are essential properties for the circuits that we consider, as removing one or the other renders the problem of computing the SHAP-score intractable (namely, #P-hard).

Submitted 3 April, 2021; v1 submitted 28 July, 2020; originally announced July 2020.

Comments: 17 pages, including 8 pages of main text. arXiv version of the AAAI'21 conference paper. Except for the addition of the technical appendix, the content is the same as the AAAI one

arXiv:2007.12799 [pdf, other]

Score-Based Explanations in Data Management and Machine Learning

Authors: Leopoldo Bertossi

Abstract: We describe some approaches to explanations for observed outcomes in data management and machine learning. They are based on the assignment of numerical scores to predefined and potentially relevant inputs. More specifically, we consider explanations for query answers in databases, and for results from classification models. The described approaches are mostly of a causal and counterfactual nature. We argue for the need to bring domain and semantic knowledge into score computations; and suggest some ways to do this.

Submitted 18 August, 2020; v1 submitted 24 July, 2020; originally announced July 2020.

Comments: Companion paper for a tutorial at the Scalable Uncertainty Management Conference (SUM'20). To appear in Proc. SUM'20. Minor fixes made

arXiv:2004.13237 [pdf, ps, other]

An ASP-Based Approach to Counterfactual Explanations for Classification

Authors: Leopoldo Bertossi

Abstract: We propose answer-set programs that specify and compute counterfactual interventions as a basis for causality-based explanations to decisions produced by classification models. They can be applied with black-box models and models that can be specified as logic programs, such as rule-based classifiers. The main focus is on the specification and computation of maximum responsibility causal explanations. The use of additional semantic knowledge is investigated.

Submitted 15 June, 2020; v1 submitted 27 April, 2020; originally announced April 2020.

Comments: Revised and extended version. To appear in Proc. RuleML+RR, 2020

arXiv:2003.06868 [pdf, other]

Causality-based Explanation of Classification Outcomes

Authors: Leopoldo Bertossi, Jordan Li, Maximilian Schleich, Dan Suciu, Zografoula Vagena

Abstract: We propose a simple definition of an explanation for the outcome of a classifier based on concepts from causality. We compare it with previously proposed notions of explanation, and study their complexity. We conduct an experimental evaluation with two real datasets from the financial domain.

Submitted 25 May, 2020; v1 submitted 15 March, 2020; originally announced March 2020.

Comments: 16 pages, 6 figures, 1 table

arXiv:1904.08679 [pdf, other]

doi 10.46298/lmcs-17(3:22)2021

The Shapley Value of Tuples in Query Answering

Authors: Ester Livshits, Leopoldo Bertossi, Benny Kimelfeld, Moshe Sebag

Abstract: We investigate the application of the Shapley value to quantifying the contribution of a tuple to a query answer. The Shapley value is a widely known numerical measure in cooperative game theory and in many applications of game theory for assessing the contribution of a player to a coalition game. It has been established already in the 1950s, and is theoretically justified by being the very single wealth-distribution measure that satisfies some natural axioms. While this value has been investigated in several areas, it received little attention in data management. We study this measure in the context of conjunctive and aggregate queries by defining corresponding coalition games. We provide algorithmic and complexity-theoretic results on the computation of Shapley-based contributions to query answers; and for the hard cases we present approximation algorithms.

Submitted 1 September, 2021; v1 submitted 18 April, 2019; originally announced April 2019.

Journal ref: Logical Methods in Computer Science, Volume 17, Issue 3 (September 2, 2021) lmcs:6942

arXiv:1809.10286 [pdf, other]

Repair-Based Degrees of Database Inconsistency: Computation and Complexity

Authors: Leopoldo Bertossi

Abstract: We propose a generic numerical measure of the inconsistency of a database with respect to a set of integrity constraints. It is based on an abstract repair semantics. In particular, an inconsistency measure associated to cardinality-repairs is investigated in detail. More specifically, it is shown that it can be computed via answer-set programs, but sometimes its computation can be intractable in data complexity. However, polynomial-time deterministic and randomized approximations are exhibited. The behavior of this measure under small updates is analyzed, obtaining fixed-parameter tractability results. Furthermore, alternative inconsistency measures are proposed and discussed.

Submitted 22 January, 2019; v1 submitted 26 September, 2018; originally announced September 2018.

Comments: Some editing made and some new paragraphs added

arXiv:1804.08834 [pdf, ps, other]

Measuring and Computing Database Inconsistency via Repairs

Authors: Leopoldo Bertossi

Abstract: We propose a generic numerical measure of inconsistency of a database with respect to a set of integrity constraints. It is based on an abstract repair semantics. A particular inconsistency measure associated to cardinality-repairs is investigated; and we show that it can be computed via answer-set programs. Keywords: Integrity constraints in databases, inconsistent databases, database repairs, inconsistency measure.

Submitted 12 July, 2018; v1 submitted 24 April, 2018; originally announced April 2018.

Comments: Submission as short paper; to appear in Proc. Scalable Uncertainty Management, SUM 2018. Abstract and keywords added

arXiv:1803.06445 [pdf, other]

Datalog: Bag Semantics via Set Semantics

Authors: Leopoldo Bertossi, Georg Gottlob, Reinhard Pichler

Abstract: Duplicates in data management are common and problematic. In this work, we present a translation of Datalog under bag semantics into a well-behaved extension of Datalog, the so-called {\em warded Datalog}$^\pm$, under set semantics. From a theoretical point of view, this allows us to reason on bag semantics by making use of the well-established theoretical foundations of set semantics. From a practical point of view, this allows us to handle the bag semantics of Datalog by powerful, existing query engines for the required extension of Datalog. This use of Datalog$^\pm$ is extended to give a set semantics to duplicates in Datalog$^\pm$ itself. We investigate the properties of the resulting Datalog$^\pm$ programs, the problem of deciding multiplicities, and expressibility of some bag operations. Moreover, the proposed translation has the potential for interesting applications such as to Multiset Relational Algebra and the semantic web query language SPARQL with bag semantics.

Submitted 12 February, 2019; v1 submitted 16 March, 2018; originally announced March 2018.

Comments: Extended version of paper appearing in Proc. ICDT 2019

arXiv:1712.01001 [pdf, other]

Specifying and Computing Causes for Query Answers in Databases via Database Repairs and Repair Programs

Authors: Leopoldo Bertossi

Abstract: A correspondence between database tuples as causes for query answers in databases and tuple-based repairs of inconsistent databases with respect to denial constraints has already been established. In this work, answer-set programs that specify repairs of databases are used as a basis for solving computational and reasoning problems about causes. Here, causes are also introduced at the attribute level by appealing to a repair semantics that is both null-based and attribute-based. The corresponding repair programs are presented, and they are used as a basis for computation and reasoning about attribute-level causes. They are extended to deal with the case of causality under integrity constraints.

Submitted 28 September, 2020; v1 submitted 4 December, 2017; originally announced December 2017.

Comments: To appear in "Knowledge and Information Systems" journal. This is the final version, and a much revised, corrected and extended version of: Bertossi, L. "Characterizing and Computing Causes for Query Answers in Databases from Database Repairs and Repair Programs". Proc. FoIKs, 2018, Springer LNCS 10833, pp. 55-76

arXiv:1704.05136 [pdf, ps, other]

The Causality/Repair Connection in Databases: Causality-Programs

Authors: Leopoldo Bertossi

Abstract: In this work, answer-set programs that specify repairs of databases are used as a basis for solving computational and reasoning problems about causes for query answers from databases.

Submitted 26 June, 2017; v1 submitted 17 April, 2017; originally announced April 2017.

Comments: To appear in Proc. SUM'17 as short paper, 7-pages

arXiv:1704.00115 [pdf, other]

Ontological Multidimensional Data Models and Contextual Data Quality

Authors: Leopoldo Bertossi, Mostafa Milani

Abstract: Data quality assessment and data cleaning are context-dependent activities. Motivated by this observation, we propose the Ontological Multidimensional Data Model (OMD model), which can be used to model and represent contexts as logic-based ontologies. The data under assessment is mapped into the context, for additional analysis, processing, and quality data extraction. The resulting contexts allow for the representation of dimensions, and multidimensional data quality assessment becomes possible. At the core of a multidimensional context we include a generalized multidimensional data model and a Datalog+/- ontology with provably good properties in terms of query answering. These main components are used to represent dimension hierarchies, dimensional constraints, dimensional rules, and define predicates for quality data specification. Query answering relies upon and triggers navigation through dimension hierarchies, and becomes the basic tool for the extraction of quality data. The OMD model is interesting per se, beyond applications to data quality. It allows for a logic-based, and computationally tractable representation of multidimensional data, extending previous multidimensional data models with additional expressive power and functionalities.

Submitted 13 August, 2017; v1 submitted 31 March, 2017; originally announced April 2017.

Comments: Journal submission (revised version addressing reviewers' observations). Extended version of RuleML'15 paper

arXiv:1703.03524 [pdf, other]

The Ontological Multidimensional Data Model

Authors: Leopoldo Bertossi, Mostafa Milani

Abstract: In this extended abstract we describe, mainly by examples, the main elements of the Ontological Multidimensional Data Model, which considerably extends a relational reconstruction of the multidimensional data model proposed by Hurtado and Mendelzon by means of tuple-generating dependencies, equality-generating dependencies, and negative constraints as found in Datalog+-. We briefly mention some good computational properties of the model.

Submitted 3 May, 2017; v1 submitted 9 March, 2017; originally announced March 2017.

Comments: Extended abstract. This version with minor revisions and slightly extended. To appear in Proc. AMW'17

arXiv:1611.06951 [pdf, ps, other]

Enforcing Relational Matching Dependencies with Datalog for Entity Resolution

Authors: Zeinab Bahmani, Leopoldo Bertossi

Abstract: Entity resolution (ER) is about identifying and merging records in a database that represent the same real-world entity. Matching dependencies (MDs) have been introduced and investigated as declarative rules that specify ER policies. An ER process induced by MDs over a dirty instance leads to multiple clean instances, in general. General "answer sets programs" have been proposed to specify the MD-based cleaning task and its results. In this work, we extend MDs to "relational MDs", which capture more application semantics, and identify classes of relational MDs for which the general ASP can be automatically rewritten into a stratified Datalog program, with the single clean instance as its standard model.

Submitted 25 February, 2017; v1 submitted 21 November, 2016; originally announced November 2016.

Comments: New revisions applied. To appear in Proc. FLAIRS'17

arXiv:1611.01711 [pdf, other]

Causes for Query Answers from Databases: Datalog Abduction, View-Updates, and Integrity Constraints

Authors: Leopoldo Bertossi, Babak Salimi

Abstract: Causality has been recently introduced in databases, to model, characterize, and possibly compute causes for query answers. Connections between QA-causality and consistency-based diagnosis and database repairs (wrt. integrity constraint violations) have already been established. In this work we establish precise connections between QA-causality and both abductive diagnosis and the view-update problem in databases, allowing us to obtain new algorithmic and complexity results for QA-causality. We also obtain new results on the complexity of view-conditioned causality, and investigate the notion of QA-causality in the presence of integrity constraints, obtaining complexity results from a connection with view-conditioned causality. The abduction connection under integrity constraints allows us to obtain algorithmic tools for QA-causality.

Submitted 31 July, 2017; v1 submitted 5 November, 2016; originally announced November 2016.

Comments: To appear in International Journal of Approximate Reasoning. Extended version of "Flairs'16" and "UAI'15 WS on Causality" papers

arXiv:1608.04142 [pdf, ps, other]

Contexts and Data Quality Assessment

Authors: Leopoldo Bertossi, Flavio Rizzolo

Abstract: The quality of data is context dependent. Starting from this intuition and experience, we propose and develop a conceptual framework that captures in formal terms the notion of "context-dependent data quality". We start by proposing a generic and abstract notion of context, and also of its uses, in general and in data management in particular. On this basis, we investigate "data quality assessment" and "quality query answering" as context-dependent activities. A context for the assessment of a database D at hand is modeled as an external database schema, with possibly materialized or virtual data, and connections to external data sources. The database D is put in context via mappings to the contextual schema, which produces a collection C of alternative clean versions of D. The quality of D is measured in terms of its distance to C. The class C is also used to define and do "quality query answering". The proposed model allows for natural extensions, like the use of data quality predicates, the optimization of the access by the context to external data sources, and also the representation of contexts by means of more expressive ontologies.

Submitted 14 August, 2016; originally announced August 2016.

arXiv:1607.02682 [pdf, ps, other]

Extending Weakly-Sticky Datalog+/-: Query-Answering Tractability and Optimizations

Authors: Mostafa Milani, Leopoldo Bertossi

Abstract: Weakly-sticky (WS) Datalog+/- is an expressive member of the family of Datalog+/- programs that is based on the syntactic notions of stickiness and weak-acyclicity. Query answering over the WS programs has been investigated, but there is still much work to do on the design and implementation of practical query answering (QA) algorithms and their optimizations. Here, we study sticky and WS programs from the point of view of the behavior of the chase procedure, extending the stickiness property of the chase to that of generalized stickiness of the chase (gsch-property). With this property we specify the semantic class of GSCh programs, which includes sticky and WS programs, and other syntactic subclasses that we identify. In particular, we introduce joint-weakly-sticky (JWS) programs, that include WS programs. We also propose a bottom-up QA algorithm for a range of subclasses of GSCh. The algorithm runs in polynomial time (in data) for JWS programs. Unlike the WS class, JWS is closed under a general magic-sets rewriting procedure for the optimization of programs with existential rules. We apply the magic-sets rewriting in combination with the proposed QA algorithm for the optimization of QA over JWS programs.

Submitted 9 July, 2016; originally announced July 2016.

Comments: Extended version of RR'16 paper

arXiv:1606.01930 [pdf, other]

doi 10.1017/S147106841600017X

Consistency and Trust in Peer Data Exchange Systems

Authors: Leopoldo Bertossi, Loreto Bravo

Abstract: We propose and investigate a semantics for "peer data exchange systems" where different peers are related by data exchange constraints and trust relationships. These two elements plus the data at the peers' sites and their local integrity constraints are made compatible via a semantics that characterizes sets of "solution instances" for the peers. They are the intended -possibly virtual- instances for a peer that are obtained through a data repair semantics that we introduce and investigate. The semantically correct answers from a peer to a query, the so-called "peer consistent answers", are defined as those answers that are invariant under all its different solution instances. We show that solution instances can be specified as the models of logic programs with a stable model semantics. The repair semantics is based on null values as used in SQL databases, and is also of independent interest for repairs of single databases with respect to integrity constraints.

Submitted 6 June, 2016; originally announced June 2016.

Comments: To appear in Theory and Practice of Logic Programming (TPLP). It includes appendix that will be published only in electronic format

arXiv:1605.07159 [pdf, other]

Complexity of Consistent Query Answering in Databases under Cardinality-Based and Incremental Repair Semantics (extended version)

Authors: Andrei Lopatenko, Leopoldo Bertossi

Abstract: A database D may be inconsistent wrt a given set IC of integrity constraints. Consistent Query Answering (CQA) is the problem of computing from D the answers to a query that are consistent wrt IC. Consistent answers are invariant under all the repairs of D, i.e. the consistent instances that minimally depart from D. Three classes of repair have been considered in the literature: those that minimize set-theoretically the set of tuples in the symmetric difference; those that minimize the changes of attribute values, and those that minimize the cardinality of the set of tuples in the symmetric difference. The latter class has not been systematically investigated. In this paper we obtain algorithmic and complexity theoretic results for CQA under this cardinality-based repair semantics. We do this in the usual, static setting, but also in a dynamic framework where a consistent database is affected by a sequence of updates, which may make it inconsistent. We also establish comparative results with the other two kinds of repairs in the dynamic case.

Submitted 23 May, 2016; originally announced May 2016.

Comments: This paper, without the proofs provided here, arXiv:cs/0604002, appeared in the Proc. of ICDT 2007. This version contains all the proofs in correlation with the results reported in the ICDT paper (as opposed to a previous arXiv CoRR posting related to the same paper). One proof was corrected, and a corollary was added

arXiv:1604.06770 [pdf, ps, other]

A Hybrid Approach to Query Answering under Expressive Datalog+/-

Authors: Mostafa Milani, Andrea Cali, Leopoldo Bertossi

Abstract: Datalog+/- is a family of ontology languages that combine good computational properties with high expressive power. Datalog+/- languages are provably able to capture the most relevant Semantic Web languages. In this paper we consider the class of weakly-sticky (WS) Datalog+/- programs, which allow for certain useful forms of joins in rule bodies as well as extending the well-known class of weakly-acyclic TGDs. So far, only non-deterministic algorithms were known for answering queries on WS Datalog+/- programs. We present novel deterministic query answering algorithms under WS Datalog+/-. In particular, we propose: (1) a bottom-up grounding algorithm based on a query-driven chase, and (2) a hybrid approach based on transforming a WS program into a so-called sticky one, for which query rewriting techniques are known. We discuss how our algorithms can be optimized and effectively applied for query answering in real-world scenarios.

Submitted 25 July, 2016; v1 submitted 22 April, 2016; originally announced April 2016.

Comments: Extended version of RR'16 paper, to appear

arXiv:1603.02705 [pdf, ps, other]

Quantifying Causal Effects on Query Answering in Databases

Authors: Babak Salimi, Leopoldo Bertossi, Dan Suciu, Guy Van den Broeck

Abstract: The notion of actual causation, as formalized by Halpern and Pearl, has been recently applied to relational databases, to characterize and compute actual causes for possibly unexpected answers to monotone queries. Causes take the form of database tuples, and can be ranked according to their causal responsibility, a numerical measure of their relevance as a cause to the query answer. In this work we revisit this notion, introducing and making a case for an alternative measure of causal contribution, that of causal effect. The measure generalizes actual causes, and can be applied beyond monotone queries. We show that causal effect provides intuitive and intended results.

Submitted 24 April, 2016; v1 submitted 8 March, 2016; originally announced March 2016.

Comments: To appear in Proc. TAPP'16

ACM Class: H.2; I.2

arXiv:1602.06458 [pdf, other]

Causes for Query Answers from Databases, Datalog Abduction and View-Updates: The Presence of Integrity Constraints

Authors: Babak Salimi, Leopoldo Bertossi

Abstract: Causality has been recently introduced in databases, to model, characterize and possibly compute causes for query results (answers). Connections between query-answer causality, consistency-based diagnosis, database repairs (wrt. integrity constraint violations), abductive diagnosis and the view-update problem have been established. In this work we further investigate connections between query-answer causality and abductive diagnosis and the view-update problem. In this context, we also define and investigate the notion of query-answer causality in the presence of integrity constraints.

Submitted 20 February, 2016; originally announced February 2016.

Comments: To appear in Proceedings Flairs, 2016

arXiv:1602.02334 [pdf, other]

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

Authors: Zeinab Bahmani, Leopoldo Bertossi, Nikolaos Vasiloglou

Abstract: Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called "matching dependencies" (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating four components of ER: (a) Building a classifier for duplicate/non-duplicate record pairs built using machine learning (ML) techniques; (b) Use of MDs for supporting the blocking phase of ML; (c) Record merging on the basis of the classifier results; and (d) The use of the declarative language "LogiQL" -an extended form of Datalog supported by the "LogicBlox" platform- for all activities related to data processing, and the specification and enforcement of MDs.

Submitted 18 January, 2017; v1 submitted 6 February, 2016; originally announced February 2016.

Comments: Final journal version, with some minor technical corrections. Extended version of arXiv:1508.06013

arXiv:1508.06013 [pdf, other]

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

Authors: Zeinab Bahmani, Leopoldo Bertossi, Nikolaos Vasiloglou

Abstract: Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating three components of ER: (a) Classifiers for duplicate/non-duplicate record pairs built using machine learning (ML) techniques, (b) MDs for supporting both the blocking phase of ML and the merge itself; and (c) The use of the declarative language LogiQL -an extended form of Datalog supported by the LogicBlox platform- for data processing, and the specification and enforcement of MDs.

Submitted 24 August, 2015; originally announced August 2015.

Comments: To appear in Proc. SUM, 2015

Journal ref: Proc. SUM'15, 2015, Springer LNAI 9310, pp. 399-414

arXiv:1507.00257 [pdf, ps, other]

From Causes for Database Queries to Repairs and Model-Based Diagnosis and Back

Authors: Leopoldo Bertossi, Babak Salimi

Abstract: In this work we establish and investigate connections between causes for query answers in databases, database repairs wrt. denial constraints, and consistency-based diagnosis. The first two are relatively new research areas in databases, and the third one is an established subject in knowledge representation. We show how to obtain database repairs from causes, and the other way around. Causality problems are formulated as diagnosis problems, and the diagnoses provide causes and their responsibilities. The vast body of research on database repairs can be applied to the newer problems of computing actual causes for query answers and their responsibilities. These connections, which are interesting per se, allow us, after a transition -inspired by consistency-based diagnosis- to computational problems on hitting sets and vertex covers in hypergraphs, to obtain several new algorithmic and complexity results for database causality.

Submitted 23 October, 2016; v1 submitted 1 July, 2015; originally announced July 2015.

Comments: To appear in Theory of Computing Systems. By invitation to special issue with extended papers from ICDT 2015 (paper arXiv:1412.4311)

arXiv:1506.04299 [pdf, ps, other]

Query-Answer Causality in Databases: Abductive Diagnosis and View-Updates

Authors: Babak Salimi, Leopoldo Bertossi

Abstract: Causality has been recently introduced in databases, to model, characterize and possibly compute causes for query results (answers). Connections between query causality and consistency-based diagnosis and database repairs (wrt. integrity constraint violations) have been established in the literature. In this work we establish connections between query causality and abductive diagnosis and the view-update problem. The unveiled relationships allow us to obtain new complexity results for query causality -the main focus of our work- and also for the two other areas.

Submitted 19 September, 2015; v1 submitted 13 June, 2015; originally announced June 2015.

Comments: To appear in Proc. UAI Causal Inference Workshop, 2015. One example was fixed

arXiv:1504.03386 [pdf, ps, other]

Tractable Query Answering and Optimization for Extensions of Weakly-Sticky Datalog+-

Authors: Mostafa Milani, Leopoldo Bertossi

Abstract: We consider a semantic class, weakly-chase-sticky (WChS), and a syntactic subclass, jointly-weakly-sticky (JWS), of Datalog+- programs. Both extend that of weakly-sticky (WS) programs, which appear in our applications to data quality. For WChS programs we propose a practical, polynomial-time query answering algorithm (QAA). We establish that the two classes are closed under magic-sets rewritings. As a consequence, QAA can be applied to the optimized programs. QAA takes as inputs the program (including the query) and semantic information about the "finiteness" of predicate positions. For the syntactic subclasses JWS and WS of WChS, this additional information is computable.

Submitted 13 April, 2015; originally announced April 2015.

Comments: To appear in Proc. Alberto Mendelzon WS on Foundations of Data Management (AMW15)

arXiv:1412.4311 [pdf, other]

From Causes for Database Queries to Repairs and Model-Based Diagnosis and Back

Authors: Babak Salimi, Leopoldo Bertossi

Abstract: In this work we establish and investigate connections between causality for query answers in databases, database repairs wrt. denial constraints, and consistency-based diagnosis. The first two are relatively new problems in databases, and the third one is an established subject in knowledge representation. We show how to obtain database repairs from causes and the other way around. Causality problems are formulated as diagnosis problems, and the diagnoses provide causes and their responsibilities. The vast body of research on database repairs can be applied to the newer problem of determining actual causes for query answers and their responsibilities. These connections, which are interesting per se, allow us, after a transition -inspired by consistency-based diagnosis- to computational problems on hitting sets and vertex covers in hypergraphs, to obtain several new algorithmic and complexity results for causality in databases.

Submitted 13 December, 2014; originally announced December 2014.

Comments: Extended version of paper to appear in Proceedings of ICDT 2015

arXiv:1405.4228 [pdf, ps, other]

Unifying Causality, Diagnosis, Repairs and View-Updates in Databases

Authors: Leopoldo Bertossi, Babak Salimi

Abstract: In this work we establish and point out connections between the notion of query-answer causality in databases and database repairs, model-based diagnosis in its consistency-based and abductive versions, and database updates through views. The mutual relationships among these areas of data management and knowledge representation shed light on each of them and help to share notions and results they have in common. In one way or another, these are all approaches to uncertainty management, which becomes even more relevant in the context of big data that have to be made sense of.

Submitted 28 June, 2014; v1 submitted 16 May, 2014; originally announced May 2014.

Comments: On-line Proc. First International Workshop on Big Uncertain Data (BUDA 2014). Co-located with ACM PODS 2014. arXiv admin note: text overlap with arXiv:1404.6857

arXiv:1404.6857 [pdf, ps, other]

Causality in Databases: The Diagnosis and Repair Connections

Authors: Babak Salimi, Leopoldo Bertossi

Abstract: In this work we establish and investigate the connections between causality for query answers in databases, database repairs wrt. denial constraints, and consistency-based diagnosis. The first two are relatively new problems in databases, and the third one is an established subject in knowledge representation. We show how to obtain database repairs from causes and the other way around. The vast body of research on database repairs can be applied to the newer problem of determining actual causes for query answers. By formulating a causality problem as a diagnosis problem, we manage to characterize causes in terms of a system's diagnoses.

Submitted 28 June, 2014; v1 submitted 27 April, 2014; originally announced April 2014.

Comments: Proc. 15th International Workshop on Non-Monotonic Reasoning (NMR 2014)

arXiv:1312.7373 [pdf, ps, other]

Extending Contexts with Ontologies for Multidimensional Data Quality Assessment

Authors: Mostafa Milani, Leopoldo Bertossi, Sina Ariyan

Abstract: Data quality and data cleaning are context-dependent activities. Starting from this observation, in previous work a context model for the assessment of the quality of a database instance was proposed. In that framework, the context takes the form of a possibly virtual database or data integration system into which a database instance under quality assessment is mapped, for additional analysis and processing, enabling quality assessment. In this work we extend contexts with dimensions, and by doing so, we make possible a multidimensional assessment of data quality. Multidimensional contexts are represented as ontologies written in Datalog+-. We use this language for representing dimensional constraints and dimensional rules, and also for doing query answering based on dimensional navigation, which becomes an important auxiliary activity in the assessment of data. We show ideas and mechanisms by means of examples.

Submitted 20 January, 2014; v1 submitted 27 December, 2013; originally announced December 2013.

Comments: To appear in Proc. 5th International Workshop on Data Engineering meets the Semantic Web (DESWeb). In conjunction with ICDE 2014

arXiv:1309.1884 [pdf, ps, other]

Tractable vs. Intractable Cases of Matching Dependencies for Query Answering under Entity Resolution

Authors: Leopoldo Bertossi, Jaffer Gardezi

Abstract: Matching Dependencies (MDs) are a relatively recent proposal for declarative entity resolution. They are rules that specify, on the basis of similarities satisfied by values in a database, what values should be considered duplicates, and have to be matched. On the basis of a chase-like procedure for MD enforcement, we can obtain clean (duplicate-free) instances; actually possibly several of them. The resolved answers to queries are those that are invariant under the resulting class of resolved instances. Previous work identified certain classes of queries and sets of MDs for which resolved query answering is tractable. Special emphasis was placed on cyclic sets of MDs. In this work we further investigate the complexity of this problem, identifying intractable cases, and exploring the frontier between tractability and intractability. We concentrate mostly on acyclic sets of MDs. For a special case we obtain a dichotomy result relative to NP-hardness.

Submitted 6 April, 2014; v1 submitted 7 September, 2013; originally announced September 2013.

arXiv:1304.7854 [pdf, ps, other]

On the Complexity of Query Answering under Matching Dependencies for Entity Resolution

Authors: Leopoldo Bertossi, Jaffer Gardezi

Abstract: Matching Dependencies (MDs) are a relatively recent proposal for declarative entity resolution. They are rules that specify, given the similarities satisfied by values in a database, what values should be considered duplicates, and have to be matched. On the basis of a chase-like procedure for MD enforcement, we can obtain clean (duplicate-free) instances; actually possibly several of them. The resolved answers to queries are those that are invariant under the resulting class of resolved instances. In previous work we identified some tractable cases (i.e. for certain classes of queries and MDs) of resolved query answering. In this paper we further investigate the complexity of this problem, identifying some intractable cases. For a special case we obtain a dichotomy complexity result.

Submitted 26 May, 2013; v1 submitted 30 April, 2013; originally announced April 2013.

Comments: To appear in Proc. of the Alberto Mendelzon International Workshop on Foundations of Data Management (AMW 2013)

arXiv:1112.5908 [pdf, ps, other]

Query Answering under Matching Dependencies for Data Cleaning: Complexity and Algorithms

Authors: Jaffer Gardezi, Leopoldo Bertossi

Abstract: Matching dependencies (MDs) have been recently introduced as declarative rules for entity resolution (ER), i.e. for identifying and resolving duplicates in a relational instance $D$. A set of MDs can be used as the basis for a possibly non-deterministic mechanism that computes a duplicate-free instance from $D$. The possible results of this process are the clean, "minimally resolved instances" (MRIs). There might be several MRIs for $D$, and the "resolved answers" to a query are those that are shared by all the MRIs. We investigate the problem of computing resolved answers. We look at various sets of MDs, developing syntactic criteria for determining (in)tractability of the resolved answer problem, including a dichotomy result. For some tractable classes of MDs and conjunctive queries, we present a query rewriting methodology that can be used to retrieve the resolved answers. We also investigate connections with "consistent query answering", deriving further tractability results for MD-based ER.

Submitted 26 December, 2011; originally announced December 2011.

Comments: Conference submission, 2011

arXiv:1106.1478 [pdf, ps, other]

Consistent Query Answering under Spatial Semantic Constraints

Authors: M. Andrea Rodríguez, Leopoldo Bertossi, Monica Caniupan

Abstract: Consistent query answering is an inconsistency-tolerant approach to obtaining semantically correct answers from a database that may be inconsistent with respect to its integrity constraints. In this work we formalize the notion of consistent query answer for spatial databases and spatial semantic integrity constraints. In order to do this, we first characterize conflicting spatial data, and next, we define admissible instances that restore consistency while staying close to the original instance. In this way we obtain a repair semantics, which is used as an instrumental concept to define and possibly derive consistent query answers. We then concentrate on a class of spatial denial constraints and spatial queries for which there exists an efficient strategy to compute consistent query answers. This study applies inconsistency tolerance in spatial databases, raising research issues that shift the goal from the consistency of a spatial database to the consistency of query answering.

Submitted 7 June, 2011; originally announced June 2011.

Comments: Journal submission, 2010

arXiv:1105.1364 [pdf, ps, other]

Achieving Data Privacy through Secrecy Views and Null-Based Virtual Updates

Authors: Leopoldo Bertossi, Lechen Li

Abstract: There may be sensitive information in a relational database, and we might want to keep it hidden from a user or group thereof. In this work, sensitive data is characterized as the contents of a set of secrecy views. For a user without permission to access that sensitive data, the database instance he queries is updated to make the contents of the views empty or contain only tuples with null values. In particular, if this user poses a query about any of these views, no meaningful information is returned. Since the database is not expected to be physically changed to produce this result, the updates are only virtual, and also minimal in a precise way. These minimal updates are reflected in the secrecy view contents, and also in the fact that query answers, while being privacy preserving, are also maximally informative. Virtual updates are based on the use of null values as used in the SQL standard. We provide the semantics of secrecy views and the virtual updates. The different ways in which the underlying database is virtually updated are specified as the models of a logic program with stable model semantics. The program becomes the basis for the computation of the "secret answers" to queries, i.e. those that do not reveal the sensitive information.

Submitted 5 April, 2012; v1 submitted 6 May, 2011; originally announced May 2011.

Comments: Minor revisions of journal resubmission, 2012

ACM Class: H.2; F.4.1

Journal ref: IEEE Transactions on Knowledge and Data Engineering, 2013, 25(5):987-1000

arXiv:1008.4627 [pdf, ps, other]

Matching Dependencies with Arbitrary Attribute Values: Semantics, Query Answering and Integrity Constraints

Authors: Jaffer Gardezi, Leopoldo Bertossi, Iluju Kiringa

Abstract: Matching dependencies (MDs) were introduced to specify the identification or matching of certain attribute values in pairs of database tuples when some similarity conditions are satisfied. Their enforcement can be seen as a natural generalization of entity resolution. In what we call the "pure case" of MDs, any value from the underlying data domain can be used for the value in common that does the matching. We investigate the semantics and properties of data cleaning through the enforcement of matching dependencies for the pure case. We characterize the intended clean instances and also the "clean answers" to queries as those that are invariant under the cleaning process. The complexity of computing clean instances and clean answers to queries is investigated. Tractable and intractable cases depending on the MDs and queries are identified. Finally, we establish connections with database "repairs" under integrity constraints.

Submitted 26 August, 2010; originally announced August 2010.

Comments: 13 pages, double column, 2 figures

ACM Class: H.2; H.2.0; H.2.3

Showing 1–50 of 57 results for author: Bertossi, L