Search | arXiv e-print repository

Effective Explanations for Entity Resolution Models

Authors: Tommaso Teofili, Donatella Firmani, Nick Koudas, Vincenzo Martello, Paolo Merialdo, Divesh Srivastava

Abstract: Entity resolution (ER) aims at matching records that refer to the same real-world entity. Although widely studied for the last 50 years, ER still represents a challenging data management problem, and several recent works have started to investigate the opportunity of applying deep learning (DL) techniques to solve this problem. In this paper, we study the fundamental problem of explainability of t… ▽ More Entity resolution (ER) aims at matching records that refer to the same real-world entity. Although widely studied for the last 50 years, ER still represents a challenging data management problem, and several recent works have started to investigate the opportunity of applying deep learning (DL) techniques to solve this problem. In this paper, we study the fundamental problem of explainability of the DL solution for ER. Understanding the matching predictions of an ER solution is indeed crucial to assess the trustworthiness of the DL model and to discover its biases. We treat the DL model as a black box classifier and - while previous approaches to provide explanations for DL predictions are agnostic to the classification task. we propose the CERTA approach that is aware of the semantics of the ER problem. Our approach produces both saliency explanations, which associate each attribute with a saliency score, and counterfactual explanations, which provide examples of values that can flip the prediction. CERTA builds on a probabilistic framework that aims at computing the explanations evaluating the outcomes produced by using perturbed copies of the input records. We experimentally evaluate CERTA's explanations of state-of-the-art ER solutions based on DL models using publicly available datasets, and demonstrate the effectiveness of CERTA over recently proposed methods for this problem. △ Less

Submitted 1 April, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

arXiv:2101.11259 [pdf, other]

Alaska: A Flexible Benchmark for Data Integration Tasks

Authors: Valter Crescenzi, Andrea De Angelis, Donatella Firmani, Maurizio Mazzei, Paolo Merialdo, Federico Piai, Divesh Srivastava

Abstract: Data integration is a long-standing interest of the data management community and has many disparate applications, including business, science and government. We have recently witnessed impressive results in specific data integration tasks, such as Entity Resolution, thanks to the increasing availability of benchmarks. A limitation of such benchmarks is that they typically come with their own task… ▽ More Data integration is a long-standing interest of the data management community and has many disparate applications, including business, science and government. We have recently witnessed impressive results in specific data integration tasks, such as Entity Resolution, thanks to the increasing availability of benchmarks. A limitation of such benchmarks is that they typically come with their own task definition and it can be difficult to leverage them for complex integration pipelines. As a result, evaluating end-to-end pipelines for the entire data integration process is still an elusive goal. In this work, we present Alaska, the first benchmark based on real-world dataset to support seamlessly multiple tasks (and their variants) of the data integration pipeline. The dataset consists of ~70k heterogeneous product specifications from 71 e-commerce websites with thousands of different product attributes. Our benchmark comes with profiling meta-data, a set of pre-defined use cases with diverse characteristics, and an extensive manually curated ground truth. We demonstrate the flexibility of our benchmark by focusing on several variants of two crucial data integration tasks, Schema Matching and Entity Resolution. Our experiments show that our benchmark enables the evaluation of a variety of methods that previously were difficult to compare, and can foster the design of more holistic data integration solutions. △ Less

Submitted 3 February, 2021; v1 submitted 27 January, 2021; originally announced January 2021.

arXiv:2005.14326 [pdf, other]

doi 10.1007/s00778-021-00656-7

Efficient and Effective ER with Progressive Blocking

Authors: Sainyam Galhotra, Donatella Firmani, Barna Saha, Divesh Srivastava

Abstract: Blocking is a mechanism to improve the efficiency of Entity Resolution (ER) which aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this pap… ▽ More Blocking is a mechanism to improve the efficiency of Entity Resolution (ER) which aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness-efficiency trade-off is revealed only when the output of ER starts to be available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we get the desired trade-off, leveraging a limited amount of ER results as a guidance at every round. We formally prove that pBlocking converges efficiently ($O(n log^2 n)$ time complexity, where n is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5x and 60% respectively, improving the overall F-score of the entire ER process up to 60%. △ Less

Submitted 16 March, 2021; v1 submitted 28 May, 2020; originally announced May 2020.

Comments: Galhotra, S., Firmani, D., Saha, B. et al. Efficient and effective ER with progressive blocking. The VLDB Journal (2021)

arXiv:2002.00819 [pdf, other]

doi 10.1145/3424672

Knowledge Graph Embedding for Link Prediction: A Comparative Analysis

Authors: Andrea Rossi, Donatella Firmani, Antonio Matinata, Paolo Merialdo, Denilson Barbosa

Abstract: Knowledge Graphs (KGs) have found many applications in industry and academic settings, which in turn, have motivated considerable research efforts towards large-scale information extraction from a variety of sources. Despite such efforts, it is well known that even state-of-the-art KGs suffer from incompleteness. Link Prediction (LP), the task of predicting missing facts among entities already a K… ▽ More Knowledge Graphs (KGs) have found many applications in industry and academic settings, which in turn, have motivated considerable research efforts towards large-scale information extraction from a variety of sources. Despite such efforts, it is well known that even state-of-the-art KGs suffer from incompleteness. Link Prediction (LP), the task of predicting missing facts among entities already a KG, is a promising and widely studied task aimed at addressing KG incompleteness. Among the recent LP techniques, those based on KG embeddings have achieved very promising performances in some benchmarks. Despite the fast growing literature in the subject, insufficient attention has been paid to the effect of the various design choices in those methods. Moreover, the standard practice in this area is to report accuracy by aggregating over a large number of test facts in which some entities are over-represented; this allows LP methods to exhibit good performance by just attending to structural properties that include such entities, while ignoring the remaining majority of the KG. This analysis provides a comprehensive comparison of embedding-based LP methods, extending the dimensions of analysis beyond what is commonly available in the literature. We experimentally compare effectiveness and efficiency of 16 state-of-the-art methods, consider a rule-based baseline, and report detailed analysis over the most popular benchmarks in the literature. △ Less

Submitted 21 January, 2021; v1 submitted 3 February, 2020; originally announced February 2020.

Comments: Andrea Rossi, Donatella Firmani, Antonio Matinata, Paolo Merialdo, Denilson Barbosa. 2020. Knowledge Graph Embedding for Link Prediction: A Comparative Analysis. In ACM Transactions on Knowledge Discovery from Data. January 2021. (TKDD 2021). ACM, New York, NY, USA

arXiv:1901.10232 [pdf, other]

Multikernel activation functions: formulation and a case study

Authors: Simone Scardapane, Elena Nieddu, Donatella Firmani, Paolo Merialdo

Abstract: The design of activation functions is a growing research area in the field of neural networks. In particular, instead of using fixed point-wise functions (e.g., the rectified linear unit), several authors have proposed ways of learning these functions directly from the data in a non-parametric fashion. In this paper we focus on the kernel activation function (KAF), a recently proposed framework wh… ▽ More The design of activation functions is a growing research area in the field of neural networks. In particular, instead of using fixed point-wise functions (e.g., the rectified linear unit), several authors have proposed ways of learning these functions directly from the data in a non-parametric fashion. In this paper we focus on the kernel activation function (KAF), a recently proposed framework wherein each function is modeled as a one-dimensional kernel model, whose weights are adapted through standard backpropagation-based optimization. One drawback of KAFs is the need to select a single kernel function and its eventual hyper-parameters. To partially overcome this problem, we motivate an extension of the KAF model, in which multiple kernels are linearly combined at every neuron, inspired by the literature on multiple kernel learning. We provide an application of the resulting multi-KAF on a realistic use case, specifically handwritten Latin OCR, on a large dataset collected in the context of the `In Codice Ratio' project. Results show that multi-KAFs can improve the accuracy of the convolutional networks previously developed for the task, with faster convergence, even with a smaller number of overall parameters. △ Less

Submitted 29 January, 2019; originally announced January 2019.

Comments: Accepted for presentation at INNS BDDL 2019 (https://innsbddl2019.org)

arXiv:1803.03200 [pdf, other]

doi 10.1145/3219819.3219879

Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio -- Episode 1: Machine Transcription of the Manuscripts

Authors: Donatella Firmani, Marco Maiorino, Paolo Merialdo, Elena Nieddu

Abstract: In Codice Ratio is a research project to study tools and techniques for analyzing the contents of historical documents conserved in the Vatican Secret Archives (VSA). In this paper, we present our efforts to develop a system to support the transcription of medieval manuscripts. The goal is to provide paleographers with a tool to reduce their efforts in transcribing large volumes, as those stored i… ▽ More In Codice Ratio is a research project to study tools and techniques for analyzing the contents of historical documents conserved in the Vatican Secret Archives (VSA). In this paper, we present our efforts to develop a system to support the transcription of medieval manuscripts. The goal is to provide paleographers with a tool to reduce their efforts in transcribing large volumes, as those stored in the VSA, producing good transcriptions for significant portions of the manuscripts. We propose an original approach based on character segmentation. Our solution is able to deal with the dirty segmentation that inevitably occurs in handwritten documents. We use a convolutional neural network to recognize characters and language models to compose word transcriptions. Our approach requires minimal training efforts, making the transcription process more scalable as the production of training sets requires a few pages and can be easily crowdsourced. We have conducted experiments on manuscripts from the Vatican Registers, an unreleased corpus containing the correspondence of the popes. With training data produced by 120 high school students, our system has been able to produce good transcriptions that can be used by paleographers as a solid basis to speedup the transcription process at a large scale. △ Less

Submitted 12 September, 2018; v1 submitted 8 March, 2018; originally announced March 2018.

Comments: Donatella Firmani, Marco Maiorino, Paolo Merialdo, and Elena Nieddu. 2018. Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio - Episode 1: Machine Transcription of the Manuscripts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 263-272

arXiv:1202.0319 [pdf, other]

doi 10.1002/net.21450

Real-Time Monitoring of Undirected Networks: Articulation Points, Bridges, and Connected and Biconnected Components

Authors: Giorgio Ausiello, Donatella Firmani, Luigi Laura

Abstract: In this paper we present the first algorithm in the streaming model to characterize completely the biconnectivity properties of undirected networks: articulation points, bridges, and connected and biconnected components. The motivation of our work was the development of a real-time algorithm to monitor the connectivity of the Autonomous Systems (AS) Network, but the solution provided is general en… ▽ More In this paper we present the first algorithm in the streaming model to characterize completely the biconnectivity properties of undirected networks: articulation points, bridges, and connected and biconnected components. The motivation of our work was the development of a real-time algorithm to monitor the connectivity of the Autonomous Systems (AS) Network, but the solution provided is general enough to be applied to any network. The network structure is represented by a graph, and the algorithm is analyzed in the datastream framework. Here, as in the \emph{on-line} model, the input graph is revealed one item (i.e., graph edge) after the other, in an on-line fashion; but, if compared to traditional on-line computation, there are stricter requirements for both memory occupation and per item processing time. Our algorithm works by properly updating a forest over the graph nodes. All the graph (bi)connectivity properties are stored in this forest. We prove the correctness of the algorithm, together with its space ($O(n\,\log n)$, with $n$ being the number of nodes in the graph) and time bounds. We also present the results of a brief experimental evaluation against real-world graphs, including many samples of the AS network, ranging from medium to massive size. These preliminary experimental results confirm the effectiveness of our approach. △ Less

Submitted 1 February, 2012; originally announced February 2012.

Showing 1–7 of 7 results for author: Firmani, D