-
TableDC: Deep Clustering for Tabular Data
Authors:
Hafiz Tayyab Rauf,
Andre Freitas,
Norman W. Paton
Abstract:
Deep clustering (DC), a fusion of deep representation learning and clustering, has recently demonstrated positive results in data science, particularly text processing and computer vision. However, joint optimization of feature learning and data distribution in the multi-dimensional space is domain-specific, so existing DC methods struggle to generalize to other application domains (such as data integration and cleaning). In data management tasks, where high-density embeddings and overlapping clusters dominate, a data management-specific DC algorithm should be able to interact better with the data properties for supporting data cleaning and integration tasks. This paper presents a deep clustering algorithm for tabular data (TableDC) that reflects the properties of data management applications, particularly schema inference, entity resolution, and domain discovery. To address overlapping clusters, TableDC integrates Mahalanobis distance, which considers variance and correlation within the data, offering a similarity method suitable for tables, rows, or columns in high-dimensional latent spaces. TableDC provides flexibility for the final clustering assignment and shows higher tolerance to outliers through its heavy-tailed Cauchy distribution as the similarity kernel. The proposed similarity measure is particularly beneficial where the embeddings of raw data are densely packed and exhibit high degrees of overlap. Data cleaning tasks may involve a large number of clusters, which affects the scalability of existing DC methods. TableDC's self-supervised module efficiently learns data embeddings with a large number of clusters compared to existing benchmarks, which scale in quadratic time. We evaluated TableDC with several existing DC, Standard Clustering (SC), and state-of-the-art bespoke methods over benchmark datasets. TableDC consistently outperforms existing DC, SC, and bespoke methods.
Submitted 27 May, 2024;
originally announced May 2024.
-
Deep Clustering for Data Cleaning and Integration
Authors:
Hafiz Tayyab Rauf,
Andre Freitas,
Norman W. Paton
Abstract:
Deep Learning (DL) techniques now constitute the state-of-the-art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, the impact of DC on mainstream data management tasks remains unexplored. In this paper, we address this gap by investigating the impact of DC in data cleaning and integration tasks, specifically schema inference, entity resolution, and domain discovery, tasks that represent clustering from the perspective of tables, rows, and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. However, we observed a significant correlation between the DC method and embedding approaches for rows, columns, and tables, highlighting that a suitable combination can enhance the efficiency of DC methods.
Submitted 22 September, 2023; v1 submitted 22 May, 2023;
originally announced May 2023.
-
Towards Schema Inference for Data Lakes
Authors:
Nour Alhammad,
Alex Bogatu,
Norman W Paton
Abstract:
A data lake is a repository of data with potential for future analysis. However, both discovering what data is in a data lake and exploring related data sets can take significant effort, as a data lake can contain an intimidating amount of heterogeneous data. In this paper, we propose the use of schema inference to support the interpretation of the data in the data lake. If a data lake is to support a schema-on-read paradigm, understanding the existing schema of relevant portions of the data lake seems like a prerequisite. In this paper, we make use of approximate indexes that can be used for data discovery to inform the inference of a schema for a data lake, consisting of entity types and the relationships between them. The specific approach identifies candidate entity types by clustering similar data sets from the data lake, and then relationships between data sets in different clusters are used to inform the identification of relationships between the entity types. The approach is evaluated using real-world data repositories, to identify where the proposal is effective, and to inform the identification of areas for further work.
Submitted 8 June, 2022;
originally announced June 2022.
-
Dataset Discovery in Data Lakes
Authors:
Alex Bogatu,
Alvaro A. A. Fernandes,
Norman W. Paton,
Nikolaos Konstantinou
Abstract:
Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the issue arises of how to pull out of the lake those datasets that might contribute to wrangling out a given target. We refer to this as the problem of dataset discovery in data lakes and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash-based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness w.r.t. a target table. Given the latter (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report on empirical results for two forms of relatedness (unionability and joinability) comparing them with prior work, where pertinent, and showing significant improvements in all of precision, recall, target coverage, indexing and discovery times.
Submitted 20 November, 2020;
originally announced November 2020.
-
Cost-effective Variational Active Entity Resolution
Authors:
Alex Bogatu,
Norman W. Paton,
Mark Douthwaite,
Stuart Davie,
Andre Freitas
Abstract:
Accurately identifying different representations of the same real-world entity is an integral part of data cleaning and many methods have been proposed to accomplish it. The challenges of this entity resolution task that demand so much research attention are often rooted in the task-specificity and user-dependence of the process. Adopting deep learning techniques has the potential to lessen these challenges. In this paper, we set out to devise an entity resolution method that builds on the robustness conferred by deep autoencoders to reduce human-involvement costs. Specifically, we reduce the cost of training deep entity resolution models by performing unsupervised representation learning. This unveils a transferability property of the resulting model that can further reduce the cost of applying the approach to new datasets by means of transfer learning. Finally, we reduce the cost of labelling training data through an active learning approach that builds on the properties conferred by the use of deep autoencoders. Empirical evaluation confirms the accomplishment of our cost-reduction desideratum while achieving comparable effectiveness with state-of-the-art alternatives.
Submitted 26 February, 2021; v1 submitted 20 November, 2020;
originally announced November 2020.
-
Data Context Informed Data Wrangling
Authors:
Martin Koehler,
Alex Bogatu,
Cristina Civili,
Nikolaos Konstantinou,
Edward Abel,
Alvaro A. A. Fernandes,
John Keane,
Leonid Libkin,
Norman W. Paton
Abstract:
The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process have been carried out using Extract-Transform-Load platforms, with significant manual involvement in specifying, configuring or tuning many of them. Cost-effective data wrangling processes need to ensure that data wrangling steps benefit from automation wherever possible. In this paper, we define a methodology to fully automate an end-to-end data wrangling process incorporating data context, which associates portions of a target schema with potentially spurious extensional data of types that are commonly available. Instance-based evidence together with data profiling paves the way to inform automation in several steps within the wrangling process, specifically, matching, mapping validation, value format transformation, and data repair. The approach is evaluated with real estate data showing substantial improvements in the results of automated wrangling.
Submitted 22 November, 2018;
originally announced November 2018.
-
Guidelines for reporting the use of gel electrophoresis in proteomics
Authors:
Frank Gibson,
Leigh Anderson,
Gyorgy Babnigg,
Mark Baker,
Matthias Berth,
Pierre-Alain Binz,
Andy Borthwick,
Phil Cash,
Billy W Day,
David B Friedman,
Donita Garland,
Howard B Gutstein,
Christine Hoogland,
Neil A Jones,
Alamgir Khan,
Joachim Klose,
Angus I Lamond,
Peter F Lemkin,
Kathryn S Lilley,
Jonathan Minden,
Nicholas J Morris,
Norman W Paton,
Michael R Pisano,
John E Prime,
Thierry Rabilloud,
et al. (5 additional authors not shown)
Abstract:
The MIAPE Gel Electrophoresis (MIAPE-GE) guidelines specify the minimum information that should be provided when reporting the use of n-dimensional gel electrophoresis in a proteomics experiment. Developed through a joint effort between the gel-based analysis working group of the Human Proteome Organisation's Proteomics Standards Initiative (HUPO-PSI; http://www.psidev.info/) and the wider proteomics community, they constitute one part of the overall Minimum Information about a Proteomics Experiment (MIAPE) documentation system published last August in Nature Biotechnology.
Submitted 4 April, 2009;
originally announced April 2009.