
TableDC: Deep Clustering for Tabular Data

Authors: Hafiz Tayyab Rauf, Andre Freitas, Norman W. Paton

Abstract: Deep clustering (DC), a fusion of deep representation learning and clustering, has recently demonstrated positive results in data science, particularly text processing and computer vision. However, joint optimization of feature learning and data distribution in the multi-dimensional space is domain-specific, so existing DC methods struggle to generalize to other application domains (such as data integration and cleaning). In data management tasks, where high-density embeddings and overlapping clusters dominate, a data management-specific DC algorithm should be able to interact better with the data properties for supporting data cleaning and integration tasks. This paper presents a deep clustering algorithm for tabular data (TableDC) that reflects the properties of data management applications, particularly schema inference, entity resolution, and domain discovery. To address overlapping clusters, TableDC integrates Mahalanobis distance, which considers variance and correlation within the data, offering a similarity method suitable for tables, rows, or columns in high-dimensional latent spaces. TableDC provides flexibility for the final clustering assignment and shows higher tolerance to outliers through its heavy-tailed Cauchy distribution as the similarity kernel. The proposed similarity measure is particularly beneficial where the embeddings of raw data are densely packed and exhibit high degrees of overlap. Data cleaning tasks may involve a large number of clusters, which affects the scalability of existing DC methods. TableDC's self-supervised module efficiently learns data embeddings with a large number of clusters compared to existing benchmarks, which scale in quadratic time. We evaluated TableDC with several existing DC, Standard Clustering (SC), and state-of-the-art bespoke methods over benchmark datasets. TableDC consistently outperforms existing DC, SC, and bespoke methods.
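
Illustrative sketch (not taken from the paper): the abstract describes combining Mahalanobis distance with a heavy-tailed Cauchy kernel to produce soft cluster assignments. The snippet below shows one plausible reading of that idea in plain Python. The function names, the shared inverse covariance matrix `cov_inv`, the scale parameter `gamma`, and the normalization step are all assumptions made for illustration; the actual TableDC formulation may differ.

```python
from typing import List, Sequence


def mahalanobis_sq(x: Sequence[float], mu: Sequence[float],
                   cov_inv: Sequence[Sequence[float]]) -> float:
    """Squared Mahalanobis distance (x - mu)^T * cov_inv * (x - mu)."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    n = len(diff)
    return sum(diff[i] * cov_inv[i][j] * diff[j]
               for i in range(n) for j in range(n))


def cauchy_soft_assign(x: Sequence[float],
                       centers: Sequence[Sequence[float]],
                       cov_inv: Sequence[Sequence[float]],
                       gamma: float = 1.0) -> List[float]:
    """Soft assignment of x to each center (hypothetical helper).

    Similarity uses a Cauchy-shaped kernel 1 / (1 + d^2 / gamma^2),
    where d^2 is the squared Mahalanobis distance. The similarities
    are normalized so that they sum to 1.
    """
    sims = [1.0 / (1.0 + mahalanobis_sq(x, c, cov_inv) / gamma ** 2)
            for c in centers]
    total = sum(sims)
    return [s / total for s in sims]


# Two cluster centers in a 2-D latent space, identity inverse covariance.
centers = [[0.0, 0.0], [4.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
q = cauchy_soft_assign([0.0, 0.0], centers, identity)
print(q)  # approximately [0.944, 0.056]
```

Because the Cauchy kernel decays polynomially rather than exponentially, a point far from every center still receives a non-negligible assignment weight, which is one way to read the abstract's claim of higher tolerance to outliers.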

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2305.13494 [pdf, other]

Deep Clustering for Data Cleaning and Integration

Authors: Hafiz Tayyab Rauf, Andre Freitas, Norman W. Paton

Abstract: Deep Learning (DL) techniques now constitute the state-of-the-art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, the impact of DC on mainstream data management tasks remains unexplored. In this paper, we address this gap by investigating the impact of DC in data cleaning and integration tasks, specifically schema inference, entity resolution, and domain discovery, tasks that represent clustering from the perspective of tables, rows, and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. However, we observed a significant correlation between the DC method and embedding approaches for rows, columns, and tables, highlighting that the suitable combination can enhance the efficiency of DC methods.

Submitted 22 September, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: The following enhancements have been carried out in the updated version of the manuscript: evaluated each data integration problem on additional datasets; added more DC and SC methods to the evaluation; discussed algorithm-specific observations.

arXiv:2209.01073 [pdf]

doi 10.1155/2022/7055910

Improved Fitness Dependent Optimizer for Solving Economic Load Dispatch Problem

Authors: Barzan Hussein Tahir, Tarik A. Rashid, Hafiz Tayyab Rauf, Nebojsa Bacanin, Amit Chhabra, S. Vimal, Zaher Mundher Yaseen

Abstract: Economic Load Dispatch plays a fundamental role in the operation of power systems, as it decreases the environmental load, minimizes the operating cost, and preserves energy resources. The optimal solution to Economic Load Dispatch problems and various constraints can be obtained by evolving several evolutionary and swarm-based algorithms. The major drawback to swarm-based algorithms is premature convergence towards an optimal solution. Fitness Dependent Optimizer (FDO) is a novel optimization algorithm inspired by the decision-making and reproductive process of bee swarming. FDO examines the search spaces based on the searching approach of Particle Swarm Optimization. To calculate the pace, the fitness function is utilized to generate weights that direct the search agents in the phases of exploitation and exploration. In this research, the authors have applied Fitness Dependent Optimizer to solve the Economic Load Dispatch problem by reducing fuel cost, emission allocation, and transmission loss. Moreover, the authors have developed an enhanced variant of Fitness Dependent Optimizer, which incorporates novel population initialization techniques and dynamically employed sine maps to select the weight factor for Fitness Dependent Optimizer. The enhanced population initialization approach incorporates a quasi-random Sobol sequence to generate the initial solution in the multi-dimensional search space. A standard 24-unit system is employed for experimental evaluation with different power demands. Empirical results obtained using the enhanced variant of the Fitness Dependent Optimizer demonstrate superior performance in terms of low transmission loss, low fuel cost, and low emission allocation compared to the conventional Fitness Dependent Optimizer. The experimental study obtained 7.94E-12.

Submitted 14 July, 2022; originally announced September 2022.

Comments: 42 pages

Journal ref: Computational Intelligence and Neuroscience (2022)

Showing 1–3 of 3 results for author: Rauf, H T