Showing 1–7 of 7 results for author: Paton, N W

  1. arXiv:2405.17723  [pdf, other]

    cs.DB

    TableDC: Deep Clustering for Tabular Data

    Authors: Hafiz Tayyab Rauf, Andre Freitas, Norman W. Paton

    Abstract: Deep clustering (DC), a fusion of deep representation learning and clustering, has recently demonstrated positive results in data science, particularly text processing and computer vision. However, joint optimization of feature learning and data distribution in the multi-dimensional space is domain-specific, so existing DC methods struggle to generalize to other application domains (such as data i…

    Submitted 27 May, 2024; originally announced May 2024.

  2. arXiv:2305.13494  [pdf, other]

    cs.DB

    Deep Clustering for Data Cleaning and Integration

    Authors: Hafiz Tayyab Rauf, Andre Freitas, Norman W. Paton

    Abstract: Deep Learning (DL) techniques now constitute the state-of-the-art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying t…

    Submitted 22 September, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: The following enhancements have been carried out in the updated version of the manuscript: (1) evaluated each data integration problem on additional datasets; (2) added more DC and SC methods to the evaluation; (3) discussed algorithm-specific observations.

  3. arXiv:2206.03881  [pdf, other]

    cs.DB

    Towards Schema Inference for Data Lakes

    Authors: Nour Alhammad, Alex Bogatu, Norman W Paton

    Abstract: A data lake is a repository of data with potential for future analysis. However, both discovering what data is in a data lake and exploring related data sets can take significant effort, as a data lake can contain an intimidating amount of heterogeneous data. In this paper, we propose the use of schema inference to support the interpretation of the data in the data lake. If a data lake is to suppo…

    Submitted 8 June, 2022; originally announced June 2022.

    Comments: 13 pages, 2 figures

  4. Dataset Discovery in Data Lakes

    Authors: Alex Bogatu, Alvaro A. A. Fernandes, Norman W. Paton, Nikolaos Konstantinou

    Abstract: Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the issue arises of ho…

    Submitted 20 November, 2020; originally announced November 2020.

    Journal ref: 2020 IEEE 36th International Conference on Data Engineering (ICDE)

  5. arXiv:2011.10406  [pdf, other]

    cs.LG cs.DB

    Cost-effective Variational Active Entity Resolution

    Authors: Alex Bogatu, Norman W. Paton, Mark Douthwaite, Stuart Davie, Andre Freitas

    Abstract: Accurately identifying different representations of the same real-world entity is an integral part of data cleaning and many methods have been proposed to accomplish it. The challenges of this entity resolution task that demand so much research attention are often rooted in the task-specificity and user-dependence of the process. Adopting deep learning techniques has the potential to lessen these…

    Submitted 26 February, 2021; v1 submitted 20 November, 2020; originally announced November 2020.

    Journal ref: 2021 IEEE 37th International Conference on Data Engineering (ICDE)

  6. Data Context Informed Data Wrangling

    Authors: Martin Koehler, Alex Bogatu, Cristina Civili, Nikolaos Konstantinou, Edward Abel, Alvaro A. A. Fernandes, John Keane, Leonid Libkin, Norman W. Paton

    Abstract: The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process have been carried out using Extract-Transform-Load platforms, with significant manual involvement in specifying, configuring or tuning many of them. Cost-effective data wrangling processes need…

    Submitted 22 November, 2018; originally announced November 2018.

    Journal ref: 2017 IEEE International Conference on Big Data (Big Data), pp. 956-963, Boston, MA, 11-14 December, 2017

  7. Guidelines for reporting the use of gel electrophoresis in proteomics

    Authors: Frank Gibson, Leigh Anderson, Gyorgy Babnigg, Mark Baker, Matthias Berth, Pierre-Alain Binz, Andy Borthwick, Phil Cash, Billy W Day, David B Friedman, Donita Garland, Howard B Gutstein, Christine Hoogland, Neil A Jones, Alamgir Khan, Joachim Klose, Angus I Lamond, Peter F Lemkin, Kathryn S Lilley, Jonathan Minden, Nicholas J Morris, Norman W Paton, Michael R Pisano, John E Prime, Thierry Rabilloud, et al. (5 additional authors not shown)

    Abstract: The MIAPE Gel Electrophoresis (MIAPE-GE) guidelines specify the minimum information that should be provided when reporting the use of n-dimensional gel electrophoresis in a proteomics experiment. Developed through a joint effort between the gel-based analysis working group of the Human Proteome Organisation's Proteomics Standards Initiative (HUPO-PSI; http://www.psidev.info/) and the wider prote…

    Submitted 4 April, 2009; originally announced April 2009.

    Journal ref: Nat Biotechnol 26, 8 (2008) 863-4