Showing 1–11 of 11 results for author: Hulsebos, M

  1. arXiv:2405.04674

    cs.DB

    Towards Accurate and Efficient Document Analytics with Large Language Models

    Authors: Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeigham, Aditya G. Parameswaran, Eugene Wu

    Abstract: Unstructured data formats account for over 80% of the data currently stored, and extracting value from such formats remains a considerable challenge. In particular, current approaches for managing unstructured documents do not support ad-hoc analytical queries on document collections. Moreover, Large Language Models (LLMs) directly applied to the documents themselves, or on portions of documents t…

    Submitted 7 May, 2024; originally announced May 2024.

  2. arXiv:2401.03038

    cs.DB cs.SE

    SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines

    Authors: Shreya Shankar, Haotian Li, Parth Asawa, Madelon Hulsebos, Yiming Lin, J. D. Zamfirescu-Pereira, Harrison Chase, Will Fu-Hinthorn, Aditya G. Parameswaran, Eugene Wu

    Abstract: Large language models (LLMs) are being increasingly deployed as part of pipelines that repeatedly process or generate data of some sort. However, a common barrier to deployment are the frequent and often unpredictable errors that plague LLMs. Acknowledging the inevitability of these errors, we propose data quality assertions to identify when LLMs may be making mistakes. We present SPADE, a m…

    Submitted 31 March, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

    Comments: 17 pages, 6 figures

  3. arXiv:2311.13806

    cs.DB cs.CL cs.LG

    AdaTyper: Adaptive Semantic Column Type Detection

    Authors: Madelon Hulsebos, Paul Groth, Çağatay Demiralp

    Abstract: Understanding the semantics of relational tables is instrumental for automation in data exploration and preparation systems. A key source for understanding a table is the semantics of its columns. With the rise of deep learning, learned table representations are now available, which can be applied for semantic type detection and achieve good performance on benchmarks. Nevertheless, we observe a ga…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: Submitted to VLDB'24

  4. arXiv:2310.07736

    cs.DB cs.LG

    Observatory: Characterizing Embeddings of Relational Tables

    Authors: Tianji Cong, Madelon Hulsebos, Zhenjie Sun, Paul Groth, H. V. Jagadish

    Abstract: Language models and specialized table embedding models have recently demonstrated strong performance on many tasks over tabular data. Researchers and practitioners are keen to leverage these models in many new application contexts; but limited understanding of the strengths and weaknesses of these models, and the table representations they generate, makes the process of finding a suitable model fo…

    Submitted 27 January, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Camera ready of VLDB 2024

  5. arXiv:2109.06160

    cs.DB cs.HC cs.LG

    Augmenting Decision Making via Interactive What-If Analysis

    Authors: Sneha Gathani, Madelon Hulsebos, James Gale, Peter J. Haas, Çağatay Demiralp

    Abstract: The fundamental goal of business data analysis is to improve business decisions using data. Business users often make decisions to achieve key performance indicators (KPIs) such as increasing customer retention or sales, or decreasing costs. To discover the relationship between data attributes hypothesized to be drivers and those corresponding to KPIs of interest, business users currently need to…

    Submitted 8 February, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

    Comments: CIDR'22

  6. arXiv:2109.05173

    cs.DB cs.HC cs.LG

    Making Table Understanding Work in Practice

    Authors: Madelon Hulsebos, Sneha Gathani, James Gale, Isil Dillig, Paul Groth, Çağatay Demiralp

    Abstract: Understanding the semantics of tables at scale is crucial for tasks like data integration, preparation, and search. Table understanding methods aim at detecting a table's topic, semantic column types, column relations, or entities. With the rise of deep learning, powerful models have been developed for these tasks with excellent accuracy on benchmarks. However, we observe that there exists a gap b…

    Submitted 10 September, 2021; originally announced September 2021.

    Comments: Submitted to CIDR'22

  7. arXiv:2106.07258

    cs.DB cs.LG

    GitTables: A Large-Scale Corpus of Relational Tables

    Authors: Madelon Hulsebos, Çağatay Demiralp, Paul Groth

    Abstract: The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, w…

    Submitted 12 April, 2023; v1 submitted 14 June, 2021; originally announced June 2021.

  8. arXiv:1911.06311

    cs.DB cs.CL cs.LG

    Sato: Contextual Semantic Type Detection in Tables

    Authors: Dan Zhang, Yoshihiko Suhara, Jinfeng Li, Madelon Hulsebos, Çağatay Demiralp, Wang-Chiew Tan

    Abstract: Detecting the semantic types of data columns in relational tables is important for various data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, and semantic search. However, existing detection approaches either perform poorly with dirty data, support only a limited number of semantic types, fail to incorporate the table context of columns or rely…

    Submitted 3 June, 2020; v1 submitted 14 November, 2019; originally announced November 2019.

    Comments: VLDB'20

  9. arXiv:1905.10688

    cs.LG cs.DB cs.IR stat.ML

    Sherlock: A Deep Learning Approach to Semantic Data Type Detection

    Authors: Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, Çağatay Demiralp, César Hidalgo

    Abstract: Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number o…

    Submitted 25 May, 2019; originally announced May 2019.

    Comments: KDD'19

  10. arXiv:1905.04616

    cs.HC cs.DB cs.LG

    VizNet: Towards A Large-Scale Visualization Learning and Benchmarking Repository

    Authors: Kevin Hu, Neil Gaikwad, Michiel Bakker, Madelon Hulsebos, Emanuel Zgraggen, César Hidalgo, Tim Kraska, Guoliang Li, Arvind Satyanarayan, Çağatay Demiralp

    Abstract: Researchers currently rely on ad hoc datasets to train automated visualization tools and evaluate the effectiveness of visualization designs. These exemplars often lack the characteristics of real-world datasets, and their one-off nature makes it difficult to compare different techniques. In this paper, we present VizNet: a large-scale corpus of over 31 million datasets compiled from open data rep…

    Submitted 11 May, 2019; originally announced May 2019.

    Comments: CHI'19

  11. arXiv:1705.04379

    stat.ML cs.LG

    The Network Nullspace Property for Compressed Sensing of Big Data over Networks

    Authors: Alexander Jung, Madelon Hulsebos

    Abstract: We present a novel condition, which we term the network nullspace property, which ensures accurate recovery of graph signals representing massive network-structured datasets from few signal values. The network nullspace property couples the cluster structure of the underlying network-structure with the geometry of the sampling set. Our results can be used to design efficient sampling strategies…

    Submitted 13 March, 2018; v1 submitted 11 May, 2017; originally announced May 2017.