arXiv:2003.09758

ARDA: Automatic Relational Data Augmentation for Machine Learning

Authors: Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, David Karger

Abstract: Automatic machine learning (AutoML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.

Submitted 21 March, 2020; originally announced March 2020.

arXiv:1905.10688

Sherlock: A Deep Learning Approach to Semantic Data Type Detection

Authors: Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, Çağatay Demiralp, César Hidalgo

Abstract: Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on 686,765 data columns retrieved from the VizNet corpus by matching 78 semantic types from DBpedia to column headers. We characterize each matched column with 1,588 features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.

Submitted 25 May, 2019; originally announced May 2019.

Comments: KDD'19

arXiv:1905.04616

VizNet: Towards A Large-Scale Visualization Learning and Benchmarking Repository

Authors: Kevin Hu, Neil Gaikwad, Michiel Bakker, Madelon Hulsebos, Emanuel Zgraggen, César Hidalgo, Tim Kraska, Guoliang Li, Arvind Satyanarayan, Çağatay Demiralp

Abstract: Researchers currently rely on ad hoc datasets to train automated visualization tools and evaluate the effectiveness of visualization designs. These exemplars often lack the characteristics of real-world datasets, and their one-off nature makes it difficult to compare different techniques. In this paper, we present VizNet: a large-scale corpus of over 31 million datasets compiled from open data repositories and online visualization galleries. On average, these datasets comprise 17 records over 3 dimensions and across the corpus, we find 51% of the dimensions record categorical data, 44% quantitative, and only 5% temporal. VizNet provides the necessary common baseline for comparing visualization design techniques, and developing benchmark models and algorithms for automating visual analysis. To demonstrate VizNet's utility as a platform for conducting online crowdsourced experiments at scale, we replicate a prior study assessing the influence of user task and data distribution on visual encoding effectiveness, and extend it by considering an additional task: outlier detection. To contend with running such studies at scale, we demonstrate how a metric of perceptual effectiveness can be learned from experimental results, and show its predictive power across test datasets.

Submitted 11 May, 2019; originally announced May 2019.

Comments: CHI'19

arXiv:1901.10875

STAR: Statistical Tests with Auditable Results

Authors: Sacha Servan-Schreiber, Olga Ohrimenko, Tim Kraska, Emanuel Zgraggen

Abstract: We present STAR: a novel system aimed at solving the complex issue of "p-hacking" and false discoveries in scientific studies. STAR provides a concrete way for ensuring the application of false discovery control procedures in hypothesis testing, using mathematically provable guarantees, with the goal of reducing the risk of data dredging. STAR generates an efficiently auditable certificate which attests to the validity of each statistical test performed on a dataset. STAR achieves this by using several cryptographic techniques which are combined specifically for this purpose. Under the hood, STAR uses a decentralized set of authorities (e.g., research institutions), secure computation techniques, and an append-only ledger, which together enable auditing of scientific claims by third parties and match real-world trust assumptions. We implement and evaluate a construction of STAR using the Microsoft SEAL encryption library and SPDZ multi-party computation protocol. Our experimental evaluation demonstrates the practicality of STAR in multiple real-world scenarios as a system for certifying scientific discoveries in a tamper-proof way.

Submitted 23 October, 2019; v1 submitted 19 January, 2019; originally announced January 2019.

arXiv:1804.02593

IDEBench: A Benchmark for Interactive Data Exploration

Authors: Philipp Eichmann, Carsten Binnig, Tim Kraska, Emanuel Zgraggen

Abstract: Existing benchmarks for analytical database systems such as TPC-DS and TPC-H are designed for static reporting scenarios. The main metric of these benchmarks is the performance of running individual SQL queries over a synthetic database. In this paper, we argue that such benchmarks are not suitable for evaluating database workloads originating from interactive data exploration (IDE) systems where most queries are ad-hoc, not based on predefined reports, and built incrementally. As a main contribution, we present a novel benchmark called IDEBench that can be used to evaluate the performance of database systems for IDE workloads. As opposed to traditional benchmarks for analytical database systems, our goal is to provide more meaningful workloads and datasets that can be used to benchmark IDE query engines, with a particular focus on metrics that capture the trade-off between query performance and quality of the result. As a second contribution, this paper evaluates and discusses the performance results of selected IDE query engines using our benchmark. The study includes two commercial systems, as well as two research prototypes (IDEA, approXimateDB/XDB), and one traditional analytical database system (MonetDB).

Submitted 7 April, 2018; originally announced April 2018.

arXiv:1612.01040

Controlling False Discoveries During Interactive Data Exploration

Authors: Zheguang Zhao, Lorenzo De Stefani, Emanuel Zgraggen, Carsten Binnig, Eli Upfal, Tim Kraska

Abstract: Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. The crux is that these tools implicitly allow the user to test a large body of different hypotheses with just a few clicks, thus incurring the issue commonly known in statistics as the multiple hypothesis testing error. In this paper, we propose solutions to integrate multiple hypothesis testing control into interactive data exploration tools. A key insight is that existing methods for controlling the false discovery rate (such as FDR) are not directly applicable for interactive data exploration. We therefore discuss a set of new control procedures that are better suited and integrate them in our system called Aware. By means of extensive experiments using both real-world and synthetic data sets we demonstrate how Aware can help experts and novice users alike to efficiently control false discoveries.

Submitted 3 December, 2016; originally announced December 2016.

Showing 1–6 of 6 results for author: Zgraggen, E