Showing 1–50 of 68 results for author: Kraska, T

Searching in archive cs.

  1. arXiv:2405.14696

    cs.CL cs.AI cs.DB

    A Declarative System for Optimizing AI Workloads

    Authors: Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, Gerardo Vitagliano

    Abstract: A long-standing goal of data management systems has been to build systems which can compute quantitative insights over large corpora of unstructured data in a cost-effective manner. Until recently, it was difficult and expensive to extract facts from company documents, data from scientific papers, or metrics from image and video corpora. Today's models can accomplish these tasks with high accuracy…

    Submitted 29 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: 29 pages, 9 figures

    ACM Class: H.2.3; I.2.5

  2. arXiv:2403.05676

    cs.CL

    PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design

    Authors: Wenqi Jiang, Shuai Zhang, Boran Han, Jie Wang, Bernie Wang, Tim Kraska

    Abstract: Retrieval-augmented generation (RAG) can enhance the generation quality of large language models (LLMs) by incorporating external token databases. However, retrievals from large databases can constitute a substantial portion of the overall generation time, particularly when retrievals are periodically performed to align the retrieved content with the latest states of generation. In this paper, we…

    Submitted 8 March, 2024; originally announced March 2024.

  3. arXiv:2403.02286

    cs.DB

    Stage: Query Execution Time Prediction in Amazon Redshift

    Authors: Ziniu Wu, Ryan Marcus, Zhengchun Liu, Parimarjan Negi, Vikram Nathan, Pascal Pfeil, Gaurav Saxena, Mohammad Rahman, Balakrishnan Narayanaswamy, Tim Kraska

    Abstract: Query performance (e.g., execution time) prediction is a critical component of modern DBMSes. As a pioneering cloud data warehouse, Amazon Redshift relies on an accurate execution time prediction for many downstream tasks, ranging from high-level optimizations, such as automatically creating materialized views, to low-level tasks on the critical path of query execution, such as admission, scheduli…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: 15 pages

  4. arXiv:2310.04830

    cs.DB cs.CV cs.LG

    Extract-Transform-Load for Video Streams

    Authors: Ferdinand Kossmann, Ziniu Wu, Eugenie Lai, Nesime Tatbul, Lei Cao, Tim Kraska, Samuel Madden

    Abstract: Social media, self-driving cars, and traffic cameras produce video streams at large scales and cheap cost. However, storing and querying video at such scales is prohibitively expensive. We propose to treat large-scale video analytics as a data warehousing problem: Video is a format that is easy to produce but needs to be transformed into an application-specific format that is easy to query. Analog…

    Submitted 7 October, 2023; originally announced October 2023.

    Comments: 26 pages, 23 figures

    Journal ref: Proc. VLDB Endow. 16, 9 (May 2023), 2302-2315

  5. arXiv:2310.00749

    cs.DB cs.LG

    SEED: Domain-Specific Data Curation With Large Language Models

    Authors: Zui Chen, Lei Cao, Sam Madden, Tim Kraska, Zeyuan Shang, Ju Fan, Nan Tang, Zihui Gu, Chunwei Liu, Michael Cafarella

    Abstract: Data curation tasks that prepare data for analytics are critical for turning data into actionable insights. However, due to the diverse requirements of applications in different domains, generic off-the-shelf tools are typically insufficient. As a result, data scientists often have to develop domain-specific solutions tailored to both the dataset and the task, e.g. writing domain-specific code or…

    Submitted 24 April, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: preprint, 20 pages, 4 figures

  6. arXiv:2305.05671

    cs.DB cs.DS

    Parallel External Sorting of ASCII Records Using Learned Models

    Authors: Ani Kristo, Tim Kraska

    Abstract: External sorting is at the core of many operations in large-scale database systems, such as ordering and aggregation queries for large result sets, building indexes, sort-merge joins, duplicate removal, sharding, and record clustering. Unlike in-memory sorting, these algorithms need to work together with the OS and the filesystem to efficiently utilize system resources and minimize disk I/O. In…

    Submitted 8 May, 2023; originally announced May 2023.

  7. arXiv:2212.05526

    cs.DB cs.LG

    FactorJoin: A New Cardinality Estimation Framework for Join Queries

    Authors: Ziniu Wu, Parimarjan Negi, Mohammad Alizadeh, Tim Kraska, Samuel Madden

    Abstract: Cardinality estimation is one of the most fundamental and challenging problems in query optimization. Neither classical nor learning-based methods yield satisfactory performance when estimating the cardinality of the join queries. They either rely on simplified assumptions leading to ineffective cardinality estimates or build large models to understand the data distributions, leading to long plann…

    Submitted 11 December, 2022; originally announced December 2022.

    Comments: Paper accepted by SIGMOD 2023

  8. arXiv:2205.05769

    cs.DB cs.LG

    LSI: A Learned Secondary Index Structure

    Authors: Andreas Kipf, Dominik Horn, Pascal Pfeil, Ryan Marcus, Tim Kraska

    Abstract: Learned index structures have been shown to achieve favorable lookup performance and space consumption compared to their traditional counterparts such as B-trees. However, most learned index studies have focused on the primary indexing setting, where the base data is sorted. In this work, we investigate whether learned indexes sustain their advantage in the secondary indexing setting. We introduce…

    Submitted 11 May, 2022; originally announced May 2022.

    Comments: Fifth International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM 2022)

  9. arXiv:2111.14905

    cs.DB cs.LG

    Bounding the Last Mile: Efficient Learned String Indexing

    Authors: Benjamin Spector, Andreas Kipf, Kapil Vaidya, Chi Wang, Umar Farooq Minhas, Tim Kraska

    Abstract: We introduce the RadixStringSpline (RSS) learned index structure for efficiently indexing strings. RSS is a tree of radix splines each indexing a fixed number of bytes. RSS approaches or exceeds the performance of traditional string indexes while using 7-70× less memory. RSS achieves this by using the minimal string prefix to sufficiently distinguish the data unlike most learned approaches…

    Submitted 29 November, 2021; originally announced November 2021.

    Comments: 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB'21), August 20, 2021, Copenhagen, Denmark

  10. arXiv:2111.08824

    cs.DB

    The Case for Learned In-Memory Joins

    Authors: Ibrahim Sabek, Tim Kraska

    Abstract: In-memory join is an essential operator in any database engine. It has been extensively investigated in the database literature. In this paper, we study whether exploiting the CDF-based learned models to boost the join performance is practical or not. To the best of our knowledge, we are the first to fill this gap. We investigate the usage of CDF-based partitioning and learned indexes (e.g., Recur…

    Submitted 9 March, 2022; v1 submitted 16 November, 2021; originally announced November 2021.

    Comments: 18 pages, added more experimental evaluation results and technical details

  11. arXiv:2108.05117

    cs.DB cs.LG

    Towards Practical Learned Indexing

    Authors: Mihail Stoian, Andreas Kipf, Ryan Marcus, Tim Kraska

    Abstract: Latest research proposes to replace existing index structures with learned models. However, current learned indexes tend to have many hyperparameters, often do not provide any error guarantees, and are expensive to build. We introduce Practical Learned Index (PLEX). PLEX only has a single hyperparameter ε (maximum prediction error) and offers a better trade-off between build and lookup time than…

    Submitted 6 November, 2021; v1 submitted 11 August, 2021; originally announced August 2021.

    Comments: 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB'21), August 20, 2021, Copenhagen, Denmark

  12. arXiv:2107.03290

    cs.DS

    Defeating duplicates: A re-design of the LearnedSort algorithm

    Authors: Ani Kristo, Kapil Vaidya, Tim Kraska

    Abstract: LearnedSort is a novel sorting algorithm that, unlike traditional methods, uses fast ML models to boost the sorting speed. The models learn to estimate the input's distribution and arrange the keys in sorted order by predicting their empirical cumulative distribution function (eCDF) values. LearnedSort has shown outstanding performance compared to state-of-the-art sorting algorithms on several dat…

    Submitted 5 July, 2021; originally announced July 2021.

  13. arXiv:2107.01464

    cs.DB

    When Are Learned Models Better Than Hash Functions?

    Authors: Ibrahim Sabek, Kapil Vaidya, Dominik Horn, Andreas Kipf, Tim Kraska

    Abstract: In this work, we aim to study when learned models are better hash functions, particularly for hash-maps. We use lightweight piece-wise linear models to replace the hash functions as they have small inference times and are sufficiently general to capture complex distributions. We analyze the learned models in terms of: the model inference time and the number of collisions. Surprisingly, we found that…

    Submitted 3 July, 2021; originally announced July 2021.

  14. LEA: A Learned Encoding Advisor for Column Stores

    Authors: Lujing Cen, Andreas Kipf, Ryan Marcus, Tim Kraska

    Abstract: Data warehouses organize data in a columnar format to enable faster scans and better compression. Modern systems offer a variety of column encodings that can reduce storage footprint and improve query performance. Selecting a good encoding scheme for a particular column is an optimization problem that depends on the data, the query workload, and the underlying hardware. We introduce Learned Encodi…

    Submitted 18 May, 2021; originally announced May 2021.

  15. arXiv:2103.13428

    cs.CV cs.LG

    TagMe: GPS-Assisted Automatic Object Annotation in Videos

    Authors: Songtao He, Favyen Bastani, Mohammad Alizadeh, Hari Balakrishnan, Michael Cafarella, Tim Kraska, Sam Madden

    Abstract: Training high-accuracy object detection models requires large and diverse annotated datasets. However, creating these data-sets is time-consuming and expensive since it relies on human annotators. We design, implement, and evaluate TagMe, a new approach for automatic object annotation in videos that uses GPS data. When the GPS trace of an object is available, TagMe matches the object's motion from…

    Submitted 24 March, 2021; originally announced March 2021.

    Comments: https://people.csail.mit.edu/songtao/tagme.html

  16. arXiv:2101.04964

    cs.DB

    Flow-Loss: Learning Cardinality Estimates That Matter

    Authors: Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, Mohammad Alizadeh

    Abstract: Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the…

    Submitted 13 January, 2021; originally announced January 2021.

  17. arXiv:2012.12501

    cs.DB cs.DC cs.LG

    Learned Indexes for a Google-scale Disk-based Database

    Authors: Hussam Abu-Libdeh, Deniz Altınbüken, Alex Beutel, Ed H. Chi, Lyric Doshi, Tim Kraska, Xiaozhou Li, Andy Ly, Christopher Olston

    Abstract: There is great excitement about learned index structures, but understandable skepticism about the practicality of a new method uprooting decades of research on B-Trees. In this paper, we work to remove some of that uncertainty by demonstrating how a learned index can be integrated in a distributed, disk-based database system: Google's Bigtable. We detail several design decisions we made to integra…

    Submitted 23 December, 2020; originally announced December 2020.

    Comments: 4 pages, Presented at Workshop on ML for Systems at NeurIPS 2020

  18. arXiv:2012.06683

    cs.DB cs.IR

    Cortex: Harnessing Correlations to Boost Query Performance

    Authors: Vikram Nathan, Jialin Ding, Tim Kraska, Mohammad Alizadeh

    Abstract: Databases employ indexes to filter out irrelevant records, which reduces scan overhead and speeds up query execution. However, this optimization is only available to queries that filter on the indexed attribute. To extend these speedups to queries on other attributes, database systems have turned to secondary and multi-dimensional indexes. Unfortunately, these approaches are restrictive: secondary…

    Submitted 11 December, 2020; originally announced December 2020.

    Comments: 13 pages, including references. Under submission

  19. arXiv:2007.11112

    cs.OS cs.AR cs.DB cs.DC cs.NI

    DBOS: A Proposal for a Data-Centric Operating System

    Authors: Michael Cafarella, David DeWitt, Vijay Gadepally, Jeremy Kepner, Christos Kozyrakis, Tim Kraska, Michael Stonebraker, Matei Zaharia

    Abstract: Current operating systems are complex systems that were designed before today's computing environments. This makes it difficult for them to meet the scalability, heterogeneity, availability, and security challenges in current cloud and parallel computing environments. To address these problems, we propose a radically new OS design based on data-centric architecture: all operating system state shou…

    Submitted 21 July, 2020; originally announced July 2020.

  20. arXiv:2006.13282

    cs.DB cs.LG

    Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads

    Authors: Jialin Ding, Vikram Nathan, Mohammad Alizadeh, Tim Kraska

    Abstract: Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse. Techniques to accelerate the execution of filter expressions include clustered indexes, specialized sort orders (e.g., Z-order), multi-dimensional indexes, and, for high selectivity queries, secondary indexes. However, these schemes are hard to tune and their performance is inconsistent. Rec…

    Submitted 23 June, 2020; originally announced June 2020.

  21. Benchmarking Learned Indexes

    Authors: Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, Tim Kraska

    Abstract: Recent advancements in learned index structures propose replacing existing index structures, like B-Trees, with approximate learned models. In this work, we present a unified benchmark that compares well-tuned implementations of three learned index structures against several state-of-the-art "traditional" baselines. Using four real-world datasets, we demonstrate that learned index structures can i…

    Submitted 29 June, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

  22. arXiv:2006.05265

    cs.LG cs.SE stat.ML

    MISIM: A Neural Code Semantics Similarity System Using the Context-Aware Semantics Structure

    Authors: Fangke Ye, Shengtian Zhou, Anand Venkat, Ryan Marcus, Nesime Tatbul, Jesmin Jahan Tithi, Niranjan Hasabnis, Paul Petersen, Timothy Mattson, Tim Kraska, Pradeep Dubey, Vivek Sarkar, Justin Gottschlich

    Abstract: Code semantics similarity can be used for many tasks such as code recommendation, automated software defect correction, and clone detection. Yet, the accuracy of such systems has not yet reached a level of general purpose reliability. To help address this, we present Machine Inferred Code Similarity (MISIM), a neural code semantics similarity system consisting of two core components: (i) MISIM uses…

    Submitted 2 June, 2021; v1 submitted 5 June, 2020; originally announced June 2020.

    Comments: arXiv admin note: text overlap with arXiv:2003.11118

  23. arXiv:2006.03176

    cs.DS cs.DB cs.LG

    Partitioned Learned Bloom Filter

    Authors: Kapil Vaidya, Eric Knorr, Tim Kraska, Michael Mitzenmacher

    Abstract: Bloom filters are space-efficient probabilistic data structures that are used to test whether an element is a member of a set, and may return false positives. Recently, variations referred to as learned Bloom filters were developed that can provide improved performance in terms of the rate of false positives, by using a learned model for the represented set. However, previous methods for learned B…

    Submitted 4 October, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

    Comments: 13 pages, 3 figures

  24. ExSample: Efficient Searches on Video Repositories through Adaptive Sampling

    Authors: Oscar Moll, Favyen Bastani, Sam Madden, Mike Stonebraker, Vijay Gadepally, Tim Kraska

    Abstract: Capturing and processing video is increasingly common as cameras become cheaper to deploy. At the same time, rich video understanding methods have progressed greatly in the last decade. As a result, many organizations now have massive repositories of video data, with applications in mapping, navigation, autonomous driving, and other areas. Because state-of-the-art object detection methods are slow…

    Submitted 12 August, 2022; v1 submitted 18 May, 2020; originally announced May 2020.

    Journal ref: 2022 IEEE 38th International Conference on Data Engineering (ICDE)

  25. Fast Mapping onto Census Blocks

    Authors: Jeremy Kepner, Andreas Kipf, Darren Engwirda, Navin Vembar, Michael Jones, Lauren Milechin, Vijay Gadepally, Chris Hill, Tim Kraska, William Arcand, David Bestor, William Bergeron, Chansup Byun, Matthew Hubbell, Michael Houle, Andrew Kirby, Anna Klein, Julie Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Sid Samsi, Charles Yee, Peter Michaleas

    Abstract: Pandemic measures such as social distancing and contact tracing can be enhanced by rapidly integrating dynamic location data and demographic data. Projecting billions of longitude and latitude locations onto hundreds of thousands of highly irregular demographic census block polygons is computationally challenging in both research and deployment contexts. This paper describes two approaches labeled…

    Submitted 1 August, 2020; v1 submitted 6 May, 2020; originally announced May 2020.

    Comments: 8 pages, 7 figures, 55 references; accepted to IEEE HPEC 2020

  26. arXiv:2004.14541

    cs.DB cs.LG

    RadixSpline: A Single-Pass Learned Index

    Authors: Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann

    Abstract: Recent research has shown that learned models can outperform state-of-the-art index structures in size and lookup performance. While this is a very promising result, existing learned structures are often cumbersome to implement and are slow to build. In fact, most approaches that we are aware of require multiple training passes over the data. We introduce RadixSpline (RS), a learned index that c…

    Submitted 22 May, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM 2020)

  27. Learned Garbage Collection

    Authors: Lujing Cen, Ryan Marcus, Hongzi Mao, Justin Gottschlich, Mohammad Alizadeh, Tim Kraska

    Abstract: Several programming languages use garbage collectors (GCs) to automatically manage memory for the programmer. Such collectors must decide when to look for unreachable objects to free, which can have a large performance impact on some applications. In this preliminary work, we propose a design for a learned garbage collector that autonomously learns over time when to perform collections. By using r…

    Submitted 28 April, 2020; originally announced April 2020.

  28. Bao: Learning to Steer Query Optimizers

    Authors: Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, Tim Kraska

    Abstract: Query optimization remains one of the most challenging problems in data management systems. Recent efforts to apply machine learning techniques to query optimization challenges have been promising, but have shown few practical gains due to substantive training overhead, inability to adapt to changes, and poor tail performance. Motivated by these difficulties and drawing upon a long history of rese…

    Submitted 8 April, 2020; originally announced April 2020.

  29. arXiv:2003.11118

    cs.PL cs.AI

    Context-Aware Parse Trees

    Authors: Fangke Ye, Shengtian Zhou, Anand Venkat, Ryan Marcus, Paul Petersen, Jesmin Jahan Tithi, Tim Mattson, Tim Kraska, Pradeep Dubey, Vivek Sarkar, Justin Gottschlich

    Abstract: The simplified parse tree (SPT) presented in Aroma, a state-of-the-art code recommendation system, is a tree-structured representation used to infer code semantics by capturing program structure rather than program syntax. This is a departure from the classical abstract syntax tree, which is principally driven by programming language syntax. While we believe a semantics-driven repres…

    Submitted 24 March, 2020; originally announced March 2020.

  30. arXiv:2003.09758

    cs.LG cs.DB stat.ML

    ARDA: Automatic Relational Data Augmentation for Machine Learning

    Authors: Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, David Karger

    Abstract: Automatic machine learning (AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmen…

    Submitted 21 March, 2020; originally announced March 2020.

  31. arXiv:1912.01668

    cs.DB cs.DS cs.LG

    Learning Multi-dimensional Indexes

    Authors: Vikram Nathan, Jialin Ding, Mohammad Alizadeh, Tim Kraska

    Abstract: Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often create clustered indexes over a single dimension or multi-dimensional indexes such as R-trees, or use complex sort orders (e.g., Z-ordering). However, these schemes are often hard to tune and their performance is inconsisten…

    Submitted 3 December, 2019; originally announced December 2019.

  32. arXiv:1911.13014

    cs.DB cs.DS cs.LG

    SOSD: A Benchmark for Learned Indexes

    Authors: Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann

    Abstract: A groundswell of recent work has focused on improving data management systems with learned components. Specifically, work on learned index structures has proposed replacing traditional index structures, such as B-trees, with learned models. Given the decades of research committed to improving index structures, there is significant skepticism about whether learned indexes actually outperform state-…

    Submitted 29 November, 2019; originally announced November 2019.

    Comments: NeurIPS 2019 Workshop on Machine Learning for Systems

  33. arXiv:1910.04728

    cs.DB cs.DS cs.LG

    LISA: Towards Learned DNA Sequence Search

    Authors: Darryl Ho, Jialin Ding, Sanchit Misra, Nesime Tatbul, Vikram Nathan, Vasimuddin Md, Tim Kraska

    Abstract: Next-generation sequencing (NGS) technologies have enabled affordable sequencing of billions of short DNA fragments at high throughput, paving the way for population-scale genomics. Genomics data analytics at this scale requires overcoming performance bottlenecks, such as searching for short DNA sequences over long reference sequences. In this paper, we introduce LISA (Learned Indexes for Sequence…

    Submitted 10 October, 2019; originally announced October 2019.

  34. arXiv:1905.10688

    cs.LG cs.DB cs.IR stat.ML

    Sherlock: A Deep Learning Approach to Semantic Data Type Detection

    Authors: Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, Çağatay Demiralp, César Hidalgo

    Abstract: Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number o…

    Submitted 25 May, 2019; originally announced May 2019.

    Comments: KDD'19

  35. arXiv:1905.08898

    cs.DB cs.DS cs.LG

    ALEX: An Updatable Adaptive Learned Index

    Authors: Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, Tim Kraska

    Abstract: Recent work on "learned indexes" has changed the way we look at the decades-old field of DBMS indexing. The key idea is that indexes can be thought of as "models" that predict the position of a key in a dataset. Indexes can, thus, be learned. The original work by Kraska et al. shows that a learned index beats a B+Tree by a factor of up to three in search time and by an order of magnitude in memory…

    Submitted 20 May, 2020; v1 submitted 21 May, 2019; originally announced May 2019.

    Report number: MSR-TR-2020-12

  36. arXiv:1905.04616

    cs.HC cs.DB cs.LG

    VizNet: Towards A Large-Scale Visualization Learning and Benchmarking Repository

    Authors: Kevin Hu, Neil Gaikwad, Michiel Bakker, Madelon Hulsebos, Emanuel Zgraggen, César Hidalgo, Tim Kraska, Guoliang Li, Arvind Satyanarayan, Çağatay Demiralp

    Abstract: Researchers currently rely on ad hoc datasets to train automated visualization tools and evaluate the effectiveness of visualization designs. These exemplars often lack the characteristics of real-world datasets, and their one-off nature makes it difficult to compare different techniques. In this paper, we present VizNet: a large-scale corpus of over 31 million datasets compiled from open data rep…

    Submitted 11 May, 2019; originally announced May 2019.

    Comments: CHI'19

  37. Neo: A Learned Query Optimizer

    Authors: Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, Nesime Tatbul

    Abstract: Query optimization is one of the most challenging problems in database systems. Despite the progress made over the past decades, query optimizers remain extremely complex components that require a great deal of hand-tuning for specific workloads and datasets. Motivated by this shortcoming and inspired by recent advances in applying machine learning to data management challenges, we introduce Neo (…

    Submitted 7 April, 2019; originally announced April 2019.

  38. arXiv:1904.03257

    cs.LG cs.DB cs.DC cs.SE stat.ML

    MLSys: The New Frontier of Machine Learning Systems

    Authors: Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, et al. (44 additional authors not shown)

    Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a ne…

    Submitted 1 December, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

  39. arXiv:1902.08291

    cs.DB

    How I Learned to Stop Worrying and Love Re-optimization

    Authors: Matthew Perron, Zeyuan Shang, Tim Kraska, Michael Stonebraker

    Abstract: Cost-based query optimizers remain one of the most important components of database management systems for analytic workloads. Though modern optimizers select plans close to optimal performance in the common case, a small number of queries are an order of magnitude slower than they could be. In this paper we investigate why this is still the case, despite decades of improvements to cost models, pl…

    Submitted 19 March, 2019; v1 submitted 21 February, 2019; originally announced February 2019.

    Comments: Short version appearing in ICDE 2019

  40. arXiv:1901.10875

    cs.CR stat.OT

    STAR: Statistical Tests with Auditable Results

    Authors: Sacha Servan-Schreiber, Olga Ohrimenko, Tim Kraska, Emanuel Zgraggen

    Abstract: We present STAR: a novel system aimed at solving the complex issue of "p-hacking" and false discoveries in scientific studies. STAR provides a concrete way for ensuring the application of false discovery control procedures in hypothesis testing, using mathematically provable guarantees, with the goal of reducing the risk of data dredging. STAR generates an efficiently auditable certificate which a…

    Submitted 23 October, 2019; v1 submitted 19 January, 2019; originally announced January 2019.

  41. Chiller: Contention-centric Transaction Execution and Data Partitioning for Modern Networks

    Authors: Erfan Zamanian, Julian Shun, Carsten Binnig, Tim Kraska

    Abstract: Distributed transactions on high-overhead TCP/IP-based networks were conventionally considered to be prohibitively expensive and thus were avoided at all costs. To that end, the primary goal of almost any existing partitioning scheme is to minimize the number of cross-partition transactions. However, with the new generation of fast RDMA-enabled networks, this assumption is no longer valid. In fact…

    Submitted 16 April, 2020; v1 submitted 29 November, 2018; originally announced November 2018.

  42. arXiv:1811.00602

    cs.DB

    VizRec: A framework for secure data exploration via visual representation

    Authors: Lorenzo De Stefani, Leonhard F. Spiegelberg, Tim Kraska, Eli Upfal

    Abstract: Visual representations of data (visualizations) are tools of great importance and widespread use in data analytics as they provide users visual insight to patterns in the observed data in a simple and effective way. However, since visualization tools are applied to sample data, there is a risk of visualizing random fluctuations in the sample rather than a true pattern in the data. This problem…

    Submitted 1 November, 2018; originally announced November 2018.

  43. arXiv:1808.08294  [pdf, other]

    cs.LG stat.ML

    Unknown Examples & Machine Learning Model Generalization

    Authors: Yeounoh Chung, Peter J. Haas, Eli Upfal, Tim Kraska

    Abstract: Over the past decades, researchers and ML practitioners have come up with better and better ways to build, understand and improve the quality of ML models, but mostly under the key assumption that the training data is distributed identically to the testing data. In many real-world applications, however, some potential training examples are unknown to the modeler, due to sample selection bias or, m…

    Submitted 11 October, 2019; v1 submitted 24 August, 2018; originally announced August 2018.

  44. arXiv:1808.04819  [pdf, other]

    cs.HC cs.AI cs.LG

    VizML: A Machine Learning Approach to Visualization Recommendation

    Authors: Kevin Z. Hu, Michiel A. Bakker, Stephen Li, Tim Kraska, César A. Hidalgo

    Abstract: Data visualization should be accessible for all analysts with data, not just the few with technical expertise. Visualization recommender systems aim to lower the barrier to exploring basic visualizations by automatically generating results for analysts to search and select, rather than manually specify. Here, we demonstrate a novel machine learning-based approach to visualization recommendation th…

    Submitted 14 August, 2018; originally announced August 2018.

  45. arXiv:1807.06068  [pdf, other]

    cs.DB cs.LG

    Automated Data Slicing for Model Validation: A Big data - AI Integration Approach

    Authors: Yeounoh Chung, Tim Kraska, Neoklis Polyzotis, Ki Hyun Tae, Steven Euijong Whang

    Abstract: As machine learning systems become democratized, it becomes increasingly important to help users easily debug their models. However, current data tools are still primitive when it comes to helping users trace model performance problems all the way to the data. We focus on the particular problem of slicing data to identify subsets of the validation data where the model performs poorly. This is an i…

    Submitted 6 January, 2019; v1 submitted 16 July, 2018; originally announced July 2018.

  46. arXiv:1806.03723  [pdf, other]

    stat.ML cs.LG

    Smallify: Learning Network Size while Training

    Authors: Guillaume Leclerc, Manasi Vartak, Raul Castro Fernandez, Tim Kraska, Samuel Madden

    Abstract: As neural networks become widely deployed in different applications and on different hardware, it has become increasingly important to optimize inference time and model size along with model accuracy. Most current techniques optimize model size, model accuracy and inference time in different stages, resulting in suboptimal results and computational inefficiency. In this work, we propose a new tech…

    Submitted 10 June, 2018; originally announced June 2018.

    Comments: 11 pages, 3 figures

  47. arXiv:1804.02593  [pdf, other]

    cs.DB

    IDEBench: A Benchmark for Interactive Data Exploration

    Authors: Philipp Eichmann, Carsten Binnig, Tim Kraska, Emanuel Zgraggen

    Abstract: Existing benchmarks for analytical database systems such as TPC-DS and TPC-H are designed for static reporting scenarios. The main metric of these benchmarks is the performance of running individual SQL queries over a synthetic database. In this paper, we argue that such benchmarks are not suitable for evaluating database workloads originating from interactive data exploration (IDE) systems where…

    Submitted 7 April, 2018; originally announced April 2018.

  48. FITing-Tree: A Data-aware Index Structure

    Authors: Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, Tim Kraska

    Abstract: Index structures are one of the most important tools that DBAs leverage to improve the performance of analytics and transactional workloads. However, building several indexes over large datasets can often become prohibitive and consume valuable system resources. In fact, a recent study showed that indexes created as part of the TPC-C benchmark can account for 55% of the total memory available in a…

    Submitted 25 March, 2020; v1 submitted 30 January, 2018; originally announced January 2018.

    Comments: 18 pages

    Journal ref: SIGMOD (2019) 1189-1206

  49. SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks

    Authors: Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, Tim Kraska

    Abstract: Going deeper and wider in neural architectures improves the accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need to change to less desired network architectures, or nontrivially dissect a network across multiple GPUs. These distract DL practitioners from concentrating on their original machine learning tasks. We pr…

    Submitted 12 January, 2018; originally announced January 2018.

    Comments: PPoPP '2018: 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

  50. arXiv:1712.01208  [pdf, other]

    cs.DB cs.DS cs.NE

    The Case for Learned Index Structures

    Authors: Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis

    Abstract: Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be…

    Submitted 30 April, 2018; v1 submitted 4 December, 2017; originally announced December 2017.