-
Enhancing the Interactivity of Dataframe Queries by Leveraging Think Time
Authors:
Doris Xin,
Devin Petersohn,
Dixin Tang,
Yifan Wu,
Joseph E. Gonzalez,
Joseph M. Hellerstein,
Anthony D. Joseph,
Aditya G. Parameswaran
Abstract:
We propose opportunistic evaluation, a framework for accelerating interactions with dataframes. Interactive latency is critical for iterative, human-in-the-loop dataframe workloads for supporting exploratory data analysis. Opportunistic evaluation significantly reduces interactive latency by 1) prioritizing computation directly relevant to the interactions and 2) leveraging think time for asynchronous background computation for non-critical operators that might be relevant to future interactions. We show, through empirical analysis, that current user behavior presents ample opportunities for optimization, and the solutions we propose effectively harness such opportunities.
Submitted 2 March, 2021;
originally announced March 2021.
-
Towards Scalable Dataframe Systems
Authors:
Devin Petersohn,
Stephen Macke,
Doris Xin,
William Ma,
Doris Lee,
Xiangxi Mo,
Joseph E. Gonzalez,
Joseph M. Hellerstein,
Anthony D. Joseph,
Aditya Parameswaran
Abstract:
Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is significant ambiguity regarding dataframe semantics. In this paper we lay out a vision and roadmap for scalable dataframe systems. To demonstrate the potential in this area, we report on our experience building MODIN, a scaled-up implementation of the most widely-used and complex dataframe API today, Python's pandas. With pandas as a reference, we propose a simple data model and algebra for dataframes to ground discussion in the field. Given this foundation, we lay out an agenda of open research opportunities where the distinct features of dataframes will require extending the state of the art in many dimensions of data management. We discuss the implications of signature dataframe features including flexible schemas, ordering, row/column equivalence, and data/metadata fluidity, as well as the piecemeal, trial-and-error-based approach to interacting with dataframes.
Submitted 2 June, 2020; v1 submitted 3 January, 2020;
originally announced January 2020.
-
Convolutional Kitchen Sinks for Transcription Factor Binding Site Prediction
Authors:
Alyssa Morrow,
Vaishaal Shankar,
Devin Petersohn,
Anthony Joseph,
Benjamin Recht,
Nir Yosef
Abstract:
We present a simple and efficient method for prediction of transcription factor binding sites from DNA sequence. Our method computes a random approximation of a convolutional kernel feature map from DNA sequence and then learns a linear model from the approximated feature map. Our method outperforms state-of-the-art deep learning methods on five out of six test datasets from the ENCODE consortium, while training in less than one eighth the time.
Submitted 31 May, 2017;
originally announced June 2017.