Kishu: Time-Traveling for Computational Notebooks

Authors: Zhaoheng Li, Supawit Chockchowwat, Ribhav Sahu, Areet Sheth, Yongjoo Park

Abstract: Computational notebooks (e.g., Jupyter, Google Colab) are widely used by data scientists. A key feature of notebooks is the interactive computing model of iteratively executing cells (i.e., a set of statements) and observing the result (e.g., model or plot). Unfortunately, existing notebook systems do not offer time-traveling to past states: when the user executes a cell, the notebook session state consisting of user-defined variables can be irreversibly modified - e.g., the user cannot 'un-drop' a dataframe column. This is because, unlike DBMS, existing notebook systems do not keep track of the session state. Existing techniques for checkpointing and restoring session states, such as OS-level memory snapshot or application-level session dump, are insufficient: checkpointing can incur prohibitive storage costs and may fail, while restoration can only be inefficiently performed from scratch by fully loading checkpoint files. In this paper, we introduce a new notebook system, Kishu, that offers time-traveling to and from arbitrary notebook states using an efficient and fault-tolerant incremental checkpoint and checkout mechanism. Kishu creates incremental checkpoints that are small and correctly preserve complex inter-variable dependencies at a novel Co-variable granularity. Then, to return to a previous state, Kishu accurately identifies the state difference between the current and target states to perform incremental checkout at sub-second latency with minimal data loading. Kishu is compatible with 146 object classes from popular data science libraries (e.g., Ray, Spark, PyTorch), and reduces checkpoint size and checkout time by up to 4.55x and 9.02x, respectively, on a variety of notebooks.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2306.14395 [pdf, other]

doi 10.1145/3617308

AirIndex: Versatile Index Tuning Through Data and Storage

Authors: Supawit Chockchowwat, Wenjie Liu, Yongjoo Park

Abstract: The end-to-end lookup latency of a hierarchical index -- such as a B-tree or a learned index -- is determined by its structure such as the number of layers, the kinds of branching functions appearing in each layer, the amount of data we must fetch from layers, etc. Our primary observation is that by optimizing those structural parameters (or designs) specifically to a target system's I/O characteristics (e.g., latency, bandwidth), we can offer a faster lookup compared to the ones that are not optimized. Can we develop a systematic method for finding those optimal design parameters? Ideally, the method must have the potential to generate almost any existing index or a novel combination of them for the fastest possible lookup. In this work, we present new data and an I/O-aware index builder (called AirIndex) that can find high-speed hierarchical index designs in a principled way. Specifically, AirIndex minimizes an objective function expressing the end-to-end latency in terms of various designs -- the number of layers, types of layers, and more -- for given data and a storage profile, using a graph-based optimization method purpose-built to address the computational challenges rising from the inter-dependencies among index layers and the exponentially many candidate parameters in a large search space. Our empirical studies confirm that AirIndex can find optimal index designs, build optimal indexes within the times comparable to existing methods, and deliver up to 4.1x faster lookup than a lightweight B-tree library (LMDB), 3.3x--46.3x faster than state-of-the-art learned indexes (RMI/CDFShop, PGM-Index, ALEX/APEX, PLEX), and 2.0x faster than Data Calculator's suggestion on various dataset and storage settings.

Submitted 1 September, 2023; v1 submitted 25 June, 2023; originally announced June 2023.

Comments: 13 pages, 3 appendices, 19 figures, to appear at SIGMOD 2024

arXiv:2305.08770 [pdf, other]

doi 10.1145/3595360.3595855

Transactional Python for Durable Machine Learning: Vision, Challenges, and Feasibility

Authors: Supawit Chockchowwat, Zhaoheng Li, Yongjoo Park

Abstract: In machine learning (ML), Python serves as a convenient abstraction for working with key libraries such as PyTorch, scikit-learn, and others. Unlike DBMS, however, Python applications may lose important data, such as trained models and extracted features, due to machine failures or human errors, leading to a waste of time and resources. Specifically, they lack four essential properties that could make ML more reliable and user-friendly -- durability, atomicity, replicability, and time-versioning (DART). This paper presents our vision of Transactional Python that provides DART without any code modifications to user programs or the Python kernel, by non-intrusively monitoring application states at the object level and determining a minimal amount of information sufficient to reconstruct a whole application. Our evaluation of a proof-of-concept implementation with public PyTorch and scikit-learn applications shows that DART can be offered with overheads ranging 1.5%--15.6%.

Submitted 15 May, 2023; originally announced May 2023.

Comments: 5 pages, 5 figures, to appear at DEEM 2023

arXiv:2303.04103 [pdf, other]

A Step Toward Deep Online Aggregation (Extended Version)

Authors: Nikhil Sheoran, Supawit Chockchowwat, Arav Chheda, Suwen Wang, Riya Verma, Yongjoo Park

Abstract: For exploratory data analysis, it is often desirable to know what answers you are likely to get before actually obtaining those answers. This can potentially be achieved by designing systems to offer the estimates of a data operation result -- say op(data) -- earlier in the process based on partial data processing. Those estimates continuously refine as more data is processed and finally converge to the exact answer. Unfortunately, the existing techniques -- called Online Aggregation (OLA) -- are limited to a single operation; that is, we cannot obtain the estimates for op(op(data)) or op(...(op(data))). If this Deep OLA becomes possible, data analysts will be able to explore data more interactively using complex cascade operations. In this work, we take a step toward Deep OLA with evolving data frames (edf), a novel data model to offer OLA for nested ops -- op(...(op(data))) -- by representing an evolving structured data (with converging estimates) that is closed under set operations. That is, op(edf) produces yet another edf; thus, we can freely apply successive operations to edf and obtain an OLA output for each op. We evaluate its viability with Wake, an edf-based OLA system, by examining against state-of-the-art OLA and non-OLA systems. In our experiments on TPC-H dataset, Wake produces its first estimates 4.93x faster (median) -- with 1.3x median slowdown for exact answers -- compared to conventional systems. Besides its generality, Wake is also 1.92x faster (median) than existing OLA systems in producing estimates of under 1% relative errors.

Submitted 7 March, 2023; originally announced March 2023.

Comments: 16 pages, 13 figures, 3 appendices, to appear at SIGMOD 2023

arXiv:2208.03823 [pdf, other]

Automatically Finding Optimal Index Structure

Authors: Supawit Chockchowwat, Wenjie Liu, Yongjoo Park

Abstract: Existing learned indexes (e.g., RMI, ALEX, PGM) optimize the internal regressor of each node, not the overall structure such as index height, the size of each layer, etc. In this paper, we share our recent findings that we can achieve significantly faster lookup speed by optimizing the structure as well as internal regressors. Specifically, our approach (called AirIndex) expresses the end-to-end lookup time as a novel objective function, and searches for optimal design decisions using a purpose-built optimizer. In our experiments with state-of-the-art methods, AirIndex achieves 3.3x-7.7x faster lookup for the data stored on local SSD, and 1.4x-3.0x faster lookup for the data on Azure Cloud Storage.

Submitted 7 August, 2022; originally announced August 2022.

Comments: 5 pages, to be published in AIDB at VLDB 2022

arXiv:2206.12959 [pdf, other]

Probabilistic PolarGMM: Unsupervised Cluster Learning of Very Noisy Projection Images of Unknown Pose

Authors: Supawit Chockchowwat, Chandrajit L. Bajaj

Abstract: A crucial step in single particle analysis (SPA) of cryogenic electron microscopy (Cryo-EM), 2D classification and alignment takes a collection of noisy particle images to infer orientations and group similar images together. Averaging these aligned and clustered noisy images produces a set of clean images, ready for further analysis such as 3D reconstruction. Fourier-Bessel steerable principal component analysis (FBsPCA) enables an efficient, adaptable, low-rank rotation operator. We extend the FBsPCA to additionally handle translations. In this extended FBsPCA representation, we use a probabilistic polar-coordinate Gaussian mixture model to learn soft clusters in an unsupervised fashion using an expectation maximization (EM) algorithm. The obtained rotational clusters are thus additionally robust to the presence of pairwise alignment imperfections. Multiple benchmarks from simulated Cryo-EM datasets show probabilistic PolarGMM's improved performance in comparisons with standard single-particle Cryo-EM tools, EMAN2 and RELION, in terms of various clustering metrics and alignment errors.

Submitted 26 June, 2022; originally announced June 2022.

Comments: 13 pages, including appendices

arXiv:2112.13323 [pdf, other]

Airphant: Cloud-oriented Document Indexing

Authors: Supawit Chockchowwat, Chaitanya Sood, Yongjoo Park

Abstract: Modern data warehouses can scale compute nodes independently of storage. These systems persist their data on cloud storage, which is always available and cost-efficient. Ad-hoc compute nodes then fetch necessary data on-demand from cloud storage. This ability to quickly scale or shrink data systems is highly beneficial if query workloads may change over time. We apply this new architecture to search engines with a focus on optimizing their latencies in cloud environments. However, simply placing existing search engines (e.g., Apache Lucene) on top of cloud storage significantly increases their end-to-end query latencies (i.e., more than 6 seconds on average in one of our studies). This is because their indexes can incur multiple network round-trips due to their hierarchical structure (e.g., skip lists, B-trees, learned indexes). To address this issue, we develop a new statistical index (called IoU Sketch). For lookup, IoU Sketch makes multiple asynchronous network requests in parallel. While IoU Sketch may fetch more bytes than existing indexes, it significantly reduces the index lookup time because parallel requests do not block each other. Based on IoU Sketch, we build an end-to-end search engine, called Airphant; we describe how Airphant builds, optimizes, and manages IoU Sketch; and ultimately, supports keyword-based querying. In our experiments with four real datasets, Airphant's average end-to-end latencies are between 13 milliseconds and 300 milliseconds, being up to 8.97x faster than Apache Lucene and 113.39x faster than Elasticsearch.

Submitted 26 December, 2021; originally announced December 2021.

Comments: 17 pages, to be published in ICDE 2022

arXiv:2106.10408 [pdf, other]

Step Out of Your Comfort Zone: More Inclusive Content Recommendation for Networked Systems

Authors: Jiaxin Wu, Supawit Chockchowwat

Abstract: Networked systems are widely applicable in real-world scenarios such as social networks, infrastructure networks, and biological networks. Among those applications, we are interested in social networks due to their complexity and popularity. One crucial task on a social network is to recommend new content based on special characteristics of the graph structure. In this project, we aim to enhance recommender systems by preventing recommendations from leaning towards content from closed communities. To counteract this bias, we consider information dissemination across the network as a metric to assess recommendations of content, e.g., new connections and news feeds. We use an academic collaboration network and user-item interaction datasets from Yelp to simulate an environment for connection recommendations and to validate the proposed algorithm.

Submitted 18 June, 2021; originally announced June 2021.

arXiv:1906.01408 [pdf, other]

Hypothesis-Driven Skill Discovery for Hierarchical Deep Reinforcement Learning

Authors: Caleb Chuck, Supawit Chockchowwat, Scott Niekum

Abstract: Deep reinforcement learning (DRL) is capable of learning high-performing policies on a variety of complex high-dimensional tasks, ranging from video games to robotic manipulation. However, standard DRL methods often suffer from poor sample efficiency, partially because they aim to be entirely problem-agnostic. In this work, we introduce a novel approach to exploration and hierarchical skill learning that derives its sample efficiency from intuitive assumptions it makes about the behavior of objects both in the physical world and simulations which mimic physics. Specifically, we propose the Hypothesis Proposal and Evaluation (HyPE) algorithm, which discovers objects from raw pixel data, generates hypotheses about the controllability of observed changes in object state, and learns a hierarchy of skills to test these hypotheses. We demonstrate that HyPE can dramatically improve the sample efficiency of policy learning in two different domains: a simulated robotic block-pushing domain, and a popular benchmark task: Breakout. In these domains, HyPE learns high-scoring policies an order of magnitude faster than several state-of-the-art reinforcement learning methods.

Submitted 3 March, 2020; v1 submitted 27 May, 2019; originally announced June 2019.

Comments: Submitted to IROS 2020

arXiv:1804.07284 [pdf, other]

doi 10.1145/3205455.3205635

Functional Generative Design: An Evolutionary Approach to 3D-Printing

Authors: Cem C. Tutum, Supawit Chockchowwat, Etienne Vouga, Risto Miikkulainen

Abstract: Consumer-grade printers are widely available, but their ability to print complex objects is limited. Therefore, new designs need to be discovered that serve the same function, but are printable. A representative such problem is to produce a working, reliable mechanical spring. The proposed methodology for discovering solutions to this problem consists of three components: First, an effective search space is learned through a variational autoencoder (VAE); second, a surrogate model for functional designs is built; and third, a genetic algorithm is used to simultaneously update the hyperparameters of the surrogate and to optimize the designs using the updated surrogate. Using a car-launcher mechanism as a test domain, spring designs were 3D-printed and evaluated to update the surrogate model. Two experiments were then performed: First, the initial set of designs for the surrogate-based optimizer was selected randomly from the training set that was used for training the VAE model, which resulted in an exploitative search behavior. On the other hand, in the second experiment, the initial set was composed of more uniformly selected designs from the same training set and a more explorative search behavior was observed. Both of the experiments showed that the methodology generates interesting, successful, and reliable spring geometries robust to the noise inherent in the 3D printing process. The methodology can be generalized to other functional design problems, thus making consumer-grade 3D printing more versatile.

Submitted 19 April, 2018; originally announced April 2018.

Comments: 8 pages, 12 figures, GECCO'18

Showing 1–10 of 10 results for author: Chockchowwat, S