
An Empirical Evaluation of Columnar Storage Formats

Authors: Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, Huanchen Zhang

Abstract: Columnar storage is a core component of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support to open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. We also point out the inefficiencies in the format designs when handling common machine learning workloads and using GPUs for decoding. Our analysis identified important considerations that may guide future formats to better fit modern technology trends.

Submitted 7 November, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

Comments: 15 pages; typos corrected, missing figure legend added

arXiv:2004.14471

Mainlining Databases: Supporting Fast Transactional Workloads on Universal Columnar Data File Formats

Authors: Tianyu Li, Matthew Butrovich, Amadou Ngom, Wan Shen Lim, Wes McKinney, Andrew Pavlo

Abstract: The proliferation of modern data processing tools has given rise to open-source columnar data formats. The advantage of these formats is that they help organizations avoid repeatedly converting data to a new format for each application. These formats, however, are read-only, and organizations must use a heavy-weight transformation process to load data from on-line transactional processing (OLTP) systems. We aim to reduce or even eliminate this process by developing a storage architecture for in-memory database management systems (DBMSs) that is aware of the eventual usage of its data and emits columnar storage blocks in a universal open-source format. We introduce relaxations to common analytical data formats to efficiently update records and rely on a lightweight transformation process to convert blocks to a read-optimized layout when they are cold. We also describe how to access data from third-party analytical tools with minimal serialization overhead. To evaluate our work, we implemented our storage engine based on the Apache Arrow format and integrated it into the DB-X DBMS. Our experiments show that our approach achieves comparable performance with dedicated OLTP DBMSs while enabling orders-of-magnitude faster data exports to external data science and machine learning tools than existing methods.

Submitted 29 April, 2020; originally announced April 2020.

Comments: 16 pages

arXiv:cond-mat/9911382

The First Synchrotron Infrared Beamlines at the ALS: Spectromicroscopy and Fast Timing

Authors: Michael C. Martin, Wayne R. McKinney

Abstract: Two recently commissioned infrared beamlines on the 1.4 bending magnet port at the Advanced Light Source, LBNL, are described. Using a synchrotron as an IR source provides three primary advantages: increased brightness, very fast light pulses, and enhanced far-IR flux. The considerable brightness advantage manifests itself most beneficially when performing spectroscopy on a microscopic length scale. Beamline (BL) 1.4.3 is a dedicated FTIR spectromicroscopy beamline, where a diffraction-limited spot size using the synchrotron source is utilized. BL 1.4.2 consists of a vacuum FTIR bench with a wide spectral range and step-scan capability. This BL makes use of the pulsed nature of the synchrotron light as well as the far-IR flux. Fast timing is demonstrated by observing the pulses from the electron bunch storage pattern at the ALS. Results from several experiments from both IR beamlines will be presented as an overview of the IR research currently being done at the ALS.

Submitted 23 November, 1999; originally announced November 1999.

Comments: 10 pages, PDF format including figures. To be published in Ferroelectrics

Report number: LBNL-44208

Showing 1–3 of 3 results for author: McKinney, W