
Showing 1–12 of 12 results for author: Byna, S

Searching in archive cs.
  1. arXiv:2406.19256  [pdf, other]

    cs.AI

    AI Data Readiness Inspector (AIDRIN) for Quantitative Assessment of Data Readiness for AI

    Authors: Kaveen Hiniduma, Suren Byna, Jean Luca Bez, Ravi Madduri

    Abstract: "Garbage In Garbage Out" is a universally agreed quote by computer scientists from various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest a considerable amount of time and effort in preparing the data for AI. However, there are no standard methods or frameworks for…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 12 pages, 9 figures, Accepted to SSDBM 2024

  2. arXiv:2404.10386  [pdf, other]

    cs.DC cs.AI cs.LG

    I/O in Machine Learning Applications on HPC Systems: A 360-degree Survey

    Authors: Noah Lewis, Jean Luca Bez, Suren Byna

    Abstract: High-Performance Computing (HPC) systems excel in managing distributed workloads, and the growing interest in Artificial Intelligence (AI) has resulted in a surge in demand for faster methods of Machine Learning (ML) model training and inference. In the past, research on HPC I/O focused on optimizing the underlying storage system for modeling and simulation applications and checkpointing the resul…

    Submitted 16 April, 2024; originally announced April 2024.

    ACM Class: H.3.2; H.3.4; I.2.11

  3. arXiv:2404.05779  [pdf, other]

    cs.LG cs.AI

    Data Readiness for AI: A 360-Degree Survey

    Authors: Kaveen Hiniduma, Suren Byna, Jean Luca Bez

    Abstract: Data are the critical fuel for Artificial Intelligence (AI) models. Poor quality data produces inaccurate and ineffective AI models that may lead to incorrect or unsafe use. Checking for data readiness is a crucial step in improving data quality. Numerous R&D efforts have been spent on improving data quality. However, standardized metrics for evaluating data readiness for use in AI training are st…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: 35 pages, 3 figures, 3 tables, submitted to ACM Computing Surveys

    ACM Class: I.2.0; E.m

  4. arXiv:2308.00891  [pdf, other]

    cs.DC

    PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems

    Authors: Runzhou Han, Mai Zheng, Suren Byna, Houjun Tang, Bin Dong, Dong Dai, Yong Chen, Dongkyun Kim, Joseph Hassoun, David Thorsley, Matthew Wolf

    Abstract: Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot address the challenges due to their incompatible provenance models and/or system implementations. In this paper, we analyze four represen…

    Submitted 1 August, 2023; originally announced August 2023.

  5. arXiv:2207.09503  [pdf, other]

    cs.DC

    A Comparison of HDF5, Zarr, and netCDF4 in Performing Common I/O Operations

    Authors: Sriniket Ambatipudi, Suren Byna

    Abstract: Scientific data is often stored in files because of the simplicity they provide in managing, transferring, and sharing data. These files are typically structured in a specific arrangement and contain metadata to understand the structure the data is stored in. There are numerous file formats in use in various scientific domains that provide abstractions for storing and retrieving data. With the abu…

    Submitted 5 February, 2023; v1 submitted 19 July, 2022; originally announced July 2022.

    Comments: 6 pages, 12 figures

  6. arXiv:2206.14761  [pdf, other]

    cs.DC cs.PF

    Accelerating Parallel Write via Deeply Integrating Predictive Lossy Compression with HDF5

    Authors: Sian Jin, Dingwen Tao, Houjun Tang, Sheng Di, Suren Byna, Zarija Lukic, Franck Cappello

    Abstract: Lossy compression is one of the most efficient solutions to reduce storage overhead and improve I/O performance for HPC applications. However, existing parallel I/O libraries cannot fully utilize lossy compression to accelerate parallel write due to the lack of deep understanding on compression-write performance. To this end, we propose to deeply integrate predictive lossy compression with HDF5 to…

    Submitted 29 June, 2022; originally announced June 2022.

    Comments: 13 pages, 18 figures, accepted by ACM/IEEE SC'22

  7. arXiv:2206.11992  [pdf]

    cs.DC

    The LBNL Superfacility Project Report

    Authors: Deborah Bard, Cory Snavely, Lisa Gerhardt, Jason Lee, Becci Totzke, Katie Antypas, William Arndt, Johannes Blaschke, Suren Byna, Ravi Cheema, Shreyas Cholia, Mark Day, Bjoern Enders, Aditi Gaur, Annette Greiner, Taylor Groves, Mariam Kiran, Quincey Koziol, Tom Lehman, Kelly Rowland, Chris Samuel, Ashwin Selvarajan, Alex Sim, David Skinner, Laurie Stephey, et al. (2 additional authors not shown)

    Abstract: The Superfacility model is designed to leverage HPC for experimental science. It is more than simply a model of connected experiment, network, and HPC facilities; it encompasses the full ecosystem of infrastructure, software, tools, and expertise needed to make connected facilities easy to use. The three-year Lawrence Berkeley National Laboratory (LBNL) Superfacility project was initiated in 2019…

    Submitted 27 June, 2022; v1 submitted 23 June, 2022; originally announced June 2022.

    Comments: 85 pages, 23 figures

    Report number: UCPMS ID: 3815358

  8. arXiv:2111.09815  [pdf, other]

    cs.DB cs.DC

    Improving Prediction-Based Lossy Compression Dramatically via Ratio-Quality Modeling

    Authors: Sian Jin, Sheng Di, Jiannan Tian, Suren Byna, Dingwen Tao, Franck Cappello

    Abstract: Error-bounded lossy compression is one of the most effective techniques for scientific data reduction. However, the traditional trial-and-error approach used to configure lossy compressors for finding the optimal trade-off between reconstructed data quality and compression ratio is prohibitively expensive. To resolve this issue, we develop a general-purpose analytical ratio-quality model based on…

    Submitted 5 May, 2022; v1 submitted 18 November, 2021; originally announced November 2021.

    Comments: 14 pages, 14 figures, published by IEEE ICDE 2022

  9. arXiv:2105.12929  [pdf, other]

    cs.DC

    Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights

    Authors: Bo Fang, Daoce Wang, Sian Jin, Quincey Koziol, Zhao Zhang, Qiang Guan, Suren Byna, Sriram Krishnamoorthy, Dingwen Tao

    Abstract: In recent years, the increasing complexity in scientific simulations and emerging demands for training heavy artificial intelligence models require massive and fast data accesses, which urges high-performance computing (HPC) platforms to equip with more advanced storage infrastructures such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges faced by the…

    Submitted 2 August, 2021; v1 submitted 26 May, 2021; originally announced May 2021.

    Comments: 12 pages, 9 figures, 4 tables, accepted by IEEE Cluster'21

  10. arXiv:1702.08327  [pdf, ps, other]

    cs.DB

    ArrayBridge: Interweaving declarative array processing with high-performance computing

    Authors: Haoyuan Xing, Sofoklis Floratos, Spyros Blanas, Suren Byna, Prabhat, Kesheng Wu, Paul Brown

    Abstract: Scientists are increasingly turning to datacenter-scale computers to produce and analyze massive arrays. Despite decades of database research that extols the virtues of declarative query processing, scientists still write, debug and parallelize imperative HPC kernels even for the most mundane queries. This impedance mismatch has been partly attributed to the cumbersome data loading process; in res…

    Submitted 27 February, 2017; originally announced February 2017.

    Comments: 12 pages, 13 figures

    ACM Class: H.2.8

  11. PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures

    Authors: Md. Mostofa Ali Patwary, Nadathur Rajagopalan Satish, Narayanan Sundaram, Jialin Liu, Peter Sadowski, Evan Racah, Suren Byna, Craig Tull, Wahid Bhimji, Prabhat, Pradeep Dubey

    Abstract: Computing $k$-Nearest Neighbors (KNN) is one of the core kernels used in many machine learning, data mining and scientific computing applications. Although kd-tree based $O(\log n)$ algorithms have been proposed for computing KNN, due to its inherent sequentiality, linear algorithms are being used in practice. This limits the applicability of such methods to millions of data points, with limited s…

    Submitted 27 July, 2016; originally announced July 2016.

    Comments: 11 pages in PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures, Md. Mostofa Ali Patwary et al., IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016

  12. arXiv:1503.08482  [pdf, ps, other]

    cs.DB

    Towards Exascale Scientific Metadata Management

    Authors: Spyros Blanas, Surendra Byna

    Abstract: Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and the analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines hav…

    Submitted 29 March, 2015; originally announced March 2015.