Search | arXiv e-print repository

arXiv:2206.11992 [pdf]

The LBNL Superfacility Project Report

Authors: Deborah Bard, Cory Snavely, Lisa Gerhardt, Jason Lee, Becci Totzke, Katie Antypas, William Arndt, Johannes Blaschke, Suren Byna, Ravi Cheema, Shreyas Cholia, Mark Day, Bjoern Enders, Aditi Gaur, Annette Greiner, Taylor Groves, Mariam Kiran, Quincey Koziol, Tom Lehman, Kelly Rowland, Chris Samuel, Ashwin Selvarajan, Alex Sim, David Skinner, Laurie Stephey , et al. (2 additional authors not shown)

Abstract: The Superfacility model is designed to leverage HPC for experimental science. It is more than simply a model of connected experiment, network, and HPC facilities; it encompasses the full ecosystem of infrastructure, software, tools, and expertise needed to make connected facilities easy to use. The three-year Lawrence Berkeley National Laboratory (LBNL) Superfacility project was initiated in 2019… ▽ More The Superfacility model is designed to leverage HPC for experimental science. It is more than simply a model of connected experiment, network, and HPC facilities; it encompasses the full ecosystem of infrastructure, software, tools, and expertise needed to make connected facilities easy to use. The three-year Lawrence Berkeley National Laboratory (LBNL) Superfacility project was initiated in 2019 to coordinate work being performed at LBNL to support this model, and to provide a coherent and comprehensive set of science requirements to drive existing and new work. A key component of the project was the in-depth engagements with eight science teams that represent challenging use cases across the DOE Office of Science. By the close of the project, we met our project goal by enabling our science application engagements to demonstrate automated pipelines that analyze data from remote facilities at large scale, without routine human intervention. In several cases, we have gone beyond demonstrations and now provide production-level services. To achieve this goal, the Superfacility team developed tools, infrastructure, and policies for near-real-time computing support, dynamic high-performance networking, data management and movement tools, API-driven automation, HPC-scale notebooks via Jupyter, authentication using Federated Identity and container-based edge services supported. The lessons we learned during this project provide a valuable model for future large, complex, cross-disciplinary collaborations. There is a pressing need for a coherent computing infrastructure across national facilities, and LBNL's Superfacility project is a unique model for success in tackling the challenges that will be faced in hardware, software, policies, and services across multiple science domains. △ Less

Submitted 27 June, 2022; v1 submitted 23 June, 2022; originally announced June 2022.

Comments: 85 pages, 23 figures

Report number: UCPMS ID: 3815358 UCPMS ID: 3815358 UCPMS ID: 3815358 UCPMS ID: 3815358UCPMS ID: 3815358 UCPMS ID: 3815358

arXiv:2205.01168 [pdf, other]

A Case Study on Parallel HDF5 Dataset Concatenation for High Energy Physics Data Analysis

Authors: Sunwoo Lee, Kai-yuan Hou, Kewei Wang, Saba Sehrish, Marc Paterno, James Kowalkowski, Quincey Koziol, Robert Ross, Ankit Agrawal, Alok Choudhary, Wei-keng Liao

Abstract: In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, helps us better understand the fundamental particles and their interactions. This data is often captured in many files of small size, creating a data management challenge for scientists. In order to better facilitate data management, transfer, and analysis on large scale platforms, it is advantageous… ▽ More In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, helps us better understand the fundamental particles and their interactions. This data is often captured in many files of small size, creating a data management challenge for scientists. In order to better facilitate data management, transfer, and analysis on large scale platforms, it is advantageous to aggregate data further into a smaller number of larger files. However, this translation process can consume significant time and resources, and if performed incorrectly the resulting aggregated files can be inefficient for highly parallel access during analysis on large scale platforms. In this paper, we present our case study on parallel I/O strategies and HDF5 features for reducing data aggregation time, making effective use of compression, and ensuring efficient access to the resulting data during analysis at scale. We focus on NOvA detector data in this case study, a large-scale HEP experiment generating many terabytes of data. The lessons learned from our case study inform the handling of similar datasets, thus expanding community knowledge related to this common data management task. △ Less

Submitted 2 May, 2022; originally announced May 2022.

arXiv:2105.12929 [pdf, other]

Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights

Authors: Bo Fang, Daoce Wang, Sian **, Quincey Koziol, Zhao Zhang, Qiang Guan, Suren Byna, Sriram Krishnamoorthy, Dingwen Tao

Abstract: In recent years, the increasing complexity in scientific simulations and emerging demands for training heavy artificial intelligence models require massive and fast data accesses, which urges high-performance computing (HPC) platforms to equip with more advanced storage infrastructures such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges faced by the… ▽ More In recent years, the increasing complexity in scientific simulations and emerging demands for training heavy artificial intelligence models require massive and fast data accesses, which urges high-performance computing (HPC) platforms to equip with more advanced storage infrastructures such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges faced by the HPC applications under the SSD-related failures remains unclear, in particular for failures resulting in data corruptions. The goal of this paper is to understand the impact of SSD-related faults on the behaviors of complex HPC applications. To this end, we propose FFIS, a FUSE-based fault injection framework that systematically introduces storage faults into the application layer to model the errors originated from SSDs. FFIS is able to plant different I/O related faults into the data returned from underlying file systems, which enables the investigation on the error resilience characteristics of the scientific file format. We demonstrate the use of FFIS with three representative real HPC applications, showing how each application reacts to the data corruptions, and provide insights on the error resilience of the widely adopted HDF5 file format for the HPC applications. △ Less

Submitted 2 August, 2021; v1 submitted 26 May, 2021; originally announced May 2021.

Comments: 12 pages, 9 figures, 4 tables, accepted by IEEE Cluster'21

arXiv:2007.01789 [pdf, other]

Map** Datasets to Object Storage System

Authors: Xiaowei, Chu, Jeff LeFevre, Aldrin Montana, Dana Robinson, Quincey Koziol, Peter Alvaro, Carlos Maltzahn

Abstract: Access libraries such as ROOT and HDF5 allow users to interact with datasets using high level abstractions, like coordinate systems and associated slicing operations. Unfortunately, the implementations of access libraries are based on outdated assumptions about storage systems interfaces and are generally unable to fully benefit from modern fast storage devices. The situation is getting worse with… ▽ More Access libraries such as ROOT and HDF5 allow users to interact with datasets using high level abstractions, like coordinate systems and associated slicing operations. Unfortunately, the implementations of access libraries are based on outdated assumptions about storage systems interfaces and are generally unable to fully benefit from modern fast storage devices. The situation is getting worse with rapidly evolving storage devices such as non-volatile memory and ever larger datasets. This project explores distributed dataset map** infrastructures that can integrate and scale out existing access libraries using Ceph's extensible object model, avoiding re-implementation or even modifications of these access libraries as much as possible. These programmable storage extensions coupled with our distributed dataset map** techniques enable: 1) access library operations to be offloaded to storage system servers, 2) the independent evolution of access libraries and storage systems and 3) fully leveraging of the existing load balancing, elasticity, and failure management of distributed storage systems like Ceph. They also create more opportunities to conduct storage server-local optimizations specific to storage servers. For example, storage servers might include local key/value stores combined with chunk stores that require different optimizations than a local file system. As storage servers evolve to support new storage devices like non-volatile memory, these server-local optimizations can be implemented while minimizing disruptions to applications. We will report progress on the means by which distributed dataset map** can be abstracted over particular access libraries, including access libraries for ROOT data, and how we address some of the challenges revolving around data partitioning and composability of access operations. △ Less

Submitted 3 July, 2020; originally announced July 2020.

Journal ref: In 24th International Conference on Computing in High Energy & Nuclear Physics, Adelaide, Australia, November 4-8 2019

arXiv:1712.00423 [pdf, other]

DAOS for Extreme-scale Systems in Scientific Applications

Authors: M. Scot Breitenfeld, Neil Fortner, Jordan Henderson, Jerome Soumagne, Mohamad Chaarawi, Johann Lombardi, Quincey Koziol

Abstract: Exascale I/O initiatives will require new and fully integrated I/O models which are capable of providing straightforward functionality, fault tolerance and efficiency. One solution is the Distributed Asynchronous Object Storage (DAOS) technology, which is primarily designed to handle the next generation NVRAM and NVMe technologies envisioned for providing a high bandwidth/IOPS storage tier close t… ▽ More Exascale I/O initiatives will require new and fully integrated I/O models which are capable of providing straightforward functionality, fault tolerance and efficiency. One solution is the Distributed Asynchronous Object Storage (DAOS) technology, which is primarily designed to handle the next generation NVRAM and NVMe technologies envisioned for providing a high bandwidth/IOPS storage tier close to the compute nodes in an HPC system. In conjunction with DAOS, the HDF5 library, an I/O library for scientific applications, will support end-to-end data integrity, fault tolerance, object map**, index building and querying. This paper details the implementation and performance of the HDF5 library built over DAOS by using three representative scientific application codes. △ Less

Submitted 1 December, 2017; originally announced December 2017.

Comments: Submitted to HiPC-2017 on Jun 30 2017, accepted for publication on Sep 8 2017, withdrawn on Oct 20 2017 b/c no author was able to present

arXiv:1510.02135 [pdf, other]

A Remote Procedure Call Approach for Extreme-scale Services

Authors: Jerome Soumagne, Philip H. Carns, Dries Kimpe, Quincey Koziol, Robert B. Ross

Abstract: When working at exascale, the various constraints imposed by the extreme scale of the system bring new challenges for application users and software/middleware developers. In that context, and to provide best performance, resiliency and energy efficiency, software may be provided as a service oriented approach, adjusting resource utilization to best meet facility and user requirements. Remote proc… ▽ More When working at exascale, the various constraints imposed by the extreme scale of the system bring new challenges for application users and software/middleware developers. In that context, and to provide best performance, resiliency and energy efficiency, software may be provided as a service oriented approach, adjusting resource utilization to best meet facility and user requirements. Remote procedure call (RPC) is a technique that originally followed a client/server model and allowed local calls to be transparently executed on remote resources. RPC consists of serializing the local function parameters into a memory buffer and sending that buffer to a remote target that in turn deserializes the parameters and executes the corresponding function call, returning the result back to the caller. Building reusable services requires the definition of a communication model to remotely access these services and for this purpose, RPC can serve as a foundation for accessing them. We introduce the necessary building blocks to enable this ecosystem to software and middleware developers with an RPC framework called Mercury. △ Less

Submitted 5 October, 2015; originally announced October 2015.

Comments: CSESSP 2015

Showing 1–6 of 6 results for author: Koziol, Q