Search | arXiv e-print repository

Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability

Authors: Renan Souza, Tyler J. Skluzacek, Sean R. Wilkinson, Maxim Ziatdinov, Rafael Ferreira da Silva

Abstract: Modern large-scale scientific discovery requires multidisciplinary collaboration across diverse computing facilities, including High Performance Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data analysis plays a crucial role in scientific discovery, especially in the current AI era, by enabling Responsible AI development, FAIR, Reproducibility, and User Steering. However, the heterogeneous nature of science poses challenges such as dealing with multiple supporting tools, cross-facility environments, and efficient HPC execution. Building on data observability, adapter system design, and provenance, we propose MIDA: an approach for lightweight runtime Multi-workflow Integrated Data Analysis. MIDA defines data observability strategies and adaptability methods for various parallel systems and machine learning tools. With observability, it intercepts the dataflows in the background without requiring instrumentation while integrating domain, provenance, and telemetry data at runtime into a unified database ready for user steering queries. We conduct experiments showing end-to-end multi-workflow analysis integrating data from Dask and MLFlow in a real distributed deep learning use case for materials science that runs on multiple environments with up to 276 GPUs in parallel. We show near-zero overhead running up to 100,000 tasks on 1,680 CPU cores on the Summit supercomputer.

Submitted 17 August, 2023; originally announced August 2023.

Comments: 10 pages, 5 figures, 2 Listings, 42 references, Paper accepted at IEEE eScience'23

MSC Class: 65Y05; 68P15 ACM Class: I.2; H.2; C.4; J.2

Journal ref: 19th IEEE International Conference on e-Science (eScience) 2023 - Limassol, Cyprus

arXiv:2212.09616 [pdf, other]

doi 10.1109/BigData55660.2022.10020380

Pseudonymization at Scale: OLCF's Summit Usage Data Case Study

Authors: Ketan Maheshwari, Sean R. Wilkinson, Alex May, Tyler Skluzacek, Olga A. Kuchar, Rafael Ferreira da Silva

Abstract: The analysis of vast amounts of data and the processing of complex computational jobs have traditionally relied upon high performance computing (HPC) systems. Understanding these analyses' needs is paramount for designing solutions that can lead to better science, and similarly, understanding the characteristics of the user behavior on those systems is important for improving user experiences on HPC systems. A common approach to gathering data about user behavior is to analyze system log data available only to system administrators. Recently at Oak Ridge Leadership Computing Facility (OLCF), however, we unveiled user behavior about the Summit supercomputer by collecting data from a user's point of view with ordinary Unix commands. Here, we discuss the process, challenges, and lessons learned while preparing this dataset for publication and submission to an open data challenge. The original dataset contains personally identifiable information (PII) about OLCF users which needed to be masked prior to publication, and we determined that anonymization, which scrubs PII completely, destroyed too much of the structure of the data to be interesting for the data challenge. We instead chose to pseudonymize the dataset to reduce its linkability to users' identities. Pseudonymization is significantly more computationally expensive than anonymization, and the size of our dataset, approximately 175 million lines of raw text, necessitated the development of a parallelized workflow that could be reused on different HPC machines. We demonstrate the scaling behavior of the workflow on two leadership class HPC systems at OLCF, and we show that we were able to bring the overall makespan time from an impractical 20+ hours on a single node down to around 2 hours. As a result of this work, we release the entire pseudonymized dataset and make the workflows and source code publicly available.

Submitted 19 December, 2022; originally announced December 2022.

Comments: 9 pages, 5 figures, accepted to BTSD 2022 workshop (see https://sites.google.com/view/btsd2022 for more information), to be published in the proceedings of IEEE Big Data 2022

arXiv:2210.03170 [pdf, other]

doi 10.1109/PMBS56514.2022.00014

WfBench: Automated Generation of Scientific Workflow Benchmarks

Authors: Tainã Coleman, Henri Casanova, Ketan Maheshwari, Loïc Pottier, Sean R. Wilkinson, Justin Wozniak, Frédéric Suter, Mallikarjun Shankar, Rafael Ferreira da Silva

Abstract: The prevalence of scientific workflows with high computational demands calls for their execution on various distributed computing platforms, including large-scale leadership-class high-performance computing (HPC) clusters. To handle the deployment, monitoring, and optimization of workflow executions, many workflow systems have been developed over the past decade. There is a need for workflow benchmarks that can be used to evaluate the performance of workflow systems on current and future software stacks and hardware platforms. We present a generator of realistic workflow benchmark specifications that can be translated into benchmark code to be executed with current workflow systems. Our approach generates workflow tasks with arbitrary performance characteristics (CPU, memory, and I/O usage) and with realistic task dependency structures based on those seen in production workflows. We present experimental results that show that our approach generates benchmarks that are representative of production workflows, and conduct a case study to demonstrate the use and usefulness of our generated benchmarks to evaluate the performance of workflow systems under different configuration scenarios.

Submitted 6 October, 2022; originally announced October 2022.

arXiv:2209.09022 [pdf, ps, other]

doi 10.1109/eScience55777.2022.00090

F*** workflows: when parts of FAIR are missing

Authors: Sean R. Wilkinson, Greg Eisenhauer, Anuj J. Kapadia, Kathryn Knight, Jeremy Logan, Patrick Widener, Matthew Wolf

Abstract: The FAIR principles for scientific data (Findable, Accessible, Interoperable, Reusable) are also relevant to other digital objects such as research software and scientific workflows that operate on scientific data. The FAIR principles can be applied to the data being handled by a scientific workflow as well as the processes, software, and other infrastructure which are necessary to specify and execute a workflow. The FAIR principles were designed as guidelines, rather than rules, that would allow for differences in standards for different communities and for different degrees of compliance. There are many practical considerations which impact the level of FAIR-ness that can actually be achieved, including policies, traditions, and technologies. Because of these considerations, obstacles are often encountered during the workflow lifecycle that trace directly to shortcomings in the implementation of the FAIR principles. Here, we detail some cases, without naming names, in which data and workflows were Findable but otherwise lacking in areas commonly needed and expected by modern FAIR methods, tools, and users. We describe how some of these problems, all of which were overcome successfully, have motivated us to push on systems and approaches for fully FAIR workflows.

Submitted 19 September, 2022; originally announced September 2022.

Comments: 6 pages, 0 figures, accepted to ERROR 2022 workshop (see https://error-workshop.org/ for more information), to be published in proceedings of IEEE eScience 2022

arXiv:2204.08354 [pdf, other]

doi 10.1007/978-3-031-08751-6_37

Unveiling User Behavior on Summit Login Nodes as a User

Authors: Sean R. Wilkinson, Ketan Maheshwari, Rafael Ferreira da Silva

Abstract: We observe and analyze usage of the login nodes of the leadership class Summit supercomputer from the perspective of an ordinary user -- not a system administrator -- by periodically sampling user activities (job queues, running processes, etc.) for two full years (2020-2021). Our findings unveil key usage patterns that evidence misuse of the system, including gaming the policies, impairing I/O performance, and using login nodes as a sole computing resource. Our analysis highlights observed patterns for the execution of complex computations (workflows), which are key for processing large-scale applications.

Submitted 18 April, 2022; originally announced April 2022.

Comments: International Conference on Computational Science (ICCS), 2022

Showing 1–5 of 5 results for author: Wilkinson, S R