-
Design and Implementation of an Analysis Pipeline for Heterogeneous Data
Authors:
Arup Kumar Sarker,
Aymen Alsaadi,
Niranda Perera,
Mills Staylor,
Gregor von Laszewski,
Matteo Turilli,
Ozgur Ozan Kilic,
Mikhail Titov,
Andre Merzky,
Shantenu Jha,
Geoffrey Fox
Abstract:
Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields like genomics, climate modeling, and astronomy. A large-scale solution like Google Pathways with a distributed execution environment for deep learning models exists but is proprietary. In…
▽ More
Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields like genomics, climate modeling, and astronomy. A large-scale solution like Google Pathways with a distributed execution environment for deep learning models exists but is proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms is crucial to address these challenges. Our objective is to establish a smooth and unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various high-performance computing platforms, including cloud and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can utilize heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system with a parallel and distributed data framework to execute Cylon as a task of Radical Pilot. We thoroughly explain Radical-Cylon's design and development and the execution process of Cylon tasks using Radical Pilot. This approach enables the use of heterogeneous MPI-communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead. Radical-Cylon achieves (4~15)% faster execution time than batch execution while performing similar join and sort operations with 35 million and 3.5 billion rows with the same resources. The approach aims to excel in both scientific and engineering research HPC systems while demonstrating robust performance on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community.
△ Less
Submitted 7 April, 2024; v1 submitted 23 March, 2024;
originally announced March 2024.
-
MLCommons Cloud Masking Benchmark with Early Stop**
Authors:
Varshitha Chennamsetti,
Gregor von Laszewski,
Ruochen Gu,
Laiba Mehnaz,
Juri Papay,
Samuel Jackson,
Jeyan Thiyagalingam,
Sergey V. Samsonau,
Geoffrey C. Fox
Abstract:
In this paper, we report on work performed for the MLCommons Science Working Group on the cloud masking benchmark. MLCommons is a consortium that develops and maintains several scientific benchmarks that aim to benefit developments in AI. The benchmarks are conducted on the High Performance Computing (HPC) Clusters of New York University and University of Virginia, as well as a commodity desktop.…
▽ More
In this paper, we report on work performed for the MLCommons Science Working Group on the cloud masking benchmark. MLCommons is a consortium that develops and maintains several scientific benchmarks that aim to benefit developments in AI. The benchmarks are conducted on the High Performance Computing (HPC) Clusters of New York University and University of Virginia, as well as a commodity desktop. We provide a description of the cloud masking benchmark, as well as a summary of our submission to MLCommons on the benchmark experiment we conducted. It includes a modification to the reference implementation of the cloud masking benchmark enabling early stop**. This benchmark is executed on the NYU HPC through a custom batch script that runs the various experiments through the batch queuing system while allowing for variation on the number of epochs trained. Our submission includes the modified code, a custom batch script to modify epochs, documentation, and the benchmark results. We report the highest accuracy (scientific metric) and the average time taken (performance metric) for training and inference that was achieved on NYU HPC Greene. We also provide a comparison of the compute capabilities between different systems by running the benchmark for one epoch. Our submission can be found in a Globus repository that is accessible to MLCommons Science Working Group.
△ Less
Submitted 30 May, 2024; v1 submitted 11 December, 2023;
originally announced January 2024.
-
An Overview of MLCommons Cloud Mask Benchmark: Related Research and Data
Authors:
Gregor von Laszewski,
Ruochen Gu
Abstract:
Cloud masking is a crucial task that is well-motivated for meteorology and its applications in environmental and atmospheric sciences. Its goal is, given satellite images, to accurately generate cloud masks that identify each pixel in image to contain either cloud or clear sky. In this paper, we summarize some of the ongoing research activities in cloud masking, with a focus on the research and be…
▽ More
Cloud masking is a crucial task that is well-motivated for meteorology and its applications in environmental and atmospheric sciences. Its goal is, given satellite images, to accurately generate cloud masks that identify each pixel in image to contain either cloud or clear sky. In this paper, we summarize some of the ongoing research activities in cloud masking, with a focus on the research and benchmark currently conducted in MLCommons Science Working Group. This overview is produced with the hope that others will have an easier time getting started and collaborate on the activities related to MLCommons Cloud Mask Benchmark.
△ Less
Submitted 7 December, 2023;
originally announced December 2023.
-
Whitepaper on Reusable Hybrid and Multi-Cloud Analytics Service Framework
Authors:
Gregor von Laszewski,
Wo Chang,
Russell Reinsch,
Olivera Kotevska,
Ali Karimi,
Abdul Rahman Sattar,
Garry Mazzaferro,
Geoffrey C. Fox
Abstract:
Over the last several years, the computation landscape for conducting data analytics has completely changed. While in the past, a lot of the activities have been undertaken in isolation by companies, and research institutions, today's infrastructure constitutes a wealth of services offered by a variety of providers that offer opportunities for reuse, and interactions while leveraging service colla…
▽ More
Over the last several years, the computation landscape for conducting data analytics has completely changed. While in the past, a lot of the activities have been undertaken in isolation by companies, and research institutions, today's infrastructure constitutes a wealth of services offered by a variety of providers that offer opportunities for reuse, and interactions while leveraging service collaboration, and service cooperation.
This document focuses on expanding analytics services to develop a framework for reusable hybrid multi-service data analytics. It includes (a) a short technology review that explicitly targets the intersection of hybrid multi-provider analytics services, (b) a small motivation based on use cases we looked at, (c) enhancing the concepts of services to showcase how hybrid, as well as multi-provider services can be integrated and reused via the proposed framework, (d) address analytics service composition, and (e) integrate container technologies to achieve state-of-the-art analytics service deployment
△ Less
Submitted 25 October, 2023;
originally announced October 2023.
-
In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes
Authors:
Niranda Perera,
Arup Kumar Sarker,
Mills Staylor,
Gregor von Laszewski,
Kaiying Shan,
Supun Kamburugamuve,
Chathura Widanage,
Vibhatha Abeykoon,
Thejaka Amila Kanewela,
Geoffrey Fox
Abstract:
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amoun…
▽ More
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its e fficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de-facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.
△ Less
Submitted 3 July, 2023;
originally announced July 2023.
-
Hybrid Reusable Computational Analytics Workflow Management with Cloudmesh
Authors:
Gregor von Laszewski,
J. P. Fleischer,
Geoffrey C. Fox
Abstract:
In this paper, we summarize our effort to create and utilize a simple framework to coordinate computational analytics tasks with the help of a workflow system. Our design is based on a minimalistic approach while at the same time allowing to access computational resources offered through the owner's computer, HPC computing centers, cloud resources, and distributed systems in general. The access to…
▽ More
In this paper, we summarize our effort to create and utilize a simple framework to coordinate computational analytics tasks with the help of a workflow system. Our design is based on a minimalistic approach while at the same time allowing to access computational resources offered through the owner's computer, HPC computing centers, cloud resources, and distributed systems in general. The access to this framework includes a simple GUI for monitoring and managing the workflow, a REST service, a command line interface, as well as a Python interface. The resulting framework was developed for several examples targeting benchmarks of AI applications on hybrid compute resources and as an educational tool for teaching scientists and students sophisticated concepts to execute computations on resources ranging from a single computer to many thousands of computers as part of on-premise and cloud infrastructure. We demonstrate the usefulness of the tool on a number of examples. The code is available as an open-source project in GitHub and is based on an easy-to-enhance tool called cloudmesh.
△ Less
Submitted 30 October, 2022;
originally announced October 2022.
-
Time Series Analysis of Blockchain-Based Cryptocurrency Price Changes
Authors:
Jacques Fleischer,
Gregor von Laszewski,
Carlos Theran,
Yohn Jairo Parra Bautista
Abstract:
In this paper we apply neural networks and Artificial Intelligence (AI) to historical records of high-risk cryptocurrency coins to train a prediction model that guesses their price. This paper's code contains Jupyter notebooks, one of which outputs a timeseries graph of any cryptocurrency price once a CSV file of the historical data is inputted into the program. Another Jupyter notebook trains an…
▽ More
In this paper we apply neural networks and Artificial Intelligence (AI) to historical records of high-risk cryptocurrency coins to train a prediction model that guesses their price. This paper's code contains Jupyter notebooks, one of which outputs a timeseries graph of any cryptocurrency price once a CSV file of the historical data is inputted into the program. Another Jupyter notebook trains an LSTM, or a long short-term memory model, to predict a cryptocurrency's closing price. The LSTM is fed the close price, which is the price that the currency has at the end of the day, so it can learn from those values. The notebook creates two sets: a training set and a test set to assess the accuracy of the results.
The data is then normalized using manual min-max scaling so that the model does not experience any bias; this also enhances the performance of the model. Then, the model is trained using three layers -- an LSTM, dropout, and dense layer-minimizing the loss through 50 epochs of training; from this training, a recurrent neural network (RNN) is produced and fitted to the training set. Additionally, a graph of the loss over each epoch is produced, with the loss minimizing over time. Finally, the notebook plots a line graph of the actual currency price in red and the predicted price in blue. The process is then repeated for several more cryptocurrencies to compare prediction models. The parameters for the LSTM, such as number of epochs and batch size, are tweaked to try and minimize the root mean square error.
△ Less
Submitted 18 February, 2022;
originally announced February 2022.
-
HPTMT Parallel Operators for High Performance Data Science & Data Engineering
Authors:
Vibhatha Abeykoon,
Supun Kamburugamuve,
Chathura Widanage,
Niranda Perera,
Ahmet Uyar,
Thejaka Amila Kanewala,
Gregor von Laszewski,
Geoffrey Fox
Abstract:
Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient data abstractions and operators that suit the applications of different domains. Often lack of a clear definition of data structures and operators in the field ha…
▽ More
Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient data abstractions and operators that suit the applications of different domains. Often lack of a clear definition of data structures and operators in the field has led to other implementations that do not work well together. The HPTMT architecture that we proposed recently, identifies a set of data structures, operators, and an execution model for creating rich data applications that links all aspects of data engineering and data science together efficiently. This paper elaborates and illustrates this architecture using an end-to-end application with deep learning and data engineering parts working together.
△ Less
Submitted 12 August, 2021;
originally announced August 2021.
-
HPTMT: Operator-Based Architecture for Scalable High-Performance Data-Intensive Frameworks
Authors:
Supun Kamburugamuve,
Chathura Widanage,
Niranda Perera,
Vibhatha Abeykoon,
Ahmet Uyar,
Thejaka Amila Kanewala,
Gregor von Laszewski,
Geoffrey Fox
Abstract:
Data-intensive applications impact many domains, and their steadily increasing size and complexity demands high-performance, highly usable environments. We integrate a set of ideas developed in various data science and data engineering frameworks. They employ a set of operators on specific data abstractions that include vectors, matrices, tensors, graphs, and tables. Our key concepts are inspired…
▽ More
Data-intensive applications impact many domains, and their steadily increasing size and complexity demands high-performance, highly usable environments. We integrate a set of ideas developed in various data science and data engineering frameworks. They employ a set of operators on specific data abstractions that include vectors, matrices, tensors, graphs, and tables. Our key concepts are inspired from systems like MPI, HPF (High-Performance Fortran), NumPy, Pandas, Spark, Modin, PyTorch, TensorFlow, RAPIDS(NVIDIA), and OneAPI (Intel). Further, it is crucial to support different languages in everyday use in the Big Data arena, including Python, R, C++, and Java. We note the importance of Apache Arrow and Parquet for enabling language agnostic high performance and interoperability. In this paper, we propose High-Performance Tensors, Matrices and Tables (HPTMT), an operator-based architecture for data-intensive applications, and identify the fundamental principles needed for performance and usability success. We illustrate these principles by a discussion of examples using our software environments, Cylon and Twister2 that embody HPTMT.
△ Less
Submitted 29 July, 2021; v1 submitted 27 July, 2021;
originally announced July 2021.
-
AICov: An Integrative Deep Learning Framework for COVID-19 Forecasting with Population Covariates
Authors:
Geoffrey C. Fox,
Gregor von Laszewski,
Fugang Wang,
Saumyadipta Pyne
Abstract:
The COVID-19 pandemic has profound global consequences on health, economic, social, political, and almost every major aspect of human life. Therefore, it is of great importance to model COVID-19 and other pandemics in terms of the broader social contexts in which they take place. We present the architecture of AICov, which provides an integrative deep learning framework for COVID-19 forecasting wi…
▽ More
The COVID-19 pandemic has profound global consequences on health, economic, social, political, and almost every major aspect of human life. Therefore, it is of great importance to model COVID-19 and other pandemics in terms of the broader social contexts in which they take place. We present the architecture of AICov, which provides an integrative deep learning framework for COVID-19 forecasting with population covariates, some of which may serve as putative risk factors. We have integrated multiple different strategies into AICov, including the ability to use deep learning strategies based on LSTM and even modeling. To demonstrate our approach, we have conducted a pilot that integrates population covariates from multiple sources. Thus, AICov not only includes data on COVID-19 cases and deaths but, more importantly, the population's socioeconomic, health and behavioral risk factors at a local level. The compiled data are fed into AICov, and thus we obtain improved prediction by integration of the data to our model as compared to one that only uses case and death data.
△ Less
Submitted 8 October, 2020;
originally announced October 2020.