Search | arXiv e-print repository

A Comprehensive Evaluation of Generative Models in Calorimeter Shower Simulation

Authors: Farzana Yasmin Ahmad, Vanamala Venkataswamy, Geoffrey Fox

Abstract: The pursuit of understanding fundamental particle interactions has reached unparalleled precision levels. Particle physics detectors play a crucial role in generating low-level object signatures that encode collision physics. However, simulating these particle collisions is a demanding task in terms of memory and computation which will be exasperated with larger data volumes, more complex detector… ▽ More The pursuit of understanding fundamental particle interactions has reached unparalleled precision levels. Particle physics detectors play a crucial role in generating low-level object signatures that encode collision physics. However, simulating these particle collisions is a demanding task in terms of memory and computation which will be exasperated with larger data volumes, more complex detectors, and a higher pileup environment in the High-Luminosity LHC. The introduction of "Fast Simulation" has been pivotal in overcoming computational bottlenecks. The use of deep-generative models has sparked a surge of interest in surrogate modeling for detector simulations, generating particle showers that closely resemble the observed data. Nonetheless, there is a pressing need for a comprehensive evaluation of their performance using a standardized set of metrics. In this study, we conducted a rigorous evaluation of three generative models using standard datasets and a diverse set of metrics derived from physics, computer vision, and statistics. Furthermore, we explored the impact of using full versus mixed precision modes during inference. Our evaluation revealed that the CaloDiffusion and CaloScore generative models demonstrate the most accurate simulation of particle showers, yet there remains substantial room for improvement. Our findings identified areas where the evaluated models fell short in accurately replicating Geant4 data. △ Less

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2403.15721 [pdf, other]

Design and Implementation of an Analysis Pipeline for Heterogeneous Data

Authors: Arup Kumar Sarker, Aymen Alsaadi, Niranda Perera, Mills Staylor, Gregor von Laszewski, Matteo Turilli, Ozgur Ozan Kilic, Mikhail Titov, Andre Merzky, Shantenu Jha, Geoffrey Fox

Abstract: Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields like genomics, climate modeling, and astronomy. A large-scale solution like Google Pathways with a distributed execution environment for deep learning models exists but is proprietary. In… ▽ More Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields like genomics, climate modeling, and astronomy. A large-scale solution like Google Pathways with a distributed execution environment for deep learning models exists but is proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms is crucial to address these challenges. Our objective is to establish a smooth and unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various high-performance computing platforms, including cloud and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can utilize heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system with a parallel and distributed data framework to execute Cylon as a task of Radical Pilot. We thoroughly explain Radical-Cylon's design and development and the execution process of Cylon tasks using Radical Pilot. This approach enables the use of heterogeneous MPI-communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead. Radical-Cylon achieves (4~15)% faster execution time than batch execution while performing similar join and sort operations with 35 million and 3.5 billion rows with the same resources. The approach aims to excel in both scientific and engineering research HPC systems while demonstrating robust performance on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community. △ Less

Submitted 7 April, 2024; v1 submitted 23 March, 2024; originally announced March 2024.

Comments: 14 pages, 16 figures, 2 tables

ACM Class: H.2.4; D.2.7; D.2.2

arXiv:2401.08636

MLCommons Cloud Masking Benchmark with Early Stop**

Authors: Varshitha Chennamsetti, Gregor von Laszewski, Ruochen Gu, Laiba Mehnaz, Juri Papay, Samuel Jackson, Jeyan Thiyagalingam, Sergey V. Samsonau, Geoffrey C. Fox

Abstract: In this paper, we report on work performed for the MLCommons Science Working Group on the cloud masking benchmark. MLCommons is a consortium that develops and maintains several scientific benchmarks that aim to benefit developments in AI. The benchmarks are conducted on the High Performance Computing (HPC) Clusters of New York University and University of Virginia, as well as a commodity desktop.… ▽ More In this paper, we report on work performed for the MLCommons Science Working Group on the cloud masking benchmark. MLCommons is a consortium that develops and maintains several scientific benchmarks that aim to benefit developments in AI. The benchmarks are conducted on the High Performance Computing (HPC) Clusters of New York University and University of Virginia, as well as a commodity desktop. We provide a description of the cloud masking benchmark, as well as a summary of our submission to MLCommons on the benchmark experiment we conducted. It includes a modification to the reference implementation of the cloud masking benchmark enabling early stop**. This benchmark is executed on the NYU HPC through a custom batch script that runs the various experiments through the batch queuing system while allowing for variation on the number of epochs trained. Our submission includes the modified code, a custom batch script to modify epochs, documentation, and the benchmark results. We report the highest accuracy (scientific metric) and the average time taken (performance metric) for training and inference that was achieved on NYU HPC Greene. We also provide a comparison of the compute capabilities between different systems by running the benchmark for one epoch. Our submission can be found in a Globus repository that is accessible to MLCommons Science Working Group. △ Less

Submitted 30 May, 2024; v1 submitted 11 December, 2023; originally announced January 2024.

Comments: NYU did not approve the publication of the paper

arXiv:2312.14199 [pdf, other]

Report on 2023 CyberTraining PI Meeting, 26-27 September 2023

Authors: Geoffrey Fox, Mary P Thomas, Sajal Bhatia, Marisa Brazil, Nicole M Gasparini, Venkatesh Mohan Merwade, Henry J. Neeman, Jeff Carver, Henri Casanova, Vipin Chaudhary, Dirk Colbry, Lonnie Crosby, Prasun Dewan, Jessica Eisma, Nicole M Gasparini, Ahmed Irfan, Kate Kaehey, Qianqian Liu, Zhen Ni, Sushil Prasad, Apan Qasem, Erik Saule, Prabha Sundaravadivel, Karen Tomko

Abstract: This document describes a two-day meeting held for the Principal Investigators (PIs) of NSF CyberTraining grants. The report covers invited talks, panels, and six breakout sessions. The meeting involved over 80 PIs and NSF program managers (PMs). The lessons recorded in detail in the report are a wealth of information that could help current and future PIs, as well as NSF PMs, understand the futur… ▽ More This document describes a two-day meeting held for the Principal Investigators (PIs) of NSF CyberTraining grants. The report covers invited talks, panels, and six breakout sessions. The meeting involved over 80 PIs and NSF program managers (PMs). The lessons recorded in detail in the report are a wealth of information that could help current and future PIs, as well as NSF PMs, understand the future directions suggested by the PI community. The meeting was held simultaneously with that of the PIs of the NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) program. This co-location led to two joint sessions: one with NSF speakers and the other on broader impact. Further, the joint poster and refreshment sessions benefited from the interactions between CSSI and CyberTraining PIs. △ Less

Submitted 28 December, 2023; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: 38 pages, 3 main sections and 2 Appendix sections, 2 figures, 19 tables; updated version: author corrections

arXiv:2312.02368 [pdf, other]

RINAS: Training with Dataset Shuffling Can Be General and Fast

Authors: Tianle Zhong, Jiechen Zhao, Xindi Guo, Qiang Su, Geoffrey Fox

Abstract: Deep learning datasets are expanding at an unprecedented pace, creating new challenges for data processing in model training pipelines. A crucial aspect of these pipelines is dataset shuffling, which significantly improves unbiased learning and convergence accuracy by adhering to the principles of random sampling. However, loading shuffled data for large datasets incurs significant overhead in the… ▽ More Deep learning datasets are expanding at an unprecedented pace, creating new challenges for data processing in model training pipelines. A crucial aspect of these pipelines is dataset shuffling, which significantly improves unbiased learning and convergence accuracy by adhering to the principles of random sampling. However, loading shuffled data for large datasets incurs significant overhead in the deep learning pipeline and severely impacts the end-to-end training throughput. To mitigate this, current deep learning systems often resort to partial dataset shuffling, sacrificing global randomness to maintain acceptable training throughput on large datasets, still leaving global shuffling efficiency issues not fully explored. In this work, we present RINAS, a data loading framework that systematically addresses the performance bottleneck of loading global shuffled datasets. Our key contribution is to offer an intra-batch unordered data fetching approach, which unleashes unexplored parallelism of data loading. We implement RINAS under the PyTorch framework for common dataset libraries HuggingFace and TorchVision. Our experimental results show that RINAS improves the throughput of general language model training and vision model training by up to 59% and 89%, respectively. △ Less

Submitted 4 December, 2023; originally announced December 2023.

arXiv:2311.01635 [pdf, other]

RTP: Rethinking Tensor Parallelism with Memory Deduplication

Authors: Cheng Luo, Tianle Zhong, Geoffrey Fox

Abstract: In the evolving landscape of neural network models, one prominent challenge stand out: the significant memory overheads associated with training expansive models. Addressing this challenge, this study delves deep into the Rotated Tensor Parallelism (RTP). RTP is an innovative approach that strategically focuses on memory deduplication in distributed training environments. It boasts of unique featu… ▽ More In the evolving landscape of neural network models, one prominent challenge stand out: the significant memory overheads associated with training expansive models. Addressing this challenge, this study delves deep into the Rotated Tensor Parallelism (RTP). RTP is an innovative approach that strategically focuses on memory deduplication in distributed training environments. It boasts of unique features like a customized communication primitive and the Flyweight Pattern initialization. Furthermore, RTP ensures a seamless overlap between partition computation and partition weight communication, optimizing the training process. Our empirical evaluations underscore RTP's efficiency, revealing that its memory consumption during distributed system training is remarkably close to the optimal - distributing the memory overhead of a single machine equitably among multiple machines. The experimental results demonstrate that RTP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of memory. Code of RTP is available at https://github.com/wdlctc/rtp. △ Less

Submitted 2 November, 2023; originally announced November 2023.

arXiv:2310.17013 [pdf, other]

Whitepaper on Reusable Hybrid and Multi-Cloud Analytics Service Framework

Authors: Gregor von Laszewski, Wo Chang, Russell Reinsch, Olivera Kotevska, Ali Karimi, Abdul Rahman Sattar, Garry Mazzaferro, Geoffrey C. Fox

Abstract: Over the last several years, the computation landscape for conducting data analytics has completely changed. While in the past, a lot of the activities have been undertaken in isolation by companies, and research institutions, today's infrastructure constitutes a wealth of services offered by a variety of providers that offer opportunities for reuse, and interactions while leveraging service colla… ▽ More Over the last several years, the computation landscape for conducting data analytics has completely changed. While in the past, a lot of the activities have been undertaken in isolation by companies, and research institutions, today's infrastructure constitutes a wealth of services offered by a variety of providers that offer opportunities for reuse, and interactions while leveraging service collaboration, and service cooperation. This document focuses on expanding analytics services to develop a framework for reusable hybrid multi-service data analytics. It includes (a) a short technology review that explicitly targets the intersection of hybrid multi-provider analytics services, (b) a small motivation based on use cases we looked at, (c) enhancing the concepts of services to showcase how hybrid, as well as multi-provider services can be integrated and reused via the proposed framework, (d) address analytics service composition, and (e) integrate container technologies to achieve state-of-the-art analytics service deployment △ Less

Submitted 25 October, 2023; originally announced October 2023.

arXiv:2307.01394 [pdf, ps, other]

In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes

Authors: Niranda Perera, Arup Kumar Sarker, Mills Staylor, Gregor von Laszewski, Kaiying Shan, Supun Kamburugamuve, Chathura Widanage, Vibhatha Abeykoon, Thejaka Amila Kanewela, Geoffrey Fox

Abstract: The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amoun… ▽ More The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its e fficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de-facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer. △ Less

Submitted 3 July, 2023; originally announced July 2023.

Report number: FGCS-D-23-00577R1

arXiv:2306.04025 [pdf, other]

Designing explainable artificial intelligence with active inference: A framework for transparent introspection and decision-making

Authors: Mahault Albarracin, Inês Hipólito, Safae Essafi Tremblay, Jason G. Fox, Gabriel René, Karl Friston, Maxwell J. D. Ramstead

Abstract: This paper investigates the prospect of develo** human-interpretable, explainable artificial intelligence (AI) systems based on active inference and the free energy principle. We first provide a brief overview of active inference, and in particular, of how it applies to the modeling of decision-making, introspection, as well as the generation of overt and covert actions. We then discuss how acti… ▽ More This paper investigates the prospect of develo** human-interpretable, explainable artificial intelligence (AI) systems based on active inference and the free energy principle. We first provide a brief overview of active inference, and in particular, of how it applies to the modeling of decision-making, introspection, as well as the generation of overt and covert actions. We then discuss how active inference can be leveraged to design explainable AI systems, namely, by allowing us to model core features of ``introspective'' processes and by generating useful, human-interpretable models of the processes involved in decision-making. We propose an architecture for explainable AI systems using active inference. This architecture foregrounds the role of an explicit hierarchical generative model, the operation of which enables the AI system to track and explain the factors that contribute to its own decisions, and whose structure is designed to be interpretable and auditable by human users. We outline how this architecture can integrate diverse sources of information to make informed decisions in an auditable manner, mimicking or reproducing aspects of human-like consciousness and introspection. Finally, we discuss the implications of our findings for future research in AI, and the potential ethical considerations of develo** AI systems with (the appearance of) introspective capabilities. △ Less

Submitted 6 June, 2023; originally announced June 2023.

arXiv:2302.03786 [pdf, other]

Analyzing the Performance of Deep Encoder-Decoder Networks as Surrogates for a Diffusion Equation

Authors: J. Quetzalcoatl Toledo-Marin, James A. Glazier, Geoffrey Fox

Abstract: Neural networks (NNs) have proven to be a viable alternative to traditional direct numerical algorithms, with the potential to accelerate computational time by several orders of magnitude. In the present paper we study the use of encoder-decoder convolutional neural network (CNN) as surrogates for steady-state diffusion solvers. The construction of such surrogates requires the selection of an appr… ▽ More Neural networks (NNs) have proven to be a viable alternative to traditional direct numerical algorithms, with the potential to accelerate computational time by several orders of magnitude. In the present paper we study the use of encoder-decoder convolutional neural network (CNN) as surrogates for steady-state diffusion solvers. The construction of such surrogates requires the selection of an appropriate task, network architecture, training set structure and size, loss function, and training algorithm hyperparameters. It is well known that each of these factors can have a significant impact on the performance of the resultant model. Our approach employs an encoder-decoder CNN architecture, which we posit is particularly well-suited for this task due to its ability to effectively transform data, as opposed to merely compressing it. We systematically evaluate a range of loss functions, hyperparameters, and training set sizes. Our results indicate that increasing the size of the training set has a substantial effect on reducing performance fluctuations and overall error. Additionally, we observe that the performance of the model exhibits a logarithmic dependence on the training set size. Furthermore, we investigate the effect on model performance by using different subsets of data with varying features. Our results highlight the importance of sampling the configurational space in an optimal manner, as this can have a significant impact on the performance of the model and the required training time. In conclusion, our results suggest that training a model with a pre-determined error performance bound is not a viable approach, as it does not guarantee that edge cases with errors larger than the bound do not exist. Furthermore, as most surrogate tasks involve a high dimensional landscape, an ever increasing training set size is, in principle, needed, however it is not a practical solution. △ Less

Submitted 7 February, 2023; originally announced February 2023.

Comments: 21 ps, 17 figs, 8 ts

arXiv:2301.07896 [pdf, other]

Supercharging Distributed Computing Environments For High Performance Data Engineering

Authors: Niranda Perera, Kaiying Shan, Supun Kamburugamuwe, Thejaka Amila Kanewela, Chathura Widanage, Arup Sarker, Mills Staylor, Tianle Zhong, Vibhatha Abeykoon, Geoffrey Fox

Abstract: The data engineering and data science community has embraced the idea of using Python & R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to process terabytes of data. They can easily exceed the capabilities of a single machine, but also demand significant developer time & effort. Therefore it is esse… ▽ More The data engineering and data science community has embraced the idea of using Python & R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to process terabytes of data. They can easily exceed the capabilities of a single machine, but also demand significant developer time & effort. Therefore it is essential to design scalable dataframe solutions. There have been multiple attempts to tackle this problem, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. Even though Dask/Ray distributed computing features look very promising, we perceive that the Dask Dataframes/Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask/Ray infrastructure (thereby supercharging them!). To achieve this, we integrate a high performance dataframe system Cylon, which was originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30x more distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance due to the native C++ execution of Cylon. We believe the success of Cylon & CylonFlow extends beyond the data engineering domain, and can be used to consolidate high performance computing and distributed computing ecosystems. △ Less

Submitted 19 January, 2023; originally announced January 2023.

arXiv:2212.13732 [pdf, ps, other]

Hybrid Cloud and HPC Approach to High-Performance Dataframes

Authors: Kaiying Shan, Niranda Perera, Damitha Lenadora, Tianle Zhong, Arup Sarker, Supun Kamburugamuve, Thejaka Amila Kanewela, Chathura Widanage, Geoffrey Fox

Abstract: Data pre-processing is a fundamental component in any data-driven application. With the increasing complexity of data processing operations and volume of data, Cylon, a distributed dataframe system, is developed to facilitate data processing both as a standalone application and as a library, especially for Python applications. While Cylon shows promising performance results, we experienced difficu… ▽ More Data pre-processing is a fundamental component in any data-driven application. With the increasing complexity of data processing operations and volume of data, Cylon, a distributed dataframe system, is developed to facilitate data processing both as a standalone application and as a library, especially for Python applications. While Cylon shows promising performance results, we experienced difficulties trying to integrate with frameworks incompatible with the traditional Message Passing Interface (MPI). While MPI implementations encompass scalable and efficient communication routines, their process launching mechanisms work well with mainstream HPC systems but are incompatible with some environments that adopt their own resource management systems. In this work, we alleviated this issue by directly integrating the Unified Communication X (UCX) framework, which supports a variety of classic HPC and non-HPC process-bootstrap** mechanisms as our communication framework. While we experimented with our methodology on Cylon, the same technique can be used to bring MPI communication to other applications that do not employ MPI's built-in process management approach. △ Less

Submitted 29 December, 2022; v1 submitted 28 December, 2022; originally announced December 2022.

arXiv:2212.01354 [pdf, other]

doi 10.1177/26339137231222481

Designing Ecosystems of Intelligence from First Principles

Authors: Karl J Friston, Maxwell J D Ramstead, Alex B Kiefer, Alexander Tschantz, Christopher L Buckley, Mahault Albarracin, Riddhi J Pitliya, Conor Heins, Brennan Klein, Beren Millidge, Dalton A R Sakthivadivel, Toby St Clere Smithe, Magnus Koudahl, Safae Essafi Tremblay, Capm Petersen, Kaiser Fung, Jason G Fox, Steven Swanson, Dan Mapes, Gabriel René

Abstract: This white paper lays out a vision of research and development in the field of artificial intelligence for the next decade (and beyond). Its denouement is a cyber-physical ecosystem of natural and synthetic sense-making, in which humans are integral participants -- what we call ''shared intelligence''. This vision is premised on active inference, a formulation of adaptive behavior that can be read… ▽ More This white paper lays out a vision of research and development in the field of artificial intelligence for the next decade (and beyond). Its denouement is a cyber-physical ecosystem of natural and synthetic sense-making, in which humans are integral participants -- what we call ''shared intelligence''. This vision is premised on active inference, a formulation of adaptive behavior that can be read as a physics of intelligence, and which inherits from the physics of self-organization. In this context, we understand intelligence as the capacity to accumulate evidence for a generative model of one's sensed world -- also known as self-evidencing. Formally, this corresponds to maximizing (Bayesian) model evidence, via belief updating over several scales: i.e., inference, learning, and model selection. Operationally, this self-evidencing can be realized via (variational) message passing or belief propagation on a factor graph. Crucially, active inference foregrounds an existential imperative of intelligent systems; namely, curiosity or the resolution of uncertainty. This same imperative underwrites belief sharing in ensembles of agents, in which certain aspects (i.e., factors) of each agent's generative world model provide a common ground or frame of reference. Active inference plays a foundational role in this ecology of belief sharing -- leading to a formal account of collective intelligence that rests on shared narratives and goals. We also consider the kinds of communication protocols that must be developed to enable such an ecosystem of intelligences and motivate the development of a shared hyper-spatial modeling language and transaction protocol, as a first -- and key -- step towards such an ecology. △ Less

Submitted 11 January, 2024; v1 submitted 2 December, 2022; originally announced December 2022.

Comments: 23+18 pages, one figure, one six page appendix

Journal ref: Collective Intelligence, 3(1), 2024

arXiv:2211.11402 [pdf, other]

doi 10.1145/3550454.3555469

An Implicit Parametric Morphable Dental Model

Authors: Congyi Zhang, Mohamed Elgharib, Gereon Fox, Min Gu, Christian Theobalt, Wen** Wang

Abstract: 3D Morphable models of the human body capture variations among subjects and are useful in reconstruction and editing applications. Current dental models use an explicit mesh scene representation and model only the teeth, ignoring the gum. In this work, we present the first parametric 3D morphable dental model for both teeth and gum. Our model uses an implicit scene representation and is learned fr… ▽ More 3D Morphable models of the human body capture variations among subjects and are useful in reconstruction and editing applications. Current dental models use an explicit mesh scene representation and model only the teeth, ignoring the gum. In this work, we present the first parametric 3D morphable dental model for both teeth and gum. Our model uses an implicit scene representation and is learned from rigidly aligned scans. It is based on a component-wise representation for each tooth and the gum, together with a learnable latent code for each of such components. It also learns a template shape thus enabling several applications such as segmentation, interpolation, and tooth replacement. Our reconstruction quality is on par with the most advanced global implicit representations while enabling novel applications. Project page: https://vcai.mpi-inf.mpg.de/projects/DMM/ △ Less

Submitted 21 November, 2022; originally announced November 2022.

arXiv:2210.16941 [pdf, other]

Hybrid Reusable Computational Analytics Workflow Management with Cloudmesh

Authors: Gregor von Laszewski, J. P. Fleischer, Geoffrey C. Fox

Abstract: In this paper, we summarize our effort to create and utilize a simple framework to coordinate computational analytics tasks with the help of a workflow system. Our design is based on a minimalistic approach while at the same time allowing to access computational resources offered through the owner's computer, HPC computing centers, cloud resources, and distributed systems in general. The access to… ▽ More In this paper, we summarize our effort to create and utilize a simple framework to coordinate computational analytics tasks with the help of a workflow system. Our design is based on a minimalistic approach while at the same time allowing to access computational resources offered through the owner's computer, HPC computing centers, cloud resources, and distributed systems in general. The access to this framework includes a simple GUI for monitoring and managing the workflow, a REST service, a command line interface, as well as a Python interface. The resulting framework was developed for several examples targeting benchmarks of AI applications on hybrid compute resources and as an educational tool for teaching scientists and students sophisticated concepts to execute computations on resources ranging from a single computer to many thousands of computers as part of on-premise and cloud infrastructure. We demonstrate the usefulness of the tool on a number of examples. The code is available as an open-source project in GitHub and is based on an easy-to-enhance tool called cloudmesh. △ Less

Submitted 30 October, 2022; originally announced October 2022.

Comments: 12 pages, 3 apendies, 23 Figures, 4 Tables

arXiv:2210.08973 [pdf, ps, other]

doi 10.1038/s41597-023-02298-6

FAIR for AI: An interdisciplinary and international community building perspective

Authors: E. A. Huerta, Ben Blaiszik, L. Catherine Brinson, Kristofer E. Bouchard, Daniel Diaz, Caterina Doglioni, Javier M. Duarte, Murali Emani, Ian Foster, Geoffrey Fox, Philip Harris, Lukas Heinrich, Shantenu Jha, Daniel S. Katz, Volodymyr Kindratenko, Christine R. Kirkpatrick, Kati Lassila-Perini, Ravi K. Madduri, Mark S. Neubauer, Fotis E. Psomopoulos, Avik Roy, Oliver Rübel, Zhizhen Zhao, Ruike Zhu

Abstract: A foundational set of findable, accessible, interoperable, and reusable (FAIR) principles were proposed in 2016 as prerequisites for proper data management and stewardship, with the goal of enabling the reusability of scholarly data. The principles were also meant to apply to other digital assets, at a high level, and over time, the FAIR guiding principles have been re-interpreted or extended to i… ▽ More A foundational set of findable, accessible, interoperable, and reusable (FAIR) principles were proposed in 2016 as prerequisites for proper data management and stewardship, with the goal of enabling the reusability of scholarly data. The principles were also meant to apply to other digital assets, at a high level, and over time, the FAIR guiding principles have been re-interpreted or extended to include the software, tools, algorithms, and workflows that produce data. FAIR principles are now being adapted in the context of AI models and datasets. Here, we present the perspectives, vision, and experiences of researchers from different countries, disciplines, and backgrounds who are leading the definition and adoption of FAIR principles in their communities of practice, and discuss outcomes that may result from pursuing and incentivizing FAIR AI research. The material for this report builds on the FAIR for AI Workshop held at Argonne National Laboratory on June 7, 2022. △ Less

Submitted 1 August, 2023; v1 submitted 30 September, 2022; originally announced October 2022.

Comments: 10 pages, comments welcome!; v2: 12 pages, accepted to Scientific Data

ACM Class: I.2.0; E.0

Journal ref: Scientific Data 10, 487 (2023)

arXiv:2209.06146 [pdf, other]

doi 10.1007/978-3-031-30442-2_22

High Performance Dataframes from Parallel Processing Patterns

Authors: Niranda Perera, Supun Kamburugamuve, Chathura Widanage, Vibhatha Abeykoon, Ahmet Uyar, Kaiying Shan, Hasara Maithree, Damitha Lenadora, Thejaka Amila Kanewala, Geoffrey Fox

Abstract: The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, massive operator coverage, and popularization of R and Python languages have heavily influenced this transformation. However, most widely used serial Dataframes today (R, pandas) experience performance limitations even while working on even moderat… ▽ More The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, massive operator coverage, and popularization of R and Python languages have heavily influenced this transformation. However, most widely used serial Dataframes today (R, pandas) experience performance limitations even while working on even moderately large data sets. We believe that there is plenty of room for improvement by investigating the generic distributed patterns of dataframe operators. In this paper, we propose a framework that lays the foundation for building high performance distributed-memory parallel dataframe systems based on these parallel processing patterns. We also present Cylon, as a reference runtime implementation. We demonstrate how this framework has enabled Cylon achieving scalable high performance. We also underline the flexibility of the proposed API and the extensibility of the framework on different hardware. To the best of our knowledge, Cylon is the first and only distributed-memory parallel dataframe system available today. △ Less

Submitted 13 September, 2022; originally announced September 2022.

Comments: Will be presented in PPAM 2022

arXiv:2201.06717 [pdf, other]

GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for Nowcasting Extreme Events

Authors: Bo Feng, Geoffrey Fox

Abstract: Spatiotemporal time series nowcasting should preserve temporal and spatial dynamics in the sense that generated new sequences from models respect the covariance relationship from history. Conventional feature extractors are built with deep convolutional neural networks (CNN). However, CNN models have limits to image-like applications where data can be formed with high-dimensional arrays. In contra… ▽ More Spatiotemporal time series nowcasting should preserve temporal and spatial dynamics in the sense that generated new sequences from models respect the covariance relationship from history. Conventional feature extractors are built with deep convolutional neural networks (CNN). However, CNN models have limits to image-like applications where data can be formed with high-dimensional arrays. In contrast, applications in social networks, road traffic, physics, and chemical property prediction where data features can be organized with nodes and edges of graphs. Transformer architecture is an emerging method for predictive models, bringing high accuracy and efficiency due to attention mechanism design. This paper proposes a spatiotemporal model, namely GTrans, that transforms data features into graph embeddings and predicts temporal dynamics with a transformer model. According to our experiments, we demonstrate that GTrans can model spatial and temporal dynamics and nowcasts extreme events for datasets. Furthermore, in all the experiments, GTrans can achieve the highest F1 and F2 scores in binary-class prediction tests than the baseline models. △ Less

Submitted 17 January, 2022; originally announced January 2022.

arXiv:2201.01869 [pdf]

Earthquake Nowcasting with Deep Learning

Authors: Geoffrey Fox, John Rundle, Andrea Donnellan, Bo Feng

Abstract: We review previous approaches to nowcasting earthquakes and introduce new approaches based on deep learning using three distinct models based on recurrent neural networks and transformers. We discuss different choices for observables and measures presenting promising initial results for a region of Southern California from 1950-2020. Earthquake activity is predicted as a function of 0.1-degree spa… ▽ More We review previous approaches to nowcasting earthquakes and introduce new approaches based on deep learning using three distinct models based on recurrent neural networks and transformers. We discuss different choices for observables and measures presenting promising initial results for a region of Southern California from 1950-2020. Earthquake activity is predicted as a function of 0.1-degree spatial bins for time periods varying from two weeks to four years. The overall quality is measured by the Nash Sutcliffe Efficiency comparing the deviation of nowcast and observation with the variance over time in each spatial region. The software is available as open-source together with the preprocessed data from the USGS. △ Less

Submitted 18 December, 2021; originally announced January 2022.

arXiv:2110.12773 [pdf]

Scientific Machine Learning Benchmarks

Authors: Jeyan Thiyagalingam, Mallikarjun Shankar, Geoffrey Fox, Tony Hey

Abstract: The breakthrough in Deep Learning neural networks has transformed the use of AI and machine learning technologies for the analysis of very large experimental datasets. These datasets are typically generated by large-scale experimental facilities at national laboratories. In the context of science, scientific machine learning focuses on training machines to identify patterns, trends, and anomalies… ▽ More The breakthrough in Deep Learning neural networks has transformed the use of AI and machine learning technologies for the analysis of very large experimental datasets. These datasets are typically generated by large-scale experimental facilities at national laboratories. In the context of science, scientific machine learning focuses on training machines to identify patterns, trends, and anomalies to extract meaningful scientific insights from such datasets. With a new generation of experimental facilities, the rate of data generation and the scale of data volumes will increasingly require the use of more automated data analysis. At present, identifying the most appropriate machine learning algorithm for the analysis of any given scientific dataset is still a challenge for scientists. This is due to many different machine learning frameworks, computer architectures, and machine learning models. Historically, for modelling and simulation on HPC systems such problems have been addressed through benchmarking computer applications, algorithms, and architectures. Extending such a benchmarking approach and identifying metrics for the application of machine learning methods to scientific datasets is a new challenge for both scientists and computer scientists. In this paper, we describe our approach to the development of scientific machine learning benchmarks and review other approaches to benchmarking scientific machine learning. △ Less

Submitted 25 October, 2021; originally announced October 2021.

ACM Class: I.2

arXiv:2110.11466 [pdf, other]

MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

Authors: Steven Farrell, Murali Emani, Jacob Balma, Lukas Drescher, Aleksandr Drozd, Andreas Fink, Geoffrey Fox, David Kanter, Thorsten Kurth, Peter Mattson, Dawei Mu, Amit Ruhela, Kento Sato, Koichi Shirahata, Tsuguchika Tabaru, Aristeidis Tsaris, Jan Balewski, Ben Cumming, Takumi Danjo, Jens Domke, Takaaki Fukai, Naoto Fukumoto, Tatsuya Fukushi, Balazs Gerofi, Takumi Honda , et al. (18 additional authors not shown)

Abstract: Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning appli… ▽ More Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons Association. We present the results from the first submission round, including a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization, and communication scheduling, enabling overall $>10 \times$ (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy, and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior to parameterize extended roofline performance models in future rounds. △ Less

Submitted 26 October, 2021; v1 submitted 21 October, 2021; originally announced October 2021.

arXiv:2108.06001 [pdf, other]

HPTMT Parallel Operators for High Performance Data Science & Data Engineering

Authors: Vibhatha Abeykoon, Supun Kamburugamuve, Chathura Widanage, Niranda Perera, Ahmet Uyar, Thejaka Amila Kanewala, Gregor von Laszewski, Geoffrey Fox

Abstract: Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient data abstractions and operators that suit the applications of different domains. Often lack of a clear definition of data structures and operators in the field ha… ▽ More Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient data abstractions and operators that suit the applications of different domains. Often lack of a clear definition of data structures and operators in the field has led to other implementations that do not work well together. The HPTMT architecture that we proposed recently, identifies a set of data structures, operators, and an execution model for creating rich data applications that links all aspects of data engineering and data science together efficiently. This paper elaborates and illustrates this architecture using an end-to-end application with deep learning and data engineering parts working together. △ Less

Submitted 12 August, 2021; originally announced August 2021.

arXiv:2107.12807 [pdf, other]

HPTMT: Operator-Based Architecture for Scalable High-Performance Data-Intensive Frameworks

Authors: Supun Kamburugamuve, Chathura Widanage, Niranda Perera, Vibhatha Abeykoon, Ahmet Uyar, Thejaka Amila Kanewala, Gregor von Laszewski, Geoffrey Fox

Abstract: Data-intensive applications impact many domains, and their steadily increasing size and complexity demands high-performance, highly usable environments. We integrate a set of ideas developed in various data science and data engineering frameworks. They employ a set of operators on specific data abstractions that include vectors, matrices, tensors, graphs, and tables. Our key concepts are inspired… ▽ More Data-intensive applications impact many domains, and their steadily increasing size and complexity demands high-performance, highly usable environments. We integrate a set of ideas developed in various data science and data engineering frameworks. They employ a set of operators on specific data abstractions that include vectors, matrices, tensors, graphs, and tables. Our key concepts are inspired from systems like MPI, HPF (High-Performance Fortran), NumPy, Pandas, Spark, Modin, PyTorch, TensorFlow, RAPIDS(NVIDIA), and OneAPI (Intel). Further, it is crucial to support different languages in everyday use in the Big Data arena, including Python, R, C++, and Java. We note the importance of Apache Arrow and Parquet for enabling language agnostic high performance and interoperability. In this paper, we propose High-Performance Tensors, Matrices and Tables (HPTMT), an operator-based architecture for data-intensive applications, and identify the fundamental principles needed for performance and usability success. We illustrate these principles by a discussion of examples using our software environments, Cylon and Twister2 that embody HPTMT. △ Less

Submitted 29 July, 2021; v1 submitted 27 July, 2021; originally announced July 2021.

arXiv:2107.07224 [pdf, other]

StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN

Authors: Gereon Fox, Ayush Tewari, Mohamed Elgharib, Christian Theobalt

Abstract: Generative adversarial models (GANs) continue to produce advances in terms of the visual quality of still images, as well as the learning of temporal correlations. However, few works manage to combine these two interesting capabilities for the synthesis of video content: Most methods require an extensive training dataset to learn temporal correlations, while being rather limited in the resolution… ▽ More Generative adversarial models (GANs) continue to produce advances in terms of the visual quality of still images, as well as the learning of temporal correlations. However, few works manage to combine these two interesting capabilities for the synthesis of video content: Most methods require an extensive training dataset to learn temporal correlations, while being rather limited in the resolution and visual quality of their output. We present a novel approach to the video synthesis problem that helps to greatly improve visual quality and drastically reduce the amount of training data and resources necessary for generating videos. Our formulation separates the spatial domain, in which individual frames are synthesized, from the temporal domain, in which motion is generated. For the spatial domain we use a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for. The expressive power of this model allows us to embed our training videos in the StyleGAN latent space. Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes. The advantageous properties of the StyleGAN space simplify the discovery of temporal correlations. We demonstrate that it suffices to train our temporal architecture on only 10 minutes of footage of 1 subject for about 6 hours. After training, our model can not only generate new portrait videos for the training subject, but also for any random subject which can be embedded in the StyleGAN space. △ Less

Submitted 30 November, 2021; v1 submitted 15 July, 2021; originally announced July 2021.

Comments: Final draft

arXiv:2104.09014 [pdf]

Multidimensional Scaling for Gene Sequence Data with Autoencoders

Authors: Pulasthi Wickramasinghe, Geoffrey Fox

Abstract: Multidimensional scaling of gene sequence data has long played a vital role in analysing gene sequence data to identify clusters and patterns. However the computation complexities and memory requirements of state-of-the-art dimensional scaling algorithms make it infeasible to scale to large datasets. In this paper we present an autoencoder-based dimensional reduction model which can easily scale t… ▽ More Multidimensional scaling of gene sequence data has long played a vital role in analysing gene sequence data to identify clusters and patterns. However the computation complexities and memory requirements of state-of-the-art dimensional scaling algorithms make it infeasible to scale to large datasets. In this paper we present an autoencoder-based dimensional reduction model which can easily scale to datasets containing millions of gene sequences, while attaining results comparable to state-of-the-art MDS algorithms with minimal resource requirements. The model also supports out-of-sample data points with a 99.5%+ accuracy based on our experiments. The proposed model is evaluated against DAMDS with a real world fungi gene sequence dataset. The presented results showcase the effectiveness of the autoencoder-based dimension reduction model and its advantages. △ Less

Submitted 18 April, 2021; originally announced April 2021.

arXiv:2102.05527 [pdf, other]

Deep learning approaches to surrogates for solving the diffusion equation for mechanistic real-world simulations

Authors: J. Quetzalcóatl Toledo-Marín, Geoffrey Fox, James P. Sluka, James A. Glazier

Abstract: In many mechanistic medical, biological, physical and engineered spatiotemporal dynamic models the numerical solution of partial differential equations (PDEs) can make simulations impractically slow. Biological models require the simultaneous calculation of the spatial variation of concentration of dozens of diffusing chemical species. Machine learning surrogates, neural networks trained to provid… ▽ More In many mechanistic medical, biological, physical and engineered spatiotemporal dynamic models the numerical solution of partial differential equations (PDEs) can make simulations impractically slow. Biological models require the simultaneous calculation of the spatial variation of concentration of dozens of diffusing chemical species. Machine learning surrogates, neural networks trained to provide approximate solutions to such complicated numerical problems, can often provide speed-ups of several orders of magnitude compared to direct calculation. PDE surrogates enable use of larger models than are possible with direct calculation and can make including such simulations in real-time or near-real time workflows practical. Creating a surrogate requires running the direct calculation tens of thousands of times to generate training data and then training the neural network, both of which are computationally expensive. We use a Convolutional Neural Network to approximate the stationary solution to the diffusion equation in the case of two equal-diameter, circular, constant-value sources located at random positions in a two-dimensional square domain with absorbing boundary conditions. To improve convergence during training, we apply a training approach that uses roll-back to reject stochastic changes to the network that increase the loss function. The trained neural network approximation is about 1e3 times faster than the direct calculation for individual replicas. Because different applications will have different criteria for acceptable approximation accuracy, we discuss a variety of loss functions and accuracy estimators that can help select the best network for a particular application. △ Less

Submitted 10 February, 2021; originally announced February 2021.

Comments: 17 pp, 2 tables, 11 figs, 1sm, 12sm-figs, code available at GitHub

arXiv:2012.14336 [pdf, other]

Spatiotemporal Pattern Mining for Nowcasting Extreme Earthquakes in Southern California

Authors: Bo Feng, Geoffrey C. Fox

Abstract: Geoscience and seismology have utilized the most advanced technologies and equipment to monitor seismic events globally from the past few decades. With the enormous amount of data, modern GPU-powered deep learning presents a promising approach to analyze data and discover patterns. In recent years, there are plenty of successful deep learning models for picking seismic waves. However, forecasting… ▽ More Geoscience and seismology have utilized the most advanced technologies and equipment to monitor seismic events globally from the past few decades. With the enormous amount of data, modern GPU-powered deep learning presents a promising approach to analyze data and discover patterns. In recent years, there are plenty of successful deep learning models for picking seismic waves. However, forecasting extreme earthquakes, which can cause disasters, is still an underdeveloped topic in history. Relevant research in spatiotemporal dynamics mining and forecasting has revealed some successful predictions, a crucial topic in many scientific research fields. Most studies of them have many successful applications of using deep neural networks. In Geology and Earth science studies, earthquake prediction is one of the world's most challenging problems, about which cutting-edge deep learning technologies may help discover some valuable patterns. In this project, we propose a deep learning modeling approach, namely \tseqpre, to mine spatiotemporal patterns from data to nowcast extreme earthquakes by discovering visual dynamics in regional coarse-grained spatial grids over time. In this modeling approach, we use synthetic deep learning neural networks with domain knowledge in geoscience and seismology to exploit earthquake patterns for prediction using convolutional long short-term memory neural networks. Our experiments show a strong correlation between location prediction and magnitude prediction for earthquakes in Southern California. Ablation studies and visualization validate the effectiveness of the proposed modeling method. △ Less

Submitted 11 September, 2021; v1 submitted 20 December, 2020; originally announced December 2020.

Journal ref: IEEE eScience 2021

arXiv:2010.14596 [pdf, other]

A Fast, Scalable, Universal Approach For Distributed Data Aggregations

Authors: Niranda Perera, Vibhatha Abeykoon, Chathura Widanage, Supun Kamburugamuve, Thejaka Amila Kanewala, Pulasthi Wickramasinghe, Ahmet Uyar, Hasara Maithree, Damitha Lenadora, Geoffrey Fox

Abstract: In the current era of Big Data, data engineering has transformed into an essential field of study across many branches of science. Advancements in Artificial Intelligence (AI) have broadened the scope of data engineering and opened up new applications in both enterprise and research communities. Aggregations (also termed reduce in functional programming) are an integral functionality in these appl… ▽ More In the current era of Big Data, data engineering has transformed into an essential field of study across many branches of science. Advancements in Artificial Intelligence (AI) have broadened the scope of data engineering and opened up new applications in both enterprise and research communities. Aggregations (also termed reduce in functional programming) are an integral functionality in these applications. They are traditionally aimed at generating meaningful information on large data-sets, and today, they are being used for engineering more effective features for complex AI models. Aggregations are usually carried out on top of data abstractions such as tables/ arrays and are combined with other operations such as grou** of values. There are frameworks that excel in the said domains individually. But, we believe that there is an essential requirement for a data analytics tool that can universally integrate with existing frameworks, and thereby increase the productivity and efficiency of the entire data analytics pipeline. Cylon endeavors to fulfill this void. In this paper, we present Cylon's fast and scalable aggregation operations implemented on top of a distributed in-memory table structure that universally integrates with existing frameworks. △ Less

Submitted 14 December, 2020; v1 submitted 27 October, 2020; originally announced October 2020.

arXiv:2010.11796 [pdf, other]

CryptoGRU: Low Latency Privacy-Preserving Text Analysis With GRU

Authors: Bo Feng, Qian Lou, Lei Jiang, Geoffrey C. Fox

Abstract: Billions of text analysis requests containing private emails, personal text messages, and sensitive online reviews, are processed by recurrent neural networks (RNNs) deployed on public clouds every day. Although prior secure networks combine homomorphic encryption (HE) and garbled circuit (GC) to preserve users' privacy, naively adopting the HE and GC hybrid technique to implement RNNs suffers fro… ▽ More Billions of text analysis requests containing private emails, personal text messages, and sensitive online reviews, are processed by recurrent neural networks (RNNs) deployed on public clouds every day. Although prior secure networks combine homomorphic encryption (HE) and garbled circuit (GC) to preserve users' privacy, naively adopting the HE and GC hybrid technique to implement RNNs suffers from long inference latency due to slow activation functions. In this paper, we present a HE and GC hybrid gated recurrent unit (GRU) network, CryptoGRU, for low-latency secure inferences. CryptoGRU replaces computationally expensive GC-based $tanh$ with fast GC-based $ReLU$, and then quantizes $sigmoid$ and $ReLU$ with a smaller bit length to accelerate activations in a GRU. We evaluate CryptoGRU with multiple GRU models trained on 4 public datasets. Experimental results show CryptoGRU achieves top-notch accuracy and improves the secure inference latency by up to $138\times$ over one of state-of-the-art secure networks on the Penn Treebank dataset. △ Less

Submitted 9 September, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

Journal ref: The 2021 Conference on Empirical Methods in Natural Language Processing

arXiv:2010.06312 [pdf, other]

Data Engineering for HPC with Python

Authors: Vibhatha Abeykoon, Niranda Perera, Chathura Widanage, Supun Kamburugamuve, Thejaka Amila Kanewala, Hasara Maithree, Pulasthi Wickramasinghe, Ahmet Uyar, Geoffrey Fox

Abstract: Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine l… ▽ More Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications. There are many structures such as tables, graphs, and trees to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings. In the core system, we use MPI for distributed memory computations with a data-parallel approach for processing large datasets in HPC clusters. △ Less

Submitted 13 October, 2020; originally announced October 2020.

Comments: 9 pages, 11 images, Accepted in 9th Workshop on Python for High-Performance and Scientific Computing (In conjunction with Supercomputing 20)

arXiv:2010.03757 [pdf, other]

AICov: An Integrative Deep Learning Framework for COVID-19 Forecasting with Population Covariates

Authors: Geoffrey C. Fox, Gregor von Laszewski, Fugang Wang, Saumyadipta Pyne

Abstract: The COVID-19 pandemic has profound global consequences on health, economic, social, political, and almost every major aspect of human life. Therefore, it is of great importance to model COVID-19 and other pandemics in terms of the broader social contexts in which they take place. We present the architecture of AICov, which provides an integrative deep learning framework for COVID-19 forecasting wi… ▽ More The COVID-19 pandemic has profound global consequences on health, economic, social, political, and almost every major aspect of human life. Therefore, it is of great importance to model COVID-19 and other pandemics in terms of the broader social contexts in which they take place. We present the architecture of AICov, which provides an integrative deep learning framework for COVID-19 forecasting with population covariates, some of which may serve as putative risk factors. We have integrated multiple different strategies into AICov, including the ability to use deep learning strategies based on LSTM and even modeling. To demonstrate our approach, we have conducted a pilot that integrates population covariates from multiple sources. Thus, AICov not only includes data on COVID-19 cases and deaths but, more importantly, the population's socioeconomic, health and behavioral risk factors at a local level. The compiled data are fed into AICov, and thus we obtain improved prediction by integration of the data to our model as compared to one that only uses case and death data. △ Less

Submitted 8 October, 2020; originally announced October 2020.

Comments: 25 pages, 4 tabkes, 19 figures

arXiv:2010.03712 [pdf, other]

Deep Tiered Image Segmentation For Detecting Internal Ice Layers in Radar Imagery

Authors: Yuchen Wang, Mingze Xu, John Paden, Lora Koenig, Geoffrey Fox, David Crandall

Abstract: Understanding the structure of Earth's polar ice sheets is important for modeling how global warming will impact polar ice and, in turn, the Earth's climate. Ground-penetrating radar is able to collect observations of the internal structure of snow and ice, but the process of manually labeling these observations is slow and laborious. Recent work has developed automatic techniques for finding the… ▽ More Understanding the structure of Earth's polar ice sheets is important for modeling how global warming will impact polar ice and, in turn, the Earth's climate. Ground-penetrating radar is able to collect observations of the internal structure of snow and ice, but the process of manually labeling these observations is slow and laborious. Recent work has developed automatic techniques for finding the boundaries between the ice and the bedrock, but finding internal layers - the subtle boundaries that indicate where one year's ice accumulation ended and the next began - is much more challenging because the number of layers varies and the boundaries often merge and split. In this paper, we propose a novel deep neural network for solving a general class of tiered segmentation problems. We then apply it to detecting internal layers in polar ice, evaluating on a large-scale dataset of polar ice radar data with human-labeled annotations as ground truth. △ Less

Submitted 6 April, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: ICME version

arXiv:2007.09589 [pdf, other]

High Performance Data Engineering Everywhere

Authors: Chathura Widanage, Niranda Perera, Vibhatha Abeykoon, Supun Kamburugamuve, Thejaka Amila Kanewala, Hasara Maithree, Pulasthi Wickramasinghe, Ahmet Uyar, Gurhan Gunduz, Geoffrey Fox

Abstract: The amazing advances being made in the fields of machine and deep learning are a highlight of the Big Data era for both enterprise and research communities. Modern applications require resources beyond a single node's ability to provide. However this is just a small part of the issues facing the overall data processing environment, which must also support a raft of data engineering for pre- and po… ▽ More The amazing advances being made in the fields of machine and deep learning are a highlight of the Big Data era for both enterprise and research communities. Modern applications require resources beyond a single node's ability to provide. However this is just a small part of the issues facing the overall data processing environment, which must also support a raft of data engineering for pre- and post-data processing, communication, and system integration. An important requirement of data analytics tools is to be able to easily integrate with existing frameworks in a multitude of languages, thereby increasing user productivity and efficiency. All this demands an efficient and highly distributed integrated approach for data processing, yet many of today's popular data analytics tools are unable to satisfy all these requirements at the same time. In this paper we present Cylon, an open-source high performance distributed data processing library that can be seamlessly integrated with existing Big Data and AI/ML frameworks. It is developed with a flexible C++ core on top of a compact data structure and exposes language bindings to C++, Java, and Python. We discuss Cylon's architecture in detail, and reveal how it can be imported as a library to existing applications or operate as a standalone framework. Initial experiments show that Cylon enhances popular tools such as Apache Spark and Dask with major performance improvements for key operations and better component linkages. Finally, we show how its design enables Cylon to be used cross-platform with minimum overhead, which includes popular AI tools such as PyTorch, Tensorflow, and Jupyter notebooks. △ Less

Submitted 19 July, 2020; originally announced July 2020.

arXiv:2005.10360 [pdf, other]

VideoForensicsHQ: Detecting High-quality Manipulated Face Videos

Authors: Gereon Fox, Wentao Liu, Hyeongwoo Kim, Hans-Peter Seidel, Mohamed Elgharib, Christian Theobalt

Abstract: There are concerns that new approaches to the synthesis of high quality face videos may be misused to manipulate videos with malicious intent. The research community therefore developed methods for the detection of modified footage and assembled benchmark datasets for this task. In this paper, we examine how the performance of forgery detectors depends on the presence of artefacts that the human e… ▽ More There are concerns that new approaches to the synthesis of high quality face videos may be misused to manipulate videos with malicious intent. The research community therefore developed methods for the detection of modified footage and assembled benchmark datasets for this task. In this paper, we examine how the performance of forgery detectors depends on the presence of artefacts that the human eye can see. We introduce a new benchmark dataset for face video forgery detection, of unprecedented quality. It allows us to demonstrate that existing detection techniques have difficulties detecting fakes that reliably fool the human eye. We thus introduce a new family of detectors that examine combinations of spatial and temporal features and outperform existing approaches both in terms of detection accuracy and generalization. △ Less

Submitted 2 June, 2021; v1 submitted 20 May, 2020; originally announced May 2020.

Comments: ICME 2021 camera-ready

arXiv:2004.06493 [pdf, other]

Solving Newton's Equations of Motion with Large Timesteps using Recurrent Neural Networks based Operators

Authors: JCS Kadupitiya, Geoffrey C. Fox, Vikram Jadhao

Abstract: Classical molecular dynamics simulations are based on solving Newton's equations of motion. Using a small timestep, numerical integrators such as Verlet generate trajectories of particles as solutions to Newton's equations. We introduce operators derived using recurrent neural networks that accurately solve Newton's equations utilizing sequences of past trajectory data, and produce energy-conservi… ▽ More Classical molecular dynamics simulations are based on solving Newton's equations of motion. Using a small timestep, numerical integrators such as Verlet generate trajectories of particles as solutions to Newton's equations. We introduce operators derived using recurrent neural networks that accurately solve Newton's equations utilizing sequences of past trajectory data, and produce energy-conserving dynamics of particles using timesteps up to 4000 times larger compared to the Verlet timestep. We demonstrate significant speedup in many example problems including 3D systems of up to 16 particles. △ Less

Submitted 13 December, 2021; v1 submitted 12 April, 2020; originally announced April 2020.

Comments: 15 pages, 12 figures; updated content

arXiv:1911.07101 [pdf, other]

Glyph: Fast and Accurately Training Deep Neural Networks on Encrypted Data

Authors: Qian Lou, Bo Feng, Geoffrey C. Fox, Lei Jiang

Abstract: Big data is one of the cornerstones to enabling and training deep neural networks (DNNs). Because of the lack of expertise, to gain benefits from their data, average users have to rely on and upload their private data to big data companies they may not trust. Due to the compliance, legal, or privacy constraints, most users are willing to contribute only their encrypted data, and lack interests or… ▽ More Big data is one of the cornerstones to enabling and training deep neural networks (DNNs). Because of the lack of expertise, to gain benefits from their data, average users have to rely on and upload their private data to big data companies they may not trust. Due to the compliance, legal, or privacy constraints, most users are willing to contribute only their encrypted data, and lack interests or resources to join the training of DNNs in cloud. To train a DNN on encrypted data in a completely non-interactive way, a recent work proposes a fully homomorphic encryption (FHE)-based technique implementing all activations in the neural network by \textit{Brakerski-Gentry-Vaikuntanathan (BGV)}-based lookup tables. However, such inefficient lookup-table-based activations significantly prolong the training latency of privacy-preserving DNNs. In this paper, we propose, Glyph, a FHE-based scheme to fast and accurately train DNNs on encrypted data by switching between TFHE (Fast Fully Homomorphic Encryption over the Torus) and BGV cryptosystems. Glyph uses logic-operation-friendly TFHE to implement nonlinear activations, while adopts vectorial-arithmetic-friendly BGV to perform multiply-accumulation (MAC) operations. Glyph further applies transfer learning on the training of DNNs to improve the test accuracy and reduce the number of MAC operations between ciphertext and ciphertext in convolutional layers. Our experimental results show Glyph obtains the state-of-the-art test accuracy, but reduces the training latency by $99\%$ over the prior FHE-based technique on various encrypted datasets. △ Less

Submitted 21 October, 2020; v1 submitted 16 November, 2019; originally announced November 2019.

Comments: 10 pages, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada

arXiv:1911.05878 [pdf, other]

Scientific Image Restoration Anywhere

Authors: Vibhatha Abeykoon, Zhengchun Liu, Rajkumar Kettimuthu, Geoffrey Fox, Ian Foster

Abstract: The use of deep learning models within scientific experimental facilities frequently requires low-latency inference, so that, for example, quality control operations can be performed while data are being collected. Edge computing devices can be useful in this context, as their low cost and compact form factor permit them to be co-located with the experimental apparatus. Can such devices, with thei… ▽ More The use of deep learning models within scientific experimental facilities frequently requires low-latency inference, so that, for example, quality control operations can be performed while data are being collected. Edge computing devices can be useful in this context, as their low cost and compact form factor permit them to be co-located with the experimental apparatus. Can such devices, with their limited resources, can perform neural network feed-forward computations efficiently and effectively? We explore this question by evaluating the performance and accuracy of a scientific image restoration model, for which both model input and output are images, on edge computing devices. Specifically, we evaluate deployments of TomoGAN, an image-denoising model based on generative adversarial networks developed for low-dose x-ray imaging, on the Google Edge TPU and NVIDIA Jetson. We adapt TomoGAN for edge execution, evaluate model inference performance, and propose methods to address the accuracy drop caused by model quantization. We show that these edge computing devices can deliver accuracy comparable to that of a full-fledged CPU or GPU model, at speeds that are more than adequate for use in the intended deployments, denoising a 1024 x 1024 image in less than a second. Our experiments also show that the Edge TPU models can provide 3x faster inference response than a CPU-based model and 1.5x faster than an edge GPU-based model. This combination of high speed and low cost permits image restoration anywhere. △ Less

Submitted 12 November, 2019; originally announced November 2019.

Comments: 6 pages, 8 figures, 1 table

arXiv:1909.13340 [pdf]

Learning Everywhere: A Taxonomy for the Integration of Machine Learning and Simulations

Authors: Geoffrey Fox, Shantenu Jha

Abstract: We present a taxonomy of research on Machine Learning (ML) applied to enhance simulations together with a catalog of some activities. We cover eight patterns for the link of ML to the simulations or systems plus three algorithmic areas: particle dynamics, agent-based models and partial differential equations. The patterns are further divided into three action areas: Improving simulation with Confi… ▽ More We present a taxonomy of research on Machine Learning (ML) applied to enhance simulations together with a catalog of some activities. We cover eight patterns for the link of ML to the simulations or systems plus three algorithmic areas: particle dynamics, agent-based models and partial differential equations. The patterns are further divided into three action areas: Improving simulation with Configurations and Integration of Data, Learn Structure, Theory and Model for Simulation, and Learn to make Surrogates. △ Less

Submitted 13 October, 2019; v1 submitted 29 September, 2019; originally announced September 2019.

Comments: 15th International Conference eScience 2019, September 24-27, 2019, San Diego, California,

arXiv:1909.02363 [pdf, ps, other]

Understanding ML driven HPC: Applications and Infrastructure

Authors: Geoffrey Fox, Shantenu Jha

Abstract: We recently outlined the vision of "Learning Everywhere" which captures the possibility and impact of how learning methods and traditional HPC methods can be coupled together. A primary driver of such coupling is the promise that Machine Learning (ML) will give major performance improvements for traditional HPC simulations. Motivated by this potential, the ML around HPC class of integration is of… ▽ More We recently outlined the vision of "Learning Everywhere" which captures the possibility and impact of how learning methods and traditional HPC methods can be coupled together. A primary driver of such coupling is the promise that Machine Learning (ML) will give major performance improvements for traditional HPC simulations. Motivated by this potential, the ML around HPC class of integration is of particular significance. In a related follow-up paper, we provided an initial taxonomy for integrating learning around HPC methods. In this paper, which is part of the Learning Everywhere series, we discuss "how" learning methods and HPC simulations are being integrated to enhance effective performance of computations. This paper identifies several modes --- substitution, assimilation, and control, in which learning methods integrate with HPC simulations and provide representative applications in each mode. This paper discusses some open research questions and we hope will motivate and clear the ground for MLaroundHPC benchmarks. △ Less

Submitted 5 September, 2019; originally announced September 2019.

Comments: Invited talk to "Visionary Track" at IEEE eScience 2019. arXiv admin note: text overlap with arXiv:1806.04731 by other authors

arXiv:1907.00097 [pdf, other]

doi 10.1002/CPE.5789

Parallel Performance of Molecular Dynamics Trajectory Analysis

Authors: Mahzad Khoshlessan, Ioannis Paraskevakos, Geoffrey C. Fox, Shantenu Jha, Oliver Beckstein

Abstract: The performance of biomolecular molecular dynamics simulations has steadily increased on modern high performance computing resources but acceleration of the analysis of the output trajectories has lagged behind so that analyzing simulations is becoming a bottleneck. To close this gap, we studied the performance of parallel trajectory analysis with MPI and the Python MDAnalysis library on three dif… ▽ More The performance of biomolecular molecular dynamics simulations has steadily increased on modern high performance computing resources but acceleration of the analysis of the output trajectories has lagged behind so that analyzing simulations is becoming a bottleneck. To close this gap, we studied the performance of parallel trajectory analysis with MPI and the Python MDAnalysis library on three different XSEDE supercomputers where trajectories were read from a Lustre parallel file system. Strong scaling performance was impeded by stragglers, MPI processes that were slower than the typical process. Stragglers were less prevalent for compute-bound workloads, thus pointing to file reading as a bottleneck for scaling. However, a more complicated picture emerged in which both the computation and the data ingestion exhibited close to ideal strong scaling behavior whereas stragglers were primarily caused by either large MPI communication costs or long times to open the single shared trajectory file. We improved overall strong scaling performance by either subfiling (splitting the trajectory into separate files) or MPI-IO with Parallel HDF5 trajectory files. The parallel HDF5 approach resulted in near ideal strong scaling on up to 384 cores (16 nodes), thus reducing trajectory analysis times by two orders of magnitude compared to the serial approach. △ Less

Submitted 27 March, 2020; v1 submitted 28 June, 2019; originally announced July 2019.

Comments: accepted manuscript, to appear in 'Concurrency and Computation: Practice and Experience'

ACM Class: D.1.3; J.2

arXiv:1905.01219 [pdf, other]

Performance Optimization on Model Synchronization in Parallel Stochastic Gradient Descent Based SVM

Authors: Vibhatha Abeykoon, Geoffrey Fox, Minje Kim

Abstract: Understanding the bottlenecks in implementing stochastic gradient descent (SGD)-based distributed support vector machines (SVM) algorithm is important in training larger data sets. The communication time to do the model synchronization across the parallel processes is the main bottleneck that causes inefficiency in the training process. The model synchronization is directly affected by the mini-ba… ▽ More Understanding the bottlenecks in implementing stochastic gradient descent (SGD)-based distributed support vector machines (SVM) algorithm is important in training larger data sets. The communication time to do the model synchronization across the parallel processes is the main bottleneck that causes inefficiency in the training process. The model synchronization is directly affected by the mini-batch size of data processed before the global synchronization. In producing an efficient distributed model, the communication time in training model synchronization has to be as minimum as possible while retaining a high testing accuracy. The effect from model synchronization frequency over the convergence of the algorithm and accuracy of the generated model must be well understood to design an efficient distributed model. In this research, we identify the bottlenecks in model synchronization in parallel stochastic gradient descent (PSGD)-based SVM algorithm with respect to the training model synchronization frequency (MSF). Our research shows that by optimizing the MSF in the data sets that we used, a reduction of 98\% in communication time can be gained (16x - 24x speed up) with respect to high-frequency model synchronization. The training model optimization discussed in this paper guarantees a higher accuracy than the sequential algorithm along with faster convergence. △ Less

Submitted 3 May, 2019; originally announced May 2019.

Comments: Paper Accepted in HPML 2019 Held in conjunction with IEEE/ACM CCGRID 2019

arXiv:1902.10810 [pdf, ps, other]

Learning Everywhere: Pervasive Machine Learning for Effective High-Performance Computation

Authors: Geoffrey Fox, James A. Glazier, JCS Kadupitiya, Vikram Jadhao, Minje Kim, Judy Qiu, James P. Sluka, Endre Somogyi, Madhav Marathe, Abhi** Adiga, Jiangzhuo Chen, Oliver Beckstein, Shantenu Jha

Abstract: The convergence of HPC and data-intensive methodologies provide a promising approach to major performance improvements. This paper provides a general description of the interaction between traditional HPC and ML approaches and motivates the Learning Everywhere paradigm for HPC. We introduce the concept of effective performance that one can achieve by combining learning methodologies with simulatio… ▽ More The convergence of HPC and data-intensive methodologies provide a promising approach to major performance improvements. This paper provides a general description of the interaction between traditional HPC and ML approaches and motivates the Learning Everywhere paradigm for HPC. We introduce the concept of effective performance that one can achieve by combining learning methodologies with simulation-based approaches, and distinguish between traditional performance as measured by benchmark scores. To support the promise of integrating HPC and learning methods, this paper examines specific examples and opportunities across a series of domains. It concludes with a series of open computer science and cyberinfrastructure questions and challenges that the Learning Everywhere paradigm presents. △ Less

Submitted 27 February, 2019; originally announced February 2019.

arXiv:1801.07630 [pdf, other]

Task-parallel Analysis of Molecular Dynamics Trajectories

Authors: Ioannis Paraskevakos, Andre Luckow, Mahzad Khoshlessan, George Chantzialexiou, Thomas E. Cheatham, Oliver Beckstein, Geoffrey C. Fox, Shantenu Jha

Abstract: Different parallel frameworks for implementing data analysis applications have been proposed by the HPC and Big Data communities. In this paper, we investigate three task-parallel frameworks: Spark, Dask and RADICAL-Pilot with respect to their ability to support data analytics on HPC resources and compare them with MPI. We investigate the data analysis requirements of Molecular Dynamics (MD) simul… ▽ More Different parallel frameworks for implementing data analysis applications have been proposed by the HPC and Big Data communities. In this paper, we investigate three task-parallel frameworks: Spark, Dask and RADICAL-Pilot with respect to their ability to support data analytics on HPC resources and compare them with MPI. We investigate the data analysis requirements of Molecular Dynamics (MD) simulations which are significant consumers of supercomputing cycles, producing immense amounts of data. A typical large-scale MD simulation of a physical system of O(100k) atoms over μsecs can produce from O(10) GB to O(1000) GBs of data. We propose and evaluate different approaches for parallelization of a representative set of MD trajectory analysis algorithms, in particular the computation of path similarity and leaflet identification. We evaluate Spark, Dask and RADICAL-Pilot with respect to their abstractions and runtime engine capabilities to support these algorithms. We provide a conceptual basis for comparing and understanding different frameworks that enable users to select the optimal system for each application. We also provide a quantitative performance analysis of the different algorithms across the three frameworks. △ Less

Submitted 10 June, 2018; v1 submitted 23 January, 2018; originally announced January 2018.

arXiv:1801.03986 [pdf, other]

Multi-Task Spatiotemporal Neural Networks for Structured Surface Reconstruction

Authors: Mingze Xu, Chenyou Fan, John D Paden, Geoffrey C Fox, David J Crandall

Abstract: Deep learning methods have surpassed the performance of traditional techniques on a wide range of problems in computer vision, but nearly all of this work has studied consumer photos, where precisely correct output is often not critical. It is less clear how well these techniques may apply on structured prediction problems where fine-grained output with high precision is required, such as in scien… ▽ More Deep learning methods have surpassed the performance of traditional techniques on a wide range of problems in computer vision, but nearly all of this work has studied consumer photos, where precisely correct output is often not critical. It is less clear how well these techniques may apply on structured prediction problems where fine-grained output with high precision is required, such as in scientific imaging domains. Here we consider the problem of segmenting echogram radar data collected from the polar ice sheets, which is challenging because segmentation boundaries are often very weak and there is a high degree of noise. We propose a multi-task spatiotemporal neural network that combines 3D ConvNets and Recurrent Neural Networks (RNNs) to estimate ice surface boundaries from sequences of tomographic radar images. We show that our model outperforms the state-of-the-art on this problem by (1) avoiding the need for hand-tuned parameters, (2) extracting multiple surfaces (ice-air and ice-bed) simultaneously, (3) requiring less non-visual metadata, and (4) being about 6 times faster. △ Less

Submitted 20 July, 2018; v1 submitted 11 January, 2018; originally announced January 2018.

Comments: 10 pages, 7 figures, published in WACV 2018

arXiv:1712.07758 [pdf, ps, other]

Automatic Estimation of Ice Bottom Surfaces from Radar Imagery

Authors: Mingze Xu, David J Crandall, Geoffrey C Fox, John D Paden

Abstract: Ground-penetrating radar on planes and satellites now makes it practical to collect 3D observations of the subsurface structure of the polar ice sheets, providing crucial data for understanding and tracking global climate change. But converting these noisy readings into useful observations is generally done by hand, which is impractical at a continental scale. In this paper, we propose a computer… ▽ More Ground-penetrating radar on planes and satellites now makes it practical to collect 3D observations of the subsurface structure of the polar ice sheets, providing crucial data for understanding and tracking global climate change. But converting these noisy readings into useful observations is generally done by hand, which is impractical at a continental scale. In this paper, we propose a computer vision-based technique for extracting 3D ice-bottom surfaces by viewing the task as an inference problem on a probabilistic graphical model. We first generate a seed surface subject to a set of constraints, and then incorporate additional sources of evidence to refine it via discrete energy minimization. We evaluate the performance of the tracking algorithm on 7 topographic sequences (each with over 3000 radar images) collected from the Canadian Arctic Archipelago with respect to human-labeled ground truth. △ Less

Submitted 20 December, 2017; originally announced December 2017.

Comments: 5 pages, 3 figures, published in ICIP 2017

arXiv:1708.08028 [pdf]

doi 10.13140/RG.2.2.15007.87206

Status of Serverless Computing and Function-as-a-Service(FaaS) in Industry and Research

Authors: Geoffrey C. Fox, Vatche Ishakian, Vinod Muthusamy, Aleksander Slominski

Abstract: This whitepaper summarizes issues raised during the First International Workshop on Serverless Computing (WoSC) 2017 held June 5th 2017 and especially in the panel and associated discussion that concluded the workshop. We also include comments from the keynote and submitted papers. A glossary at the end (section 8) defines many technical terms used in this report. This whitepaper summarizes issues raised during the First International Workshop on Serverless Computing (WoSC) 2017 held June 5th 2017 and especially in the panel and associated discussion that concluded the workshop. We also include comments from the keynote and submitted papers. A glossary at the end (section 8) defines many technical terms used in this report. △ Less

Submitted 26 August, 2017; originally announced August 2017.

Comments: Technical Report

arXiv:1403.1528 [pdf, other]

A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures

Authors: Shantenu Jha, Judy Qiu, Andre Luckow, Pradeep Mantha, Geoffrey C. Fox

Abstract: Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-perform… ▽ More Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigm. We propose a basis, common terminology and functional factors upon which to analyze the two approaches of both paradigms. We discuss the concept of "Big Data Ogres" and their facets as means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementation/approaches of these paradigms, shed light upon the reasons for their current "architecture" and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations, across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms, to a semi-quantitative methodology. We use a simple and broadly used Ogre (K-means clustering), characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide an insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions. △ Less

Submitted 22 June, 2014; v1 submitted 6 March, 2014; originally announced March 2014.

Comments: 8 pages, 2 figures

Showing 1–47 of 47 results for author: Fox, G