-
Efficient Data-Parallel Continual Learning with Asynchronous Distributed Rehearsal Buffers
Authors:
Thomas Bouvier,
Bogdan Nicolae,
Hugo Chaugier,
Alexandru Costan,
Ian Foster,
Gabriel Antoniu
Abstract:
Deep learning has emerged as a powerful method for extracting valuable information from large volumes of data. However, when new training data arrives continuously (i.e., is not fully available from the beginning), incremental training suffers from catastrophic forgetting (i.e., new patterns are reinforced at the expense of previously acquired knowledge). Training from scratch each time new training data becomes available would result in extremely long training times and massive data accumulation. Rehearsal-based continual learning has shown promise for addressing the catastrophic forgetting challenge, but research to date has not addressed performance and scalability. To fill this gap, we propose an approach based on a distributed rehearsal buffer that efficiently complements data-parallel training on multiple GPUs, allowing us to achieve short runtime and scalability while retaining high accuracy. It leverages a set of buffers (local to each GPU) and uses several asynchronous techniques for updating these local buffers in an embarrassingly parallel fashion, all while handling the communication overheads necessary to augment input mini-batches (groups of training samples fed to the model) using unbiased, global sampling. In this paper we explore the benefits of this approach for classification models. We run extensive experiments on up to 128 GPUs of the ThetaGPU supercomputer to compare our approach with baselines representative of training-from-scratch (the upper bound in terms of accuracy) and incremental training (the lower bound). Results show that rehearsal-based continual learning achieves a top-5 classification accuracy close to the upper bound, while simultaneously exhibiting a runtime close to the lower bound.
Submitted 5 June, 2024;
originally announced June 2024.
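The abstract above describes per-GPU rehearsal buffers and mini-batches augmented with samples drawn globally across those buffers. What follows is a minimal, single-process sketch of that idea and not the authors' implementation: the names `LocalRehearsalBuffer` and `augment_minibatch` are hypothetical, reservoir sampling is an assumed update policy, and the asynchronous updates and inter-GPU communication mentioned in the abstract are omitted.

```python
import random


class LocalRehearsalBuffer:
    """Fixed-capacity buffer local to one worker (hypothetical name).

    Assumption: the buffer is filled by reservoir sampling, so every sample
    seen so far has the same probability of being retained.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def update(self, batch):
        """Offer each sample of a new mini-batch to the buffer."""
        for sample in batch:
            self.seen += 1
            if len(self.samples) < self.capacity:
                self.samples.append(sample)
            else:
                # Keep the new sample with probability capacity / seen.
                j = self.rng.randrange(self.seen)
                if j < self.capacity:
                    self.samples[j] = sample


def augment_minibatch(batch, buffers, num_rehearsal, rng):
    """Append rehearsal samples drawn uniformly from the union of all buffers."""
    pool = [s for buf in buffers for s in buf.samples]
    k = min(num_rehearsal, len(pool))
    return list(batch) + rng.sample(pool, k)


# Usage: two workers, each with its own buffer; integers stand in for samples.
buffers = [LocalRehearsalBuffer(capacity=4, seed=i) for i in range(2)]
buffers[0].update(range(0, 10))
buffers[1].update(range(100, 110))
augmented = augment_minibatch([1000, 1001], buffers, 3, random.Random(0))
```

Drawing from the union of the local buffers, rather than only from a worker's own buffer, is what keeps the rehearsal samples representative of everything seen so far; in a real multi-GPU setting that step would require communication between workers.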
-
KheOps: Cost-effective Repeatability, Reproducibility, and Replicability of Edge-to-Cloud Experiments
Authors:
Daniel Rosendo,
Kate Keahey,
Alexandru Costan,
Matthieu Simonin,
Patrick Valduriez,
Gabriel Antoniu
Abstract:
Distributed infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex scientific workflows to be executed across hybrid systems spanning from IoT Edge devices to Clouds, and sometimes to supercomputers (the Computing Continuum). Understanding the performance trade-offs of large-scale workflows deployed on such a complex Edge-to-Cloud Continuum is challenging. To achieve this, one needs to systematically perform experiments, to enable their reproducibility and allow other researchers to replicate the study and the obtained conclusions on different infrastructures. This boils down to the tedious process of reconciling the numerous experimental requirements and constraints with low-level infrastructure design choices. To address the limitations of the main state-of-the-art approaches for distributed, collaborative experimentation, such as Google Colab, Kaggle, and Code Ocean, we propose KheOps, a collaborative environment specifically designed to enable cost-effective reproducibility and replicability of Edge-to-Cloud experiments. KheOps is composed of three core elements: (1) an experiment repository; (2) a notebook environment; and (3) a multi-platform experiment methodology. We illustrate KheOps with a real-life Edge-to-Cloud application. The evaluations explore the point of view of the authors of an experiment described in an article (who aim to make their experiments reproducible) and the perspective of their readers (who aim to replicate the experiment). The results show how KheOps helps authors to systematically perform repeatable and reproducible experiments on the Grid5000 + FIT IoT LAB testbeds. Furthermore, KheOps helps readers to cost-effectively replicate authors' experiments on different infrastructures, such as the Chameleon Cloud + CHI@Edge testbeds, and obtain the same conclusions with high accuracy (> 88% for all performance metrics).
Submitted 24 July, 2023;
originally announced July 2023.
-
ProvLight: Efficient Workflow Provenance Capture on the Edge-to-Cloud Continuum
Authors:
Daniel Rosendo,
Marta Mattoso,
Alexandru Costan,
Renan Souza,
Débora Pina,
Patrick Valduriez,
Gabriel Antoniu
Abstract:
Modern scientific workflows require hybrid infrastructures combining numerous decentralized resources on the IoT/Edge interconnected to Cloud/HPC systems (aka the Computing Continuum) to enable their optimized execution. Understanding and optimizing the performance of such complex Edge-to-Cloud workflows is challenging. Capturing the provenance of key performance indicators, with their related data and processes, may assist in understanding and optimizing workflow executions. However, the capture overhead can be prohibitive, particularly in resource-constrained devices, such as the ones on the IoT/Edge. To address this challenge, based on a performance analysis of existing systems, we propose ProvLight, a tool to enable efficient provenance capture on the IoT/Edge. We leverage simplified data models, data compression and grouping, and lightweight transmission protocols to reduce overheads. We further integrate ProvLight into the E2Clab framework to enable workflow provenance capture across the Edge-to-Cloud Continuum. This integration makes E2Clab a promising platform for the performance optimization of applications through reproducible experiments. We validate ProvLight at a large scale with synthetic workloads on 64 real-life IoT/Edge devices in the FIT IoT LAB testbed. Evaluations show that ProvLight outperforms state-of-the-art systems like ProvLake and DfAnalyzer in resource-constrained devices. ProvLight is 26-37x faster to capture and transmit provenance data; uses 5-7x less CPU; 2x less memory; transmits 2x less data; and consumes 2-2.5x less energy. ProvLight and E2Clab are available as open-source tools.
Submitted 20 July, 2023;
originally announced July 2023.
-
Distributed intelligence on the Edge-to-Cloud Continuum: A systematic literature review
Authors:
Daniel Rosendo,
Alexandru Costan,
Patrick Valduriez,
Gabriel Antoniu
Abstract:
The explosion of data volumes generated by an increasing number of applications is strongly impacting the evolution of distributed digital infrastructures for data analytics and machine learning (ML). While data analytics used to be mainly performed on cloud infrastructures, the rapid development of IoT infrastructures and the requirements for low-latency, secure processing have motivated the development of edge analytics. Today, to balance various trade-offs, ML-based analytics tends to increasingly leverage an interconnected ecosystem that allows complex applications to be executed on hybrid infrastructures where IoT Edge devices are interconnected to Cloud/HPC systems in what is called the Computing Continuum, the Digital Continuum, or the Transcontinuum. Enabling learning-based analytics on such complex infrastructures is challenging. The large-scale, optimized deployment of learning-based workflows across the Edge-to-Cloud Continuum requires extensive and reproducible experimental analysis of the application execution on representative testbeds. This is necessary to help understand the performance trade-offs that result from combining a variety of learning paradigms and supportive frameworks. A thorough experimental analysis requires the assessment of the impact of multiple factors, such as model accuracy, training time, network overhead, energy consumption, and processing latency, among others. This review aims at providing a comprehensive vision of the main state-of-the-art libraries and frameworks for machine learning and data analytics available today. It describes the main learning paradigms enabling learning-based analytics on the Edge-to-Cloud Continuum. The main simulation, emulation, deployment systems, and testbeds for experimental research on the Edge-to-Cloud Continuum available today are also surveyed. Furthermore, we analyze how the selected systems provide support for experiment reproducibility.
We conclude our review with a detailed discussion of relevant open research challenges and of future directions in this domain, such as: holistic understanding of performance; performance optimization of applications; efficient deployment of Artificial Intelligence (AI) workflows on highly heterogeneous infrastructures; and reproducible analysis of experiments on the Computing Continuum.
Submitted 29 April, 2022;
originally announced May 2022.
-
Enabling Reproducible Analysis of Complex Workflows on the Edge-to-Cloud Continuum
Authors:
Daniel Rosendo,
Alexandru Costan,
Gabriel Antoniu,
Patrick Valduriez
Abstract:
Distributed digital infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex applications to be executed from IoT Edge devices to the HPC Cloud (aka the Computing Continuum, the Digital Continuum, or the Transcontinuum). Understanding end-to-end performance in such a complex continuum is challenging. This boils down to reconciling many, typically contradictory, application requirements and constraints with low-level infrastructure design choices. One important challenge is to accurately reproduce relevant behaviors of a given application workflow and representative settings of the physical infrastructure underlying this complex continuum. We introduce a rigorous methodology for such a process and validate it through E2Clab. It is the first platform to support the complete experimental cycle across the Computing Continuum: deployment, analysis, optimization. Preliminary results with real-life use cases show that E2Clab allows one to understand and improve performance by correlating it to the parameter settings, the resource usage and the specifics of the underlying infrastructure.
Submitted 3 September, 2021;
originally announced September 2021.
-
Reproducible Performance Optimization of Complex Applications on the Edge-to-Cloud Continuum
Authors:
Daniel Rosendo,
Alexandru Costan,
Gabriel Antoniu,
Matthieu Simonin,
Jean-Christophe Lombardo,
Alexis Joly,
Patrick Valduriez
Abstract:
In more and more application areas, we are witnessing the emergence of complex workflows that combine computing, analytics and learning. They often require a hybrid execution infrastructure with IoT devices interconnected to cloud/HPC systems (aka Computing Continuum). Such workflows are subject to complex constraints and requirements in terms of performance, resource usage, energy consumption and financial costs. This makes it challenging to optimize their configuration and deployment. We propose a methodology to support the optimization of real-life applications on the Edge-to-Cloud Continuum. We implement it as an extension of E2Clab, a previously proposed framework supporting the complete experimental cycle across the Edge-to-Cloud Continuum. Our approach relies on a rigorous analysis of possible configurations in a controlled testbed environment to understand their behaviour and related performance trade-offs. We illustrate our methodology by optimizing Pl@ntNet, a world-wide plant identification application. Our methodology can be generalized to other applications in the Edge-to-Cloud Continuum.
Submitted 4 August, 2021;
originally announced August 2021.
-
A Survey of Benchmarks to Evaluate Data Analytics for Smart-* Applications
Authors:
Athanasios Kiatipis,
Alvaro Brandon,
Rizkallah Touma,
Pierre Matri,
Michal Zasadzinski,
Linh Thuy Nhuyen,
Adrien Lebre,
Alexandru Costan
Abstract:
The growth of ubiquitous sensor networks at an accelerating pace cuts across many areas of modern-day life. They enable measuring, inferring, understanding and acting upon a wide variety of indicators, in fields ranging from agriculture to healthcare or to complex urban environments. The applications devoted to this task are designated as Smart-* Applications. They hide a staggering complexity, relying on multiple layers of data collection, transmission, aggregation, analysis and also storage, both at the network edge and on the cloud. Furthermore, Smart-* Applications raise additional specific challenges, such as the need to process and extract knowledge from diverse data, which is flowing at high velocity in near real-time or in the heavily distributed environment they rely on. How to assess the performance of such a complex stack, when faced with the specifics of Smart-* Applications, remains an open research question. In this article, we first detail the key specific characteristics and requirements of Smart-* Applications. Then, for each of these requirements, we describe the benchmarks one can use to precisely evaluate the performance of the underlying systems and technologies. Finally, we identify future research directions related to the open issues in benchmarking Smart-* Applications.
Submitted 4 October, 2019;
originally announced October 2019.
-
An Architectural Model for a Grid based Workflow Management Platform in Scientific Applications
Authors:
Alexandru Costan,
Florin Pop,
Corina Stratan,
Ciprian Dobre,
Catalin Leordeanu,
Valentin Cristea
Abstract:
With the recent increase in the computational and data requirements of scientific applications, the use of large clustered systems as well as distributed resources is inevitable. Although executing large applications in these environments brings increased performance, the automation of the process becomes more and more challenging. While the use of complex workflow management systems has been a viable solution for this automation process in business-oriented environments, the open source engines available for scientific applications lack some functionalities or are too difficult to use for non-specialists. In this work we propose an architectural model for a grid-based workflow management platform providing features like an intuitive way to describe workflows, efficient data handling mechanisms and flexible fault tolerance support. Our integrated solution introduces a workflow engine component based on ActiveBPEL extended with additional functionalities and a scheduling component providing efficient mapping between tasks and available resources.
Submitted 29 June, 2011;
originally announced June 2011.
-
Models and Techniques for Ensuring Reliability, Safety, Availability and Security of Large Scale Distributed Systems
Authors:
Valentin Cristea,
Ciprian Dobre,
Florin Pop,
Corina Stratan,
Alexandru Costan,
Catalin Leordeanu
Abstract:
17th International Conference on Control Systems and Computer Science (CSCS 17), Bucharest, Romania, May 26-29, 2009. Vol. 1, pp. 401-406, ISSN: 2066-4451.
Submitted 28 June, 2011;
originally announced June 2011.
-
Critical Analysis of Middleware Architectures for Large Scale Distributed Systems
Authors:
Florin Pop,
Ciprian Mihai Dobre,
Alexandru Costan,
Mugurel Ionut Andreica,
Eliana-Dina Tirsa,
Corina Stratan,
Valentin Cristea
Abstract:
Distributed computing is increasingly being viewed as the next phase of Large Scale Distributed Systems (LSDSs). However, the vision of large scale resource sharing is not yet a reality in many areas: Grid computing is an evolving area of computing, where standards and technology are still being developed to enable this new paradigm. Hence, in this paper we analyze the current development of middleware tools for LSDSs from multiple perspectives: architecture, applications and market research. For each perspective we are interested in relevant technologies used in ongoing projects, existing products or services and useful design issues. In the end, based on this approach, we draw some conclusions regarding the future research directions in this area.
Submitted 15 October, 2009;
originally announced October 2009.
-
Robust Failure Detection Architecture for Large Scale Distributed Systems
Authors:
Ciprian Mihai Dobre,
Florin Pop,
Alexandru Costan,
Mugurel Ionut Andreica,
Valentin Cristea
Abstract:
Failure detection is a fundamental building block for ensuring fault tolerance in large scale distributed systems. There are many approaches to, and implementations of, failure detectors. Providing flexible failure detection in off-the-shelf distributed systems is difficult. In this paper we present an innovative solution to this problem. Our approach is based on adaptive, decentralized failure detectors, capable of working asynchronously and independently of the application flow. The proposed solution considers an architecture for the failure detectors based on clustering, the use of a gossip-based algorithm for detection at the local level and the use of a hierarchical structure among clusters of detectors along which traffic is channeled. The solution can scale to a large number of nodes, considers the QoS requirements of both applications and resources, and includes fault tolerance and system orchestration mechanisms, added in order to assess the reliability and availability of distributed systems.
Submitted 5 October, 2009;
originally announced October 2009.
-
Towards a Grid Platform for Scientific Workflows Management
Authors:
Alexandru Costan,
Corina Stratan,
Eliana-Dina Tirsa,
Mugurel Ionut Andreica,
Valentin Cristea
Abstract:
Workflow management systems allow the users to develop complex applications at a higher level, by orchestrating functional components without handling the implementation details. Although a wide range of workflow engines are developed in enterprise environments, the open source engines available for scientific applications lack some functionalities or are too difficult to use for non-specialists. Our purpose is to develop a workflow management platform for distributed systems, that will provide features like an intuitive way to describe workflows, efficient data handling mechanisms and flexible fault tolerance support. We introduce here an architectural model for the workflow platform, based on the ActiveBPEL workflow engine, which we propose to augment with an additional set of components.
Submitted 4 October, 2009;
originally announced October 2009.
-
Offline Algorithms for Several Network Design, Clustering and QoS Optimization Problems
Authors:
Mugurel Ionut Andreica,
Eliana-Dina Tirsa,
Alexandru Costan,
Nicolae Tapus
Abstract:
In this paper we address several network design, clustering and Quality of Service (QoS) optimization problems and present novel, efficient, offline algorithms which compute optimal or near-optimal solutions. The QoS optimization problems consist of reliability improvement (by computing backup shortest paths) and network link upgrades (in order to reduce the latency on several paths). The network design problems consist of determining small diameter networks, as well as very well connected and regular network topologies. The network clustering problems consider only the restricted model of static and mobile path networks, for which we were able to develop optimal algorithms.
Submitted 1 June, 2009;
originally announced June 2009.