-
Multi-FedLS: a Framework for Cross-Silo Federated Learning Applications on Multi-Cloud Environments
Authors:
Rafaela C. Brum,
Maria Clicia Stelling de Castro,
Luciana Arantes,
Lúcia Maria de A. Drummond,
Pierre Sens
Abstract:
Federated Learning (FL) is a distributed Machine Learning (ML) technique that can benefit from cloud environments while preserving data privacy. We propose Multi-FedLS, a framework that manages multi-cloud resources, reducing the execution time and financial costs of Cross-Silo Federated Learning applications by using preemptible VMs, which are cheaper than on-demand ones but can be revoked at any time. Our framework comprises four modules: Pre-Scheduling, Initial Mapping, Fault Tolerance, and Dynamic Scheduler. This paper extends our previous work \cite{brum2022sbac} by formally describing the Multi-FedLS resource manager framework and its modules. Experiments were conducted with three Cross-Silo FL applications on CloudLab, and a proof of concept confirms that Multi-FedLS can be executed on a multi-cloud environment composed of AWS and GCP, two commercial cloud providers. Results show that the problem of executing Cross-Silo FL applications in multi-cloud environments with preemptible VMs can be efficiently solved using a mathematical formulation, fault tolerance techniques, and a simple heuristic to choose a new VM in case of revocation.
Submitted 17 August, 2023;
originally announced August 2023.
-
Scheduling Bag-of-Tasks in Clouds using Spot and Burstable Virtual Machines
Authors:
Luan Teylo,
Luciana Arantes,
Pierre Sens,
Lúcia Maria de A. Drummond
Abstract:
Leading cloud providers offer several types of Virtual Machines (VMs) under diverse contract models, with different guarantees in terms of availability and reliability. Among them, the most popular contract models are the on-demand and the spot models. In the former, on-demand VMs are allocated for a fixed cost per time unit, and their availability is ensured during the whole execution. In the spot market, on the other hand, VMs are offered at a huge discount compared to on-demand VMs, but their availability fluctuates according to the cloud's current demand, and the provider can terminate or hibernate a spot VM at any time. Furthermore, in order to cope with workload variations, cloud providers have also introduced the concept of burstable VMs, which are able to burst above their respective baseline CPU performance during a limited period of time, with a discount of up to 20% compared to an equivalent non-burstable on-demand VM. In the current work, we present the Burst Hibernation-Aware Dynamic Scheduler (Burst-HADS), a framework that schedules and executes tasks of Bag-of-Tasks applications with deadline constraints by exploiting spot and on-demand burstable VMs, aiming at minimizing both the monetary cost and the execution time. Based on the ILS metaheuristic, Burst-HADS defines an initial scheduling map of tasks to VMs, which can then be dynamically altered by migrating tasks of a hibernated spot VM or by performing work-stealing when VMs become idle. Performance results on the Amazon EC2 cloud with different applications show that, when compared to a solution that uses only regular on-demand instances, Burst-HADS reduces the monetary cost of the execution and meets the application deadline even in scenarios with high spot hibernation rates. It also reduces the total execution time when compared to a solution that uses only spot and non-burstable on-demand instances.
Submitted 10 November, 2020;
originally announced November 2020.
-
A Bag-of-Tasks Scheduler Tolerant to Temporal Failures in Clouds
Authors:
Luan Teylo,
Lúcia Maria de A. Drummond,
Luciana Arantes,
Pierre Sens
Abstract:
Cloud platforms have emerged as a prominent environment to execute high performance computing (HPC) applications, providing on-demand resources as well as scalability. They usually offer different classes of Virtual Machines (VMs) with different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs, while spot VMs are unused instances available at a lower price. Despite the monetary advantages, a spot VM can be terminated, stopped, or hibernated by EC2 at any moment.
Using both hibernation-prone spot VMs (for the sake of cost) and on-demand VMs, we propose in this paper a static scheduling approach for HPC applications composed of independent tasks (bag-of-tasks) with deadline constraints. If a spot VM hibernates and does not resume within a time that guarantees the application's deadline, a temporal failure takes place. Our scheduling thus aims at minimizing the monetary costs of bag-of-tasks applications in the EC2 cloud, respecting their deadlines and avoiding temporal failures. To this end, our algorithm statically creates two scheduling maps: (i) the first one contains, for each task, its starting time and the VM on which the task should execute (i.e., an available spot or on-demand VM with the current lowest price); (ii) the second one contains, for each task allocated to a spot VM in the first map, its starting time and the on-demand VM on which it should be executed to meet the application deadline and thereby avoid temporal failures. The latter is used whenever the hibernation period of a spot VM exceeds a time limit.
Performance results from simulation with task execution traces, the configuration of Amazon EC2 VM classes, and VM market history confirm the effectiveness of our scheduling and that it tolerates temporal failures.
Submitted 24 October, 2018;
originally announced October 2018.
-
A Quantitative Model for Predicting Cross-application Interference in Virtual Environments
Authors:
Maicon Melo Alves,
Lúcia Maria de Assumpção Drummond
Abstract:
Cross-application interference can drastically affect the performance of HPC applications running in clouds. This problem is caused by concurrent access by co-located applications to shared and non-sliceable resources such as cache and memory. To address this issue, some works adopted a qualitative approach that does not take into account the amount of access to shared resources. In addition, a few works, even when considering the amount of access, evaluated only SLLC access contention as the root of this problem. However, our experiments revealed that interference is intrinsically related to the amount of simultaneous access to shared resources, and that other shared resources, apart from the SLLC, can also influence the interference suffered by co-located applications. In this paper, we present a quantitative model for predicting cross-application interference in virtual environments. Our proposed model takes into account the amount of simultaneous access to SLLC, DRAM, and the virtual network, as well as the similarity of the applications' access burden, to predict the level of interference suffered by applications when co-located on the same physical machine. Experiments considering a real petroleum reservoir simulator and applications from the HPCC benchmark showed that our model reached average and maximum prediction errors of around 4\% and 12\%, respectively, besides achieving an error of less than 10\% in approximately 96\% of all tested cases.
Submitted 13 October, 2016;
originally announced October 2016.
-
Solving the Quadratic Assignment Problem on heterogeneous environment (CPUs and GPUs) with the application of Level 2 Reformulation and Linearization Technique
Authors:
Alexandre Domingues Gonçalves,
Artur Alves Pessoa,
Lúcia Maria de Assumpção Drummond,
Cristiana Bentes,
Ricardo Farias
Abstract:
The Quadratic Assignment Problem (QAP) is a classic, widely studied combinatorial optimization problem classified as NP-hard. It consists of assigning N facilities to N locations in a one-to-one relation, aiming to minimize the costs of displacement between the facilities. Applying the Reformulation and Linearization Technique (RLT) to the QAP leads to a tight linear relaxation that is nevertheless large and difficult to solve. Previous works based on level 3 RLT needed about 700GB of working memory to process one large instance (N = 30 facilities). We present a modified version of the algorithm proposed by Adams et al., based on level 2 RLT, which executes on heterogeneous systems (CPUs and GPUs). For some instances, our algorithm is up to 140 times faster and occupies 97% less memory than the level 3 RLT version. The proposed algorithm was able to solve two instances for the first time: tai35b and tai40b.
Submitted 7 October, 2015;
originally announced October 2015.