-
FLIPS: Federated Learning using Intelligent Participant Selection
Authors:
Rahul Atul Bhope,
K. R. Jayaram,
Nalini Venkatasubramanian,
Ashish Verma,
Gegi Thomas
Abstract:
This paper presents the design and implementation of FLIPS, a middleware system to manage data and participant heterogeneity in federated learning (FL) training workloads. In particular, we examine the benefits of label distribution clustering on participant selection in federated learning. FLIPS clusters parties involved in an FL training job based on the label distribution of their data apriori,…
▽ More
This paper presents the design and implementation of FLIPS, a middleware system to manage data and participant heterogeneity in federated learning (FL) training workloads. In particular, we examine the benefits of label distribution clustering on participant selection in federated learning. FLIPS clusters parties involved in an FL training job based on the label distribution of their data apriori, and during FL training, ensures that each cluster is equitably represented in the participants selected. FLIPS can support the most common FL algorithms, including FedAvg, FedProx, FedDyn, FedOpt and FedYogi. To manage platform heterogeneity and dynamic resource availability, FLIPS incorporates a straggler management mechanism to handle changing capacities in distributed, smart community applications. Privacy of label distributions, clustering and participant selection is ensured through a trusted execution environment (TEE). Our comprehensive empirical evaluation compares FLIPS with random participant selection, as well as three other "smart" selection mechanisms - Oort, TiFL and gradient clustering using two real-world datasets, two benchmark datasets, two different non-IID distributions and three common FL algorithms (FedYogi, FedProx and FedAvg). We demonstrate that FLIPS significantly improves convergence, achieving higher accuracy by 17 - 20 % with 20 - 60 % lower communication costs, and these benefits endure in the presence of straggler participants.
△ Less
Submitted 30 September, 2023; v1 submitted 7 August, 2023;
originally announced August 2023.
-
Just-in-Time Aggregation for Federated Learning
Authors:
K. R. Jayaram,
Ashish Verma,
Gegi Thomas,
Vinod Muthusamy
Abstract:
The increasing number and scale of federated learning (FL) jobs necessitates resource efficient scheduling and management of aggregation to make the economics of cloud-hosted aggregation work. Existing FL research has focused on the design of FL algorithms and optimization, and less on the efficacy of aggregation. Existing FL platforms often employ aggregators that actively wait for model updates.…
▽ More
The increasing number and scale of federated learning (FL) jobs necessitates resource efficient scheduling and management of aggregation to make the economics of cloud-hosted aggregation work. Existing FL research has focused on the design of FL algorithms and optimization, and less on the efficacy of aggregation. Existing FL platforms often employ aggregators that actively wait for model updates. This wastes computational resources on the cloud, especially in large scale FL settings where parties are intermittently available for training.
In this paper, we propose a new FL aggregation paradigm -- "just-in-time" (JIT) aggregation that leverages unique properties of FL jobs, especially the periodicity of model updates, to defer aggregation as much as possible and free compute resources for other FL jobs or other datacenter workloads. We describe a novel way to prioritize FL jobs for aggregation, and demonstrate using multiple datasets, models and FL aggregation algorithms that our techniques can reduce resource usage by 60+\% when compared to eager aggregation used in existing FL platforms. We also demonstrate that using JIT aggregation has negligible overhead and impact on the latency of the FL job.
△ Less
Submitted 20 August, 2022;
originally announced August 2022.
-
Adaptive Aggregation For Federated Learning
Authors:
K. R. Jayaram,
Vinod Muthusamy,
Gegi Thomas,
Ashish Verma,
Mark Purcell
Abstract:
Advances in federated learning (FL) algorithms,along with technologies like differential privacy and homomorphic encryption, have led to FL being increasingly adopted and used in many application domains. This increasing adoption has led to rapid growth in the number, size (number of participants/parties) and diversity (intermittent vs. active parties) of FL jobs. Many existing FL systems, based o…
▽ More
Advances in federated learning (FL) algorithms,along with technologies like differential privacy and homomorphic encryption, have led to FL being increasingly adopted and used in many application domains. This increasing adoption has led to rapid growth in the number, size (number of participants/parties) and diversity (intermittent vs. active parties) of FL jobs. Many existing FL systems, based on centralized (often single) model aggregators are unable to scale to handle large FL jobs and adapt to parties' behavior.
In this paper, we present a new scalable and adaptive architecture for FL aggregation. First, we demonstrate how traditional tree overlay based aggregation techniques (from P2P, publish-subscribe and stream processing research) can help FL aggregation scale, but are ineffective from a resource utilization and cost standpoint. Next, we present the design and implementation of AdaFed, which uses serverless/cloud functions to adaptively scale aggregation in a resource efficient and fault tolerant manner. We describe how AdaFed enables FL aggregation to be dynamically deployed only when necessary, elastically scaled to handle participant joins/leaves and is fault tolerant with minimal effort required on the (aggregation) programmer side. We also demonstrate that our prototype based on Ray scales to thousands of participants, and is able to achieve a >90% reduction in resource requirements and cost, with minimal impact on aggregation latency.
△ Less
Submitted 6 November, 2022; v1 submitted 22 March, 2022;
originally announced March 2022.
-
Separation of Powers in Federated Learning
Authors:
Pau-Chen Cheng,
Kevin Eykholt,
Zhongshu Gu,
Hani Jamjoom,
K. R. Jayaram,
Enriquillo Valdez,
Ashish Verma
Abstract:
Federated Learning (FL) enables collaborative training among mutually distrusting parties. Model updates, rather than training data, are concentrated and fused in a central aggregation server. A key security challenge in FL is that an untrustworthy or compromised aggregation process might lead to unforeseeable information leakage. This challenge is especially acute due to recently demonstrated att…
▽ More
Federated Learning (FL) enables collaborative training among mutually distrusting parties. Model updates, rather than training data, are concentrated and fused in a central aggregation server. A key security challenge in FL is that an untrustworthy or compromised aggregation process might lead to unforeseeable information leakage. This challenge is especially acute due to recently demonstrated attacks that have reconstructed large fractions of training data from ostensibly "sanitized" model updates.
In this paper, we introduce TRUDA, a new cross-silo FL system, employing a trustworthy and decentralized aggregation architecture to break down information concentration with regard to a single aggregator. Based on the unique computational properties of model-fusion algorithms, all exchanged model updates in TRUDA are disassembled at the parameter-granularity and re-stitched to random partitions designated for multiple TEE-protected aggregators. Thus, each aggregator only has a fragmentary and shuffled view of model updates and is oblivious to the model architecture. Our new security mechanisms can fundamentally mitigate training reconstruction attacks, while still preserving the final accuracy of trained models and kee** performance overheads low.
△ Less
Submitted 19 May, 2021;
originally announced May 2021.
-
MYSTIKO : : Cloud-Mediated, Private, Federated Gradient Descent
Authors:
K. R. Jayaram,
Archit Verma,
Ashish Verma,
Gegi Thomas,
Colin Sutcher-Shepard
Abstract:
Federated learning enables multiple, distributed participants (potentially on different clouds) to collaborate and train machine/deep learning models by sharing parameters/gradients. However, sharing gradients, instead of centralizing data, may not be as private as one would expect. Reverse engineering attacks on plaintext gradients have been demonstrated to be practically feasible. Existing solut…
▽ More
Federated learning enables multiple, distributed participants (potentially on different clouds) to collaborate and train machine/deep learning models by sharing parameters/gradients. However, sharing gradients, instead of centralizing data, may not be as private as one would expect. Reverse engineering attacks on plaintext gradients have been demonstrated to be practically feasible. Existing solutions for differentially private federated learning, while promising, lead to less accurate models and require nontrivial hyperparameter tuning.
In this paper, we examine the use of additive homomorphic encryption (specifically the Paillier cipher) to design secure federated gradient descent techniques that (i) do not require addition of statistical noise or hyperparameter tuning, (ii) does not alter the final accuracy or utility of the final model, (iii) ensure that the plaintext model parameters/gradients of a participant are never revealed to any other participant or third party coordinator involved in the federated learning job, (iv) minimize the trust placed in any third party coordinator and (v) are efficient, with minimal overhead, and cost effective.
△ Less
Submitted 1 December, 2020;
originally announced December 2020.
-
Effective Elastic Scaling of Deep Learning Workloads
Authors:
Vaibhav Saxena,
K. R. Jayaram,
Saurav Basu,
Yogish Sabharwal,
Ashish Verma
Abstract:
The increased use of deep learning (DL) in academia, government and industry has, in turn, led to the popularity of on-premise and cloud-hosted deep learning platforms, whose goals are to enable organizations utilize expensive resources effectively, and to share said resources among multiple teams in a fair and effective manner.
In this paper, we examine the elastic scaling of Deep Learning (DL)…
▽ More
The increased use of deep learning (DL) in academia, government and industry has, in turn, led to the popularity of on-premise and cloud-hosted deep learning platforms, whose goals are to enable organizations utilize expensive resources effectively, and to share said resources among multiple teams in a fair and effective manner.
In this paper, we examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms and propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization. We begin by analyzing DL workloads and exploit the fact that DL jobs can be run with a range of batch sizes without affecting their final accuracy. We formulate an optimization problem that explores a dynamic batch size allocation to individual DL jobs based on their scaling efficiency, when running on multiple nodes. We design a fast dynamic programming based optimizer to solve this problem in real-time to determine jobs that can be scaled up/down, and use this optimizer in an autoscaler to dynamically change the allocated resources and batch sizes of individual DL jobs.
We demonstrate empirically that our elastic scaling algorithm can complete up to $\approx 2 \times$ as many jobs as compared to a strong baseline algorithm that also scales the number of GPUs but does not change the batch size. We also demonstrate that the average completion time with our algorithm is up to $\approx 10 \times$ faster than that of the baseline.
△ Less
Submitted 24 June, 2020;
originally announced June 2020.
-
FfDL : A Flexible Multi-tenant Deep Learning Platform
Authors:
K. R. Jayaram,
Vinod Muthusamy,
Parijat Dube,
Vatche Ishakian,
Chen Wang,
Benjamin Herta,
Scott Boag,
Diana Arroyo,
Asser Tantawi,
Archit Verma,
Falk Pollok,
Rania Khalaf
Abstract:
Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and accurate. As a result, large scale on-premise and cloud-hosted deep learning platforms have become essential infrastructure in many organizations. These…
▽ More
Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and accurate. As a result, large scale on-premise and cloud-hosted deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale.
This paper describes the design, implementation and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility and efficiency. We examine FfDL qualitatively through a retrospective look at the lessons learned from building, operating, and supporting FfDL; and quantitatively through a detailed empirical evaluation of FfDL, including the overheads introduced by the platform for various deep learning models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed including unanticipated faults, and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.
△ Less
Submitted 14 September, 2019;
originally announced September 2019.
-
Elastic Remote Methods
Authors:
K. R. Jayaram
Abstract:
For distributed applications to take full advantage of cloud computing systems, we need middleware systems
that allow developers to build elasticity management components right into the applications.
This paper describes the design and implementation of ElasticRMI, a middleware system that (1) enables application developers to dynamically change the number of (server) objects available to hand…
▽ More
For distributed applications to take full advantage of cloud computing systems, we need middleware systems
that allow developers to build elasticity management components right into the applications.
This paper describes the design and implementation of ElasticRMI, a middleware system that (1) enables application developers to dynamically change the number of (server) objects available to handle remote method invocations with respect to the application's workload, without requiring major changes to clients (invokers) of remote methods, (2) enables flexible elastic scaling by allowing developers to use a combination of resource utilization metrics and fine-grained application-specific information like the properties of internal data structures to drive scaling decisions, (3) provides a high-level programming framework that handles elasticity at the level of classes and objects, masking low-level platform specific tasks (like provisioning VM images) from the developer, and (4) increases the portability of ElasticRMI applications across different private data centers/IaaS clouds through Apache Mesos.
△ Less
Submitted 7 September, 2019;
originally announced September 2019.
-
Dependability in a Multi-tenant Multi-framework Deep Learning as-a-Service Platform
Authors:
Scott Boag,
Parijat Dube,
Kaoutar El Maghraoui,
Benjamin Herta,
Waldemar Hummer,
K. R. Jayaram,
Rania Khalaf,
Vinod Muthusamy,
Michael Kalantar,
Archit Verma
Abstract:
Deep learning (DL), a form of machine learning, is becoming increasingly popular in several application domains. As a result, cloud-based Deep Learning as a Service (DLaaS) platforms have become an essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale.
This paper explores dependability in the context of a DLaaS platform used…
▽ More
Deep learning (DL), a form of machine learning, is becoming increasingly popular in several application domains. As a result, cloud-based Deep Learning as a Service (DLaaS) platforms have become an essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale.
This paper explores dependability in the context of a DLaaS platform used in IBM. We begin by explaining how DL training workloads are different, and what features ensure dependability in this context. We then describe the architecture, design and implementation of a cloud-based orchestration system for DL training. We show how this system has been architected with dependability in mind while also being horizontally scalable, elastic, flexible and efficient. We also present an initial empirical evaluation of the overheads introduced by our platform, and discuss tradeoffs between efficiency and dependability.
△ Less
Submitted 17 May, 2018;
originally announced May 2018.
-
IBM Deep Learning Service
Authors:
Bishwaranjan Bhattacharjee,
Scott Boag,
Chandani Doshi,
Parijat Dube,
Ben Herta,
Vatche Ishakian,
K. R. Jayaram,
Rania Khalaf,
Avesh Krishna,
Yu Bo Li,
Vinod Muthusamy,
Ruchir Puri,
Yufei Ren,
Florian Rosenberg,
Seetharami R. Seelam,
Yandong Wang,
Jian Ming Zhang,
Li Zhang
Abstract:
Deep learning driven by large neural network models is overtaking traditional machine learning methods for understanding unstructured and perceptual data domains such as speech, text, and vision. At the same time, the "as-a-Service"-based business model on the cloud is fundamentally transforming the information technology industry. These two trends: deep learning, and "as-a-service" are colliding…
▽ More
Deep learning driven by large neural network models is overtaking traditional machine learning methods for understanding unstructured and perceptual data domains such as speech, text, and vision. At the same time, the "as-a-Service"-based business model on the cloud is fundamentally transforming the information technology industry. These two trends: deep learning, and "as-a-service" are colliding to give rise to a new business model for cognitive application delivery: deep learning as a service in the cloud. In this paper, we will discuss the details of the software architecture behind IBM's deep learning as a service (DLaaS). DLaaS provides developers the flexibility to use popular deep learning libraries such as Caffe, Torch and TensorFlow, in the cloud in a scalable and resilient manner with minimal effort. The platform uses a distribution and orchestration layer that facilitates learning from a large amount of data in a reasonable amount of time across compute nodes. A resource provisioning layer enables flexible job management on heterogeneous resources, such as graphics processing units (GPUs) and central processing units (CPUs), in an infrastructure as a service (IaaS) cloud.
△ Less
Submitted 18 September, 2017;
originally announced September 2017.