Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization

Almasi, Hamidreza; Mishra, Harsh; Vamanan, Balajee; Ravi, Sathya N.

arXiv:2302.05865 (cs)

[Submitted on 12 Feb 2023 (v1), last revised 25 Sep 2023 (this version, v2)]

Abstract: Modern ML applications increasingly rely on complex deep learning models and large datasets. The amount of computation needed to train the largest models has grown exponentially. Therefore, to scale computation and data, these models are inevitably trained in a distributed manner on clusters of nodes, and their updates are aggregated before being applied to the model. However, a distributed setup is prone to Byzantine failures of individual nodes, components, and software. With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems. We define the quality of workers as reconstruction ratios $\in (0,1]$, and formulate aggregation as a Maximum Likelihood Estimation procedure using Beta densities. We show that the regularized form of the log-likelihood with respect to the subspace can be approximately solved using an iterative least squares solver, and provide convergence guarantees using recent Convex Optimization landscape results. Our empirical findings demonstrate that our approach significantly enhances the robustness of state-of-the-art Byzantine-resilient aggregators. We evaluate our method in a distributed setup with a parameter server, and show simultaneous improvements in communication efficiency and accuracy across various tasks. The code is publicly available at this https URL
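
The abstract's core idea, scoring each worker by how well a shared low-dimensional subspace reconstructs its update, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the subspace dimension `m`, and the reweighting rule are assumptions made for illustration, and the reweighting is a simple stand-in for the regularized Beta log-likelihood objective described in the paper. The sketch assumes worker gradients are stacked as columns of a matrix `G`, takes the reconstruction ratio of worker i to be the fraction of its squared norm captured by an orthonormal basis `Y`, and refits `Y` with an iteratively reweighted least squares (IRLS) loop.

```python
import numpy as np

def reconstruction_ratios(G, Y):
    """Fraction of each column's squared norm captured by span(Y).

    G: (n, p) matrix whose columns are worker gradients (assumed layout).
    Y: (n, m) matrix with orthonormal columns.
    """
    proj_sq = np.sum((Y.T @ G) ** 2, axis=0)
    norm_sq = np.maximum(np.sum(G ** 2, axis=0), 1e-12)
    return proj_sq / norm_sq

def subspace_aggregate(G, m=1, iters=10, eps=1e-3):
    """Hypothetical IRLS-style subspace fit followed by a projected mean."""
    p = G.shape[1]
    # Normalize columns so that large-norm outliers cannot dominate the fit.
    U = G / np.maximum(np.linalg.norm(G, axis=0), 1e-12)
    w = np.ones(p)
    for _ in range(iters):
        # Weighted least squares step: top-m left singular vectors.
        left, _, _ = np.linalg.svd(U * np.sqrt(w), full_matrices=False)
        Y = left[:, :m]
        r = reconstruction_ratios(G, Y)
        # Illustrative reweighting (an assumption, not the paper's rule):
        # workers that are well reconstructed receive larger weights.
        residual = np.sqrt(np.clip(1.0 - r, 0.0, 1.0))
        w = 1.0 / np.maximum(residual, eps)
    # Aggregate: project the plain mean of the updates onto the subspace.
    agg = Y @ (Y.T @ G.mean(axis=1))
    return agg, Y, r

# Toy example: four identical honest workers and two large outliers.
e = np.eye(5)
G = np.column_stack([e[0], e[0], e[0], e[0], 10 * e[1], -10 * e[2]])
agg, Y, ratios = subspace_aggregate(G, m=1)
print(ratios)  # honest workers score near 1, outliers near 0
print(agg)     # outlier directions are projected away
```

In this toy example the fitted subspace aligns with the honest direction, so the outlier components vanish from the aggregate, although the result is still scaled down by averaging over all six workers. A practical aggregator would need to handle that scaling and the choice of `m`; both are left open here.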

Subjects:	Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2302.05865 [cs.LG]
	(or arXiv:2302.05865v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2302.05865

Submission history

From: Hamidreza Almasi
[v1] Sun, 12 Feb 2023 06:38:30 UTC (3,613 KB)
[v2] Mon, 25 Sep 2023 00:02:10 UTC (5,241 KB)