MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

Hsia, Samuel; Golden, Alicia; Acun, Bilge; Ardalani, Newsha; DeVito, Zachary; Wei, Gu-Yeon; Brooks, David; Wu, Carole-Jean

arXiv:2310.02784v2 (cs)

[Submitted on 4 Oct 2023 (v1), revised 18 Oct 2023 (this version, v2), latest version 10 Jun 2024 (v3)]

Abstract: Training and deploying large machine learning (ML) models is time-consuming and requires significant distributed computing infrastructure. Based on real-world large model training on datacenter-scale infrastructures, we show that 14-32% of all GPU hours are spent on communication with no overlapping computation. To minimize the outstanding communication latency, in this work, we develop an agile performance modeling framework to guide parallelization and hardware-software co-design strategies. Using a suite of real-world large ML models on state-of-the-art GPU training hardware, we demonstrate 2.24x and 5.27x throughput improvement potential for pre-training and inference scenarios, respectively.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Cite as:	arXiv:2310.02784 [cs.DC]
	(or arXiv:2310.02784v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2310.02784

Submission history

From: Bilge Acun
[v1] Wed, 4 Oct 2023 13:00:53 UTC (5,737 KB)
[v2] Wed, 18 Oct 2023 15:29:21 UTC (5,735 KB)
[v3] Mon, 10 Jun 2024 20:31:07 UTC (8,267 KB)
