Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers

Li, Youjie; Phanishayee, Amar; Murray, Derek; Tarnawski, Jakub; Kim, Nam Sung

doi:10.14778/3551793.3551828

arXiv:2202.01306 (cs)

[Submitted on 2 Feb 2022 (v1), last revised 1 Aug 2022 (this version, v2)]

Abstract: Deep neural networks (DNNs) have grown exponentially in size over the past decade, leaving only those who have massive datacenter-based resources with the ability to develop and train such models. One of the main challenges for the long tail of researchers who might have only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training massive DNN models can often exceed the aggregate capacity of all available GPUs on a single server; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training massive models efficiently on a single commodity server. Across various massive DNN models, Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6x over highly optimized baselines with virtualized memory.

Comments:	Accepted at VLDB 2022
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2202.01306 [cs.DC]
	(or arXiv:2202.01306v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2202.01306
Related DOI:	https://doi.org/10.14778/3551793.3551828

Submission history

From: Youjie Li
[v1] Wed, 2 Feb 2022 22:16:27 UTC (9,443 KB)
[v2] Mon, 1 Aug 2022 17:12:36 UTC (8,883 KB)
