HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: scalerel

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: CC BY-NC-ND 4.0
arXiv:2211.16648v2 [cs.DC] 14 Mar 2024

\TheName: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

Divya Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Bambhaniya, Tushar Krishna, Alexandros Daglis Georgia Institute of Technology
Abstract

Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization—to amortize their steep cost—is a challenging task requiring careful balance of compute, memory, and network resources. Moreover, a plethora of each model’s tuning knobs drastically affect the performance, with optimal values often depending on the underlying cluster’s characteristics, which necessitates a complex cluster-workload co-design process. To facilitate the design space exploration of such massive DL training clusters, we introduce \TheName, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training. We develop a step-by-step process to establish a reusable and flexible methodology, and demonstrate its application with case studies of training large models on cluster configurations of variable compute, memory, and network resources. Our case studies demonstrate \TheName’s utility in identifying promising architectural optimization directions and guiding system designers in configuring key model and cluster parameters. To illustrate, cluster configuration comparisons identify performance differences of up to 7.7×7.7\times7.7 × and highlight performance optimization opportunities of up to 1.4×1.4\times1.4 × when employing memory expansion as an optimization technique.

I Introduction

Modern Deep Learning (DL) workloads such as natural language processing [68, 78], drug discovery [19, 72], text-to-speech conversion [3, 60, 70], and personal recommendation engines [46, 83] are becoming increasingly pervasive and commercially important, forming the core components of many day-to-day applications and services deployed in datacenters. The rapid growth of these models has culminated in massive resource requirements (terabytes of memory, petaflops of compute) to train in a practical timeframe. Training is therefore conducted in a distributed fashion over massive clusters of high-end specialized accelerator nodes, such as GPUs or TPUs, connected over high-bandwidth networks [28, 44]. Given the abundance of available compute (GPU, TPU, custom accelerator), network (Ethernet, InfiniBand, NVLink), and memory system (HBM, DRAM over DDR or CXL) technologies, as well as the steep cost of such clusters, designing them for maximum performance and efficiency is a challenge of high complexity and critical importance.

Optimizing a cluster for distributed DL training requires keen understanding of key model characteristics, training strategies, and hardware components. Maximizing efficiency not only requires carefully balancing the cluster’s compute, memory, and network characteristics, but also adjusting the model’s parallelization strategy to best suit those underlying hardware resources. Therefore, a holistic, end-to-end methodology is needed to analyze and understand the impact of different system components and training strategies on each cluster’s DL training performance and efficiency.

Refer to caption
Figure 1: Tunable cluster component parameters in \TheNameand value ranges evaluated in this paper.

To address this need, we introduce the \TheNamemethodology, which allows rapid joint exploration of cluster resource provisioning and training parallelization strategies. \TheNameincludes a detailed workflow to break down a model into its layers, analyze its training behavior as a function of the employed parallelization strategy, and assess the impact of a cluster’s key resources on training performance and efficiency. \TheNameis a versatile tool that helps cluster designers determine the optimal resource provisioning balance for a set of target training algorithms. The methodology also enables researchers and technologists to study and quickly quantify the impact of emerging technologies or of any modification of a given component’s characteristics (e.g., per-node compute density, memory capacity or bandwidth, network bandwidth or latency, etc.) on distributed DL training performance and efficiency. For instance, a timely question is whether upcoming CXL-enabled memory expansion can be leveraged to improve a cluster’s DL training performance. We use \TheNameto determine the capacity and bandwidth characteristics required by such a new memory solution to be impactful. To illustrate, Fig. 1 summarizes the hardware components of the clusters modeled in \TheName, along with the range of values evaluated in this paper. Overall, by enabling rapid and holistic studies, \TheNameinforms cluster designers with a resource provisioning balance that maximizes training efficiency for a target set of DL models.

In summary, we make the following contributions:

  • We construct the holistic \TheNamemethodology to enable rapid design space co-exploration of model training parallelization strategies and key cluster resource parameters, to assess their joint impact on the performance of distributed DL training.

  • We implement our methodology with a streamlined toolchain and showcase its utility with case studies using different large Deep Learning Recommendation Models (DLRM) and language (Transformer) models.

  • \TheName

    helps system architects rapidly quantify the impact of various current and future accelerators, network capabilities, and memory system technologies on cluster performance and efficiency. We particularly emphasize the potential of memory expansion techniques (e.g., CXL-attached memory) as an understudied yet promising cluster design knob.

Paper outline: § II covers background and related work. § III details \TheName’s methodology and § IV its implementation. § V demonstrates \TheName’s utility with use cases and § VI concludes.

II Background and Related Work

II-A Distributed DL Parallelization Strategies

In recent years, we have witnessed tremendous growth in DL model sizes, with current largest models already trending in the 100s of billions or trillions of parameters [15, 38, 68]. Training models of such size requires tremendous computational and memory resources, only attainable on massive clusters with hundreds of (typically GPU-based) computing nodes. Given a cluster deployment, there is a range of parallelization strategies to choose for distributed DL training such as Data Parallelism (DP) [37], Model Parallelism (MP) [65, 66], Pipeline Parallelism (PP) [21], and more general parallelization strategies (e.g., expert parallelism [15]). \TheNamecurrently focuses on analyzing the effects of the two foundational parallelization strategies of MP and DP111Fully Sharded Data Parallelism (FSDP) [4, 50, 84] is a popular training strategy logically similar to DP, but stages data trough the GPU’s local memory and its adjacent CPU host’s memory. \TheNamecaptures such staging between local and expanded memory, hence representing FSDP-style training as well., but its modular design allows future extension to model other strategies as well.

In MP training, the model is split across the nodes, hence each node holds a shard of the model, requiring frequent inter-node communication both during forward and backward propagation. In contrast, in DP training, each node holds the entire model and the training batch is sharded across nodes. Inter-node communication is only required on backward propagation to reconcile weight updates across the multiple model copies. As a result, MP training generally requires more frequent synchronous inter-node communication, while DP training results in less frequent but more voluminous communication, which is easier to overlap with computations.

II-B Challenges in Distributed DL Training

The performance and efficiency of a cluster used in distributed DL training is dictated by a range of key parameters: per-node computational capability, memory capacity and bandwidth, and the inter-node network bandwidth. In addition to raw hardware capabilities, a training task’s performance is affected by the training task’s structure (i.e., the chosen parallelization strategy), as that affects the resulting workload characteristics and, consequently, how each cluster resource is being stressed. Limiting our scope to the two foundational MP and DP parallelization strategies, the chosen MP/DP balance for a given training task drastically affects the workload’s characteristics, bearing significant implications in the resulting computation/communication balance, per-node memory capacity requirements, and, ultimately, the distributed training’s performance and efficiency. Therefore, optimizing training performance requires holistic co-design of the parallelization strategy alongside the cluster’s memory, compute, and network resources. The \TheNamemethodology enables composite studies where cluster resource provisioning strategies and parallelization strategies for a training task are jointly varied.

II-C Prior Work on Distributed DL Training

DL training accelerator design. A vast body of prior work focuses on optimizing the individual accelerator node (GPU or custom accelerator) [8, 9, 29, 33, 51, 52, 63]. However, node design in isolation does not capture cluster-scale effects during distributed training and can lead to resource imbalance and cluster under-utilization. Maximizing cluster-wide performance and utilization requires a holistic design approach that jointly considers the impact of compute, network, memory, and workload parallelization strategy.

Cluster-scale DL training performance analysis. Closer to our work, some prior work characterizes the performance of large-scale distributed training on current clusters. Jain et al. evaluate the performance of training ResNet and Inception-based models on different CPU and GPU architectures, as a function of batch size, node count, and threads per node [23]. Ren et al. analyze the distributed training performance of NLP and computer vision workloads leading-edge systems, providing insights into the collective communication kernels and the impact of node count scaling on overall training throughput [61]. Jeon et al. focus on multi-tenant GPU clusters hosting co-located DNN training workloads, study how locality-aware scheduling affects performance and utilization, and propose guidelines for improved cluster schedulers [25].

Cluster communication performance optimizations. Another family of work focuses on cluster communication performance, which is often a major performance determinant. Dong et al. propose an algorithm/system co-design methodology to improve training scalability, by alleviating network congestion with a congestion-less server architecture that uses novel communication collective algorithms and network topology [14]. Jiang et al. accelerate DNN training with a unified communication framework that dynamically adapts reduction collectives and parameter server tasks to best utilize CPU and bandwidth resources, and overlap communication latency [27]. Shah et al. and Sun et al. propose communication abstraction and improvised communication algorithms [64, 69].

Memory system impact on training. Memory system design and optimizations play a crucial role in the overall performance and efficiency of a distributed training cluster, as well as the cluster size required to train a given large model. Prior works such as Checkmate [24], ZeRO-DP [53], ZeRO-Offload [59], and ZeRO-Infinity [54] focus on reducing the per-node memory footprint required to train large DL models.

Training strategy auto-tuning frameworks. Prior works on auto-tuning frameworks (Alpa [85], AutoDDL [7], Rhino [82], FlexFlow [26]) focus on identifying the optimal parallelization strategy using a combination of data, operator and communication patterns. These frameworks perform an exhaustive design space search to identify the optimal parallelization strategy for a given cluster. While auto-tuning frameworks predict the optimal parallelization strategy for a DL model training task on a specific cluster, \TheNamedevelops a generic methodology to jointly evaluate the performance of a training strategy on an arbitrary cluster at an earlier design stage: when a cluster’s architecture is still malleable, i.e., going under design considerations. Our goal is not to identify the best parallelization strategy for training a given DL model on an existing cluster, but rather to enable rapid design space exploration and evaluation of impact of various model parallelization strategies along with key architectural cluster design parameters. \TheNamecurrently does not automate the workload and cluster design space exploration. A future extension could add a frontend layer in the toolchain that automates generation of workload input files (representing different parallelization strategies) and cluster configurations (i.e., steps 2⃝ and 5⃝ in Fig. 5, described in § IV).

All aforementioned categories of prior work provide valuable insights to understand the performance characteristics of individual components of a large training cluster. However, a cluster design approach that jointly considers the training parallelization strategy along with the cluster’s key architectural parameters is essential to not only maximize performance and efficiency, but also identify the true impact any individual or combination of future changes will have on the metrics of interest. We, therefore, argue for a holistic cluster design methodology encapsulating: (i) the target model to be trained with the range of possible parallelization strategies, each node’s (ii) compute and (iii) memory subsystem, and (iv) the cluster’s network222In this paper, the term “node” refers to one compute unit (e.g., a GPU, a CPU, a TPU, etc.).. Understanding these tradeoffs at scale is challenging, and we are not aware of a publicly available unified framework that allows researchers to perform such broad design space exploration quickly. \TheNameis a methodology and toolchain that aims to fill that gap, facilitating rapid iterations of algorithm/hardware co-design for large-scale DL training.

III The \TheNameMethodology

Refer to caption
Figure 2: \TheNamemethodology overview.

We design the \TheNamemethodology to holistically evaluate a large-scale model’s training strategy on an arbitrary cluster and assess the best combination of training strategy and cluster resource balance to maximize training performance and/or cost efficiency. Fig. 2 shows a high-level overview of our methodology and this section details each of its steps.

III-A Workload Modeling

The first step involves decomposing the model of interest into its layers, along with the number of operations and data movement requirements in each layer. We express each layer as a general matrix-matrix multiplication (GEMM) between input activations (M×K𝑀𝐾M\times Kitalic_M × italic_K) and weights (K×N𝐾𝑁K\times Nitalic_K × italic_N), producing an output matrix (M×N𝑀𝑁M\times Nitalic_M × italic_N). After decomposing the model into layers, we compute the number of parameters for the operand matrices of each layer. The total size of a model, in terms of number of parameters, is given by the sum of the weight matrices’ (i.e., K×N𝐾𝑁K\times Nitalic_K × italic_N) elements [73]. Layers that cannot be encoded as GEMMs (e.g., embedding-lookups) are represented by their input/output operand sizes, total number of operations and data moved between memory and the compute unit.

Refer to caption
Figure 3: Variation of per-node memory capacity requirements as a function of MP and DP degrees in a fixed-size cluster.

III-B Training Strategy Configuration

In this step, we determine the parallelization strategy for the selected model based on the model type and per-node memory capacity. Different strategies focus on optimizing training for different metrics such as throughput, compute utilization, inter-node communication, etc. The current version of \TheNamefocuses on Data and Model parallelism (MP and DP—see § II-A). Given a target cluster size, we compute the per-node memory capacity requirement as a function of the selected degree of MP and DP in the cluster. We compute the per-node memory footprint to hold all the data (model, input/output matrices) required for the distributed DL training task. The model’s memory footprint is dictated by model states and activations. We compute the memory footprint required for each operand matrix based on its type (i.e., model weights or input/output activations), size of parameters, and the memory optimizations (e.g., ZeRO-DP, as explained later in Section § IV). For a cluster of size N, we start with the initial condition where all the nodes are within the MP dimension (i.e., MP=N,DP=1formulae-sequence𝑀𝑃𝑁𝐷𝑃1MP=N,DP=1italic_M italic_P = italic_N , italic_D italic_P = 1) and, collectively, hold a single copy of the entire model. Then we sweep the (MP, DP) degree to the other extreme (i.e., MP=1,DP=Nformulae-sequence𝑀𝑃1𝐷𝑃𝑁MP=1,DP=Nitalic_M italic_P = 1 , italic_D italic_P = italic_N), considering all power-of-two combinations, with MP×DP=N𝑀𝑃𝐷𝑃𝑁MP\times DP=Nitalic_M italic_P × italic_D italic_P = italic_N.

Doubling the DP dimension implies halving the MP dimension, so half the number of nodes must hold an entire copy of the model. Consequently, the per-node memory capacity requirement doubles. To illustrate, Fig. 3 shows a cluster configured as two data-parallel node groups (DP=2𝐷𝑃2DP=2italic_D italic_P = 2) and each DP node group consisting of m𝑚mitalic_m nodes performing m𝑚mitalic_m-way model parallelism (MP=m𝑀𝑃𝑚MP=mitalic_M italic_P = italic_m), for a total node count N=2m𝑁2𝑚N=2mitalic_N = 2 italic_m. To contain a copy of the entire model of size C across each DP group’s m𝑚mitalic_m nodes, each node requires a minimum memory capacity P=Cm𝑃𝐶𝑚P=\frac{C}{m}italic_P = divide start_ARG italic_C end_ARG start_ARG italic_m end_ARG. Moving from a DP=2,MP=mformulae-sequence𝐷𝑃2𝑀𝑃𝑚DP=2,MP=mitalic_D italic_P = 2 , italic_M italic_P = italic_m to a DP=4,MP=m2formulae-sequence𝐷𝑃4𝑀𝑃𝑚2DP=4,MP=\frac{m}{2}italic_D italic_P = 4 , italic_M italic_P = divide start_ARG italic_m end_ARG start_ARG 2 end_ARG configuration doubles the per-node memory capacity requirement to 2P2𝑃2P2 italic_P, as each DP group of m2𝑚2\frac{m}{2}divide start_ARG italic_m end_ARG start_ARG 2 end_ARG nodes must now hold the entire model C𝐶Citalic_C. As a result, the cluster’s total memory capacity also doubles from 2mP2𝑚𝑃2mP2 italic_m italic_P to 4mP4𝑚𝑃4mP4 italic_m italic_P.

The chosen (MP, DP) degree not only affects the memory requirement per node, but also results in different computation requirements and communication behavior. Therefore, for each (MP, DP) combination, we also derive the volume of per-node computation and inter-node communication.

III-C Distributed Training Time Estimation

The computation and communication requirements per layer for each (MP, DP) combination are fed into a performance model that estimates the model’s training time. The performance model estimates end-to-end training time as a function of the modeled cluster’s per-node compute capability, memory capacity and bandwidth, network bandwidth and topology.

A key design decision in \TheNameis opting for generality and breadth rather than detailed modeling of individual components, as our methodology is intended to enable rapid and scalable exploration of a vast design space, rather than highly accurate performance estimations. \TheName’s goal is to allow gleaning performance trends as cluster parameters are varied both jointly and separately. Therefore, we chose to go with detailed analytical models for compute, memory, and network rather than be tied to any specific technology or component instance. Performance estimation for DL training models lends itself well to analytical modeling (as opposed to cycle-level simulation), as their computation, memory access, and communication patterns exhibit regularity typically absent from general workloads [5, 39]. By swee** generic characteristics like TOPS, bandwidth, latency, \TheName’s users may easily create proxies for specific components or technologies of interest, such as GPUs with different computational capabilities, memories with different capacity/bandwidth characteristics (e.g., HBM vs. DDR), or networks with different bandwidth/latency characteristics (e.g., InfiniBand vs. NVLink). Next, we describe how individual performance models are constructed and tied together in \TheName.

III-C1 Compute delay estimation

To produce compute-delay estimations independent of any compute node’s microarchitectural details, we employ a roofline model [49, 77]. In each case, the compute node of interest is represented by its peak performance (perfpeak𝑝𝑒𝑟subscript𝑓𝑝𝑒𝑎𝑘{perf}_{peak}italic_p italic_e italic_r italic_f start_POSTSUBSCRIPT italic_p italic_e italic_a italic_k end_POSTSUBSCRIPT in GFLOPS𝐺𝐹𝐿𝑂𝑃𝑆GFLOPSitalic_G italic_F italic_L italic_O italic_P italic_S) and memory bandwidth.

Fig. 4 shows a roofline model, which consists of a compute-bound region (under the flat line) and a memory-bound region (under the slanted line). The flat line is dictated by the target node’s perfpeak𝑝𝑒𝑟subscript𝑓𝑝𝑒𝑎𝑘{perf}_{peak}italic_p italic_e italic_r italic_f start_POSTSUBSCRIPT italic_p italic_e italic_a italic_k end_POSTSUBSCRIPT. The slanted line’s slope denotes the compute node’s available memory bandwidth BWmem𝐵subscript𝑊𝑚𝑒𝑚BW_{mem}italic_B italic_W start_POSTSUBSCRIPT italic_m italic_e italic_m end_POSTSUBSCRIPT(GB/s𝐺𝐵𝑠GB/sitalic_G italic_B / italic_s). In \TheName’s design space exploration, a change in the available memory bandwidth changes the roofline’s slope and, correspondingly, the intersection point with the perfpeak𝑝𝑒𝑟subscript𝑓𝑝𝑒𝑎𝑘{perf}_{peak}italic_p italic_e italic_r italic_f start_POSTSUBSCRIPT italic_p italic_e italic_a italic_k end_POSTSUBSCRIPT horizontal line shifts.

For each training phase, a workload layer’s operational intensity (OI𝑂𝐼OIitalic_O italic_I) is calculated as

OI(FLOPs/byte)=# of floating-point operationsmemory traffic𝑂𝐼𝐹𝐿𝑂𝑃𝑠𝑏𝑦𝑡𝑒# of floating-point operationsmemory trafficOI\,(FLOPs/byte)=\frac{\textit{\# of floating-point operations}}{\textit{% memory traffic}}italic_O italic_I ( italic_F italic_L italic_O italic_P italic_s / italic_b italic_y italic_t italic_e ) = divide start_ARG # of floating-point operations end_ARG start_ARG memory traffic end_ARG (1)

where memory traffic is the total number of bytes moved between the memory and processor for operand and result matrices of the layer. § III-C2 elaborates on our estimation of memory traffic generated by each layer. Based on the target compute node’s characteristics, a workload layer’s OI may place it under the roofline’s compute-bound or memory-bound region. The maximum compute performance achieved for a layer i𝑖iitalic_i with OIi𝑂subscript𝐼𝑖OI_{i}italic_O italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is 𝑝𝑒𝑟𝑓max(GFLOPS)=min{perfpeak,OIi×BWmem}subscript𝑝𝑒𝑟𝑓𝑚𝑎𝑥𝐺𝐹𝐿𝑂𝑃𝑆𝑚𝑖𝑛𝑝𝑒𝑟subscript𝑓𝑝𝑒𝑎𝑘𝑂subscript𝐼𝑖𝐵subscript𝑊𝑚𝑒𝑚\textit{perf}_{max}\,(GFLOPS)=min\left\{{perf}_{peak},\,OI_{i}\times BW_{mem}\right\}perf start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT ( italic_G italic_F italic_L italic_O italic_P italic_S ) = italic_m italic_i italic_n { italic_p italic_e italic_r italic_f start_POSTSUBSCRIPT italic_p italic_e italic_a italic_k end_POSTSUBSCRIPT , italic_O italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT × italic_B italic_W start_POSTSUBSCRIPT italic_m italic_e italic_m end_POSTSUBSCRIPT }.

Refer to caption
Figure 4: Roofline model. Attainable performance shifts for the same OI, depending on available memory bandwidth.

Using perfpeak𝑝𝑒𝑟subscript𝑓𝑝𝑒𝑎𝑘{perf}_{peak}italic_p italic_e italic_r italic_f start_POSTSUBSCRIPT italic_p italic_e italic_a italic_k end_POSTSUBSCRIPT and BWmem𝐵subscript𝑊𝑚𝑒𝑚BW_{mem}italic_B italic_W start_POSTSUBSCRIPT italic_m italic_e italic_m end_POSTSUBSCRIPT as knobs in our methodology, we estimate their impact on the maximum attainable performance 𝑝𝑒𝑟𝑓maxsubscript𝑝𝑒𝑟𝑓𝑚𝑎𝑥\textit{perf}_{max}perf start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT for each workload layer, and thereby the compute delay for each layer i𝑖iitalic_i in each training phase as

computedelayi(s)=# of floating-point operations𝑝𝑒𝑟𝑓max𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑑𝑒𝑙𝑎subscript𝑦𝑖𝑠# of floating-point operationssubscript𝑝𝑒𝑟𝑓𝑚𝑎𝑥compute\ delay_{i}\,(s)=\frac{\textit{\# of floating-point operations}}{% \textit{perf}_{max}}italic_c italic_o italic_m italic_p italic_u italic_t italic_e italic_d italic_e italic_l italic_a italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s ) = divide start_ARG # of floating-point operations end_ARG start_ARG perf start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT end_ARG (2)

The total compute delay for one training iteration is the sum of compute delays for the forward pass, backward pass, and weight update for each of the model’s layers.

The roofline model makes our workflow fast and versatile, and while not enough to estimate absolute performance, it captures performance trends. Despite having limitations, as a first-order model, the roofline model has still been used widely in a broad range of prior work [13, 22, 35, 41, 49, 74, 76, 77, 80, 6, 81] due to its versatility and utility in quickly and correctly highlighting general performance trends, given the general compute and memory capabilities of a compute node. \TheNameas a general methodology is not limited to roofline and its modular design allows plugging in more detailed compute models, such as runtimes captured from real GPUs or simulated accelerators modeled in appropriate tools, like GPGPU-Sim [30], or ScaleSim [63]. However, in this paper we use the more general roofline model to focus on the methodology’s utility and decouple the results of the conducted case studies from any specific compute unit’s microarchitectural characteristics.

III-C2 Memory traffic estimation

Memory traffic is the cumulative number of bytes transferred between the main memory and compute unit while performing the desired functionality. For a hypothetical compute node with infinite on-chip buffer space, all operands can be fetched exactly once from the memory, resulting in a very high OI (cf. Eqn. 1). However, every realistic compute unit’s limited on-chip buffer space can only hold a limited set of data operands. Therefore, a layer’s matrix operands must usually be fetched multiple times from the memory to complete the required operations, thereby lowering the resulting OI. We construct a linear model to better estimate the memory traffic for a GEMM operation on a compute node with an on-chip buffer of configurable size.

Consider a GEMM operation between two matrices of U𝑈Uitalic_U and V𝑉Vitalic_V bytes, generating an output matrix of W𝑊Witalic_W bytes. We assume one of the input operands is tiled to fit in the on-chip buffer, and the other operand/output are streamed in/out of the compute node, respectively. For an on-chip buffer size of S𝑆Sitalic_S bytes, we estimate the memory traffic (in bytes) as min{Ψ1,Ψ2}+W𝑚𝑖𝑛subscriptΨ1subscriptΨ2𝑊min\left\{\Psi_{1},\Psi_{2}\right\}+Witalic_m italic_i italic_n { roman_Ψ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , roman_Ψ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT } + italic_W, where Ψ1=U/S×V+UsubscriptΨ1𝑈𝑆𝑉𝑈\Psi_{1}=\lceil U/S\rceil\times V+Uroman_Ψ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = ⌈ italic_U / italic_S ⌉ × italic_V + italic_U and Ψ2=V/S×U+VsubscriptΨ2𝑉𝑆𝑈𝑉\Psi_{2}=\lceil V/S\rceil\times U+Vroman_Ψ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = ⌈ italic_V / italic_S ⌉ × italic_U + italic_V. In practice, for U𝑈Uitalic_U and V>>Smuch-greater-than𝑉𝑆V>>Sitalic_V > > italic_S, tiling the smaller operand results in less data movement (e.g., if U<V𝑈𝑉U<Vitalic_U < italic_V, Ψ1subscriptΨ1\Psi_{1}roman_Ψ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is the tiling method of choice, resulting in about VU𝑉𝑈V-Uitalic_V - italic_U less data movement).

An additional important architectural design knob we want to investigate with \TheNameis that of memory expansion, whereby the compute unit’s local memory (LM) is enhanced with a secondary level of memory, which we refer to as expanded memory (EM). Such a setting can be enabled by allowing the compute unit to directly access its host CPU’s memory, or by physically attaching additional memory over CXL [12], photonic links, or other (current or future) technology. Investigating such an option using CXL-attached memory (which offers considerably higher bandwidth per pin than DDR) for DL training is of particularly high relevance given the growing interest in deploying it as a new memory hierarchy component [18, 36, 40].

To investigate this system design option, we consider per-node memory expansion, with the available bandwidth as our sensitivity analysis knob. To model performance with such a hybrid memory system, we instrument our roofline model with the new memory system’s effective memory bandwidth (bwhybrid𝑏subscript𝑤𝑦𝑏𝑟𝑖𝑑{bw}_{hybrid}italic_b italic_w start_POSTSUBSCRIPT italic_h italic_y italic_b italic_r italic_i italic_d end_POSTSUBSCRIPT), which depends on the fraction of data accessed from local/expanded memory (dataLM/dataEM𝑑𝑎𝑡subscript𝑎𝐿𝑀𝑑𝑎𝑡subscript𝑎𝐸𝑀data_{LM}/data_{EM}italic_d italic_a italic_t italic_a start_POSTSUBSCRIPT italic_L italic_M end_POSTSUBSCRIPT / italic_d italic_a italic_t italic_a start_POSTSUBSCRIPT italic_E italic_M end_POSTSUBSCRIPT) at the local/expanded memory’s bandwidth (bwLM/bwEM𝑏subscript𝑤𝐿𝑀𝑏subscript𝑤𝐸𝑀bw_{LM}/bw_{EM}italic_b italic_w start_POSTSUBSCRIPT italic_L italic_M end_POSTSUBSCRIPT / italic_b italic_w start_POSTSUBSCRIPT italic_E italic_M end_POSTSUBSCRIPT). We estimate the hybrid memory system’s effective bandwidth as:

bwhybrid=total_data_accesseddatalocalbwLM+dataEMbwEM𝑏subscript𝑤𝑦𝑏𝑟𝑖𝑑𝑡𝑜𝑡𝑎𝑙_𝑑𝑎𝑡𝑎_𝑎𝑐𝑐𝑒𝑠𝑠𝑒𝑑𝑑𝑎𝑡subscript𝑎𝑙𝑜𝑐𝑎𝑙𝑏subscript𝑤𝐿𝑀𝑑𝑎𝑡subscript𝑎𝐸𝑀𝑏subscript𝑤𝐸𝑀bw_{hybrid}=\frac{total\_data\_accessed}{\frac{data_{local}}{bw_{LM}}\,+\,% \frac{data_{EM}}{bw_{EM}}}\vspace{-0.5em}italic_b italic_w start_POSTSUBSCRIPT italic_h italic_y italic_b italic_r italic_i italic_d end_POSTSUBSCRIPT = divide start_ARG italic_t italic_o italic_t italic_a italic_l _ italic_d italic_a italic_t italic_a _ italic_a italic_c italic_c italic_e italic_s italic_s italic_e italic_d end_ARG start_ARG divide start_ARG italic_d italic_a italic_t italic_a start_POSTSUBSCRIPT italic_l italic_o italic_c italic_a italic_l end_POSTSUBSCRIPT end_ARG start_ARG italic_b italic_w start_POSTSUBSCRIPT italic_L italic_M end_POSTSUBSCRIPT end_ARG + divide start_ARG italic_d italic_a italic_t italic_a start_POSTSUBSCRIPT italic_E italic_M end_POSTSUBSCRIPT end_ARG start_ARG italic_b italic_w start_POSTSUBSCRIPT italic_E italic_M end_POSTSUBSCRIPT end_ARG end_ARG (3)

To illustrate, accessing 240GB of data in a hybrid memory system with 80GB of LM, bwLM=2TB/s𝑏subscript𝑤𝐿𝑀2𝑇𝐵𝑠bw_{LM}=2TB/sitalic_b italic_w start_POSTSUBSCRIPT italic_L italic_M end_POSTSUBSCRIPT = 2 italic_T italic_B / italic_s, and bwEM=1TB/s𝑏subscript𝑤𝐸𝑀1𝑇𝐵𝑠bw_{EM}=1TB/sitalic_b italic_w start_POSTSUBSCRIPT italic_E italic_M end_POSTSUBSCRIPT = 1 italic_T italic_B / italic_s results in bwhybrid=1.2TB/s𝑏subscript𝑤𝑦𝑏𝑟𝑖𝑑1.2𝑇𝐵𝑠bw_{hybrid}=1.2TB/sitalic_b italic_w start_POSTSUBSCRIPT italic_h italic_y italic_b italic_r italic_i italic_d end_POSTSUBSCRIPT = 1.2 italic_T italic_B / italic_s. Using Eqn. 3, we can determine the cluster’s performance as a function of the bandwidth offered by the hypothetical memory expansion technique used.

III-C3 Communication delay estimation

During the training process, nodes continuously exchange data. Their communication delay is dictated by the total communication volume, the aggregate network bandwidth available between the nodes, and the dynamic network utilization. In addition, depending on the training phase, the communication among the nodes may be blocking or non-blocking. In the forward-pass and input-gradient phases, communication is blocking along the model-parallel dimension; in the weight-gradient phase, communication is non-blocking across the data-parallel dimension. Blocking communication falls on the critical path of a training phase, while non-blocking communication can be (partially) overlapped with compute, thus ameliorating its impact on resulting training time. The combination of data movement volume, available network bandwidth, communication type, and concurrently performed computation, dictates how much of the occurring communication is exposed, affecting training time. Ultimately, for each layer, the total exposed communication delay determines whether the layer is compute- or communication-bound on a given system configuration.

III-C4 Total training time estimation

Finally, we combine the per-layer compute delays (§ III-C1) with each layer’s corresponding communication characteristics (data volume and communication type—§ III-C3) to determine the degree of computation/communication overlap and derive the training time per iteration and, by extension, total training time.

Overall, \TheNamecomprises an iterative training time estimation process as shown in Fig. 2. For each model, different training strategies are considered (cf. § III-B) and the training time is estimated as described in § III-C. The process is repeated for different cluster sizes, and compute, network, and memory parameters to guide the user’s selection of parallelization strategy and cluster resource provisioning to optimize for the target metric of merit—raw training performance, or training efficiency (i.e., training time relative to resources deployed).

IV \TheNameImplementation

Fig. 5 shows the implementation of our toolchain implementing the \TheNamemethodology. The toolchain’s frontend, in steps 1⃝ and 2⃝, generates parameters for the per-layer computation and communication requirements. These parameters are then fed into the toolchain’s backend, which models the resulting performance of the training task, as a function of the target cluster’s configuration (steps 3⃝ and 4⃝).

IV-A DL Model Analysis

In step 1⃝, we analyze the DL model of interest and break it down into its layers. Each layer is represented as a sequence of GEMMs of input activations, model parameters and resulting output activation matrices. The size of each matrix dimension M, K, N is derived from the model’s hyper-parameters and batch size. Depending on the model type, a set of fixed independent hyper-parameters form the model’s signature. For example, in case of Transformers, hidden-dimension, # layer-stacks, and # attention-heads characterize the model size. Then, based on the derived GEMM dimensions, we compute the total number of model and activation parameters required for each layer. In addition, the size of each operand matrix (i.e., input activations, model parameters, and output matrices) per node is affected by the MP/DP degree.

Refer to caption
Figure 5: \TheNameimplementation and workflow.

IV-B Parallelization Strategy

In step 2⃝, we select a parallelization strategy for the given workload, generate the corresponding workload input file and feed it to the performance simulator to estimate the distributed training time. In addition, the workload’s required memory footprint is computed by aggregating model parameters, optimizer states, gradients, residual states, and checkpoint-activations. The workload input file must describe the characteristics of each layer of the workload, which includes the number of floating-point operations, data volume (in bytes) moved between memory and compute, communication collective, and communication volume (in bytes). As per ZeRO-Infinity, we compute the total number of operations and matrix operands required in both forward (FP) and backward (i.e., input gradient (IG) and weight gradient (WG)) training phases.

As described in § III-C2, we estimate the bytes transferred between the processor and main memory for each layer in each training phase. Similarly, we estimate the total number of operations and matrix operands required for each layer in each training phase. Based on the parallelization scheme used for the workload, we also determine the communication collective and compute the communication volume required per layer in each training phase, as well as the required per-node memory footprint to fit the model states and working dataset.

We use ZeRO-DP (os+g) [53], a.k.a. ZeRO-2, to derive the per-node memory footprint of model-states. ZeRO-2 avoids replication of optimizer states and gradients on each node and distributes them across the data-parallel dimension to reduce the per-node memory footprint, while avoiding additional communication overhead. For residual states, we estimate the memory footprint as # activation-parameters×2 bytes# activation-parameters2 bytes\textit{\# activation-parameters}\times\textit{2 bytes}# activation-parameters × 2 bytes assuming fp16 activation parameters. We exclude the memory required for checkpoint activations in our per-node memory footprint estimate. Typically, in large models such as Transformer-1T, the memory footprint required to store the checkpoint activations is significantly larger than the intermediate activations and hence are offloaded to host memory. Therefore, during the training process, we only consider the Activation Working Memory [54], which is the memory required to hold the intermediate activations between two consecutive checkpoints.

Refer to caption
Figure 6: Per-node memory footprint for Transformer-1T model with baseline and different ZeRO-DP stages.

As explained in § III-B, the parallelization strategy affects the required memory footprint per node. To illustrate with a concrete use case, Fig. 6 shows how the per-node memory footprint requirement changes for a Transformer-1T model on a fixed cluster size of 1024 nodes, as a function of different stages of ZeRO and a decreasing MP degree (the invariant being DP×MP=1024𝐷𝑃𝑀𝑃1024DP\times MP=1024italic_D italic_P × italic_M italic_P = 1024). In baseline (i.e., no ZeRO optimizations), the model footprint per node increases exponentially as the MP degree reduces. The same trend holds even with memory optimizations like ZeRO-2: although the growth is slower, the model footprint per node eventually exceeds the typical memory capacity of a single device, highlighting the value of MP to enable in-memory training of huge models. Among all the ZeRO-DP optimizations, ZeRO-3 stands out as it provides the lowest memory footprint per node and remains unaffected by MP reduction. However, ZeRO-3 incurs a 1.5×1.5\times1.5 × communication overhead compared to the baseline. Other approaches such as ZeRO-Offload and ZeRO-Infinity offload data and compute to the host machine resources to reduce the memory capacity pressure on accelerator nodes.

IV-C Total Training Time Estimation

As shown in Figure 5 step 3⃝, \TheNameplugs into a cost model for training time estimation. We use the ASTRA-sim simulator [57, 79] for this purpose. ASTRA-sim is a discrete event-based simulator developed by Meta, Intel and Georgia Tech that can simulate distributed training for a variety of DL workloads. It accepts a workload configuration, topology description and system parameter file to simulate the distributed DL training on a target cluster, and outputs the end-to-end training time breakdown and resource utilization.

At a high level, ASTRA-sim consists of a workload, system, and network layer. The workload layer is responsible for instantiating a model and scheduling training loops for simulation. The system layer provides the mechanisms for collective primitives and scheduling of communication tasks, similar to collective communication libraries like NCCL [47]. The network layer provides the topology interface via network APIs to support a fast analytical model [79] or detailed network simulation using Garnet [2] and NS3 [62].

An ASTRA-sim workload configuration file describes the DL workload to be simulated, and consists of the compute time, communication collective, and the collective size for each layer of the workload. For each layer, ASTRA-sim schedules the communication collectives and overlaps the communication delay with compute delay to estimate the total training time. ASTRA-sim’s analytical network backend—used in our current \TheNameimplementation—estimates the communication delay of an event based on the topology and network bandwidths specified in the configuration files. ASTRA-sim supports multiple collective communication primitives and scheduling algorithms. The symmetric network topology of distributed training platforms and topology-aware collective communication algorithms minimize the network congestion, enabling the analytical network backend to accurately model the communication overhead [31, 55, 58, 79]. ASTRA-sim’s runtime projections for 8–16 node clusters has been validated against real systems to be within 5% difference [79].

For \TheName, we chose ASTRA-sim as the cost model for training time estimation given its modular architecture for plug-and-play compute and communication models, and implementations of diverse collective communication algorithms and scheduling strategies. We integrated our roofline and data movement models (§ III-C1 and III-C2) in ASTRA-sim to enable modeling a range of compute units with hybrid memories (LM and EM). Using our added models and the data provided in the workload input file (operations and data size per layer), the compute delay per layer is estimated. ASTRA-sim uses the compute delay, communication collectives and volume, the network topology, and system parameters provided to perform a training simulation and generates an end-to-end per-layer training-time breakdown for each training phase. It reports the compute and exposed (i.e., non-overlapped) communication times for each layer in the FP, IG, and WG phases.

For our design space exploration of parallelization strategies on a cluster of size N𝑁Nitalic_N, we sweep the degree of MP and DP such that always MP×DP=N𝑀𝑃𝐷𝑃𝑁MP\times DP=Nitalic_M italic_P × italic_D italic_P = italic_N. This emulates the effective increase in per-node memory capacity as MP decreases in favor of DP, as demonstrated in Fig. 6, and generates the corresponding workload input file for each (MP, DP) combination.

IV-D Cluster Parameter Reconfiguration

Finally, step 4⃝ provides a set of knobs to perform sensitivity analysis by varying the key component parameters, which include the network topology, compute capability, as well as memory and network bandwidth and latency.

IV-E Iterative Modeling for Design Space Exploration

By iterating through steps 2⃝ to 4⃝ (§ IV-B§ IV-D), we obtain the resulting training time for different system configurations and hardware parameters to identify the best combination of parallelization strategy and cluster resources. The target optimization metric may be raw performance or cost efficiency (i.e., performance relative to cluster’s provisioned resources).

V Case Studies

We now leverage \TheNameto evaluate cluster design decisions across multiple dimensions (cluster size and network, per-node memory and compute capability) in the context of large-scale Transformer and DLRM training. We first describe the models and the cluster we model as our baseline system (§ V-A). We then evaluate the impact of different cluster configuration parameters using Transformer (§ V-B) and DLRM (§ V-C) models. § V-D employs \TheNameto compare a range of different clusters on a range of models. Finally, § V-E concludes with a summarizing overview of \TheName’s versatility and quantification of the tool’s key strength of speed, enabling rapid design space explorations for distributed training tasks on large clusters.

Refer to caption
Figure 7: Cluster of 1024 A100 GPUs with expanded memories, grouped in 128 8-GPU pods. Link bandwidth is per direction.
TABLE I: Baseline NVIDIA DGX A100 system parameters.
Single-node Parameters (NVIDIA A100 GPU)
Peak Performance (𝑝𝑒𝑟𝑓peaksubscript𝑝𝑒𝑟𝑓𝑝𝑒𝑎𝑘\textit{perf}_{peak}perf start_POSTSUBSCRIPT italic_p italic_e italic_a italic_k end_POSTSUBSCRIPT) 624 TFLOPS (fp16)
Local Memory Capacity / Bandwidth 80 GB / 2039 GB/s
On-chip SRAM size 40 MB
Cluster Parameters
Compute Pod NVIDIA A100 DGX (8-GPU)
Cluster Size 1024 nodes (128 pods ×\times× 8 GPUs)
Intra-pod Network BW per GPU 300 GB/s / direction (NVLink Gen-3)
Inter-pod Network BW per GPU 31.25 GB/s / direction (InfiniBand)
Physical Topology Hierarchical Switch
Collectives Implementation Logical Ring

V-A Baseline Evaluation Setup and Workloads

Fig. 7 visualizes our baseline 1k-node DGX A100 [48] cluster with expanded memories and Table I summarizes the parameters used to model it in ASTRA-sim. We evaluate training performance for Transformer and DLRM models, which represent the largest models currently deployed. Our evaluation predominantly focuses on the Transformer model due to its higher complexity and broader range of (MP, DP) training strategies (§ V-B). Due to space constraints, we present a small subset of DLRM evaluation results in § V-C.

V-A1 Transformer Model

Transformer-based language models are huge natural language processing models used extensively in language modeling, machine translation, text summarization, AI chatbots (e.g., ChatGPT), etc. Latest Transformer models comprise up to trillion parameters and must be trained over hundreds of high-end processing nodes in a distributed manner using multiple levels of parallelization [43, 66], despite advanced techniques employed reduce their required memory footprint [16, 53, 54].

A Transformer model comprises a stack of multiple encoder and decoder structures, each composed of multi-head attention layers followed by fully connected feed-forward and residual layers [73]. Each of the encoder and decoder stack takes input and output embeddings as inputs which map a sequence of symbols into a continuous representation. We model the Transformer-1T architecture and hybrid model & data parallelism approach as described in Megatron-LM [66]. Table II breaks down the Transformer-1T model into its layers, and summarizes each layer’s type and dimensions.

TABLE II: Transformer model layers and dimensions.
Layer Type #Stacks GEMM Dimensions
M K N
Input Embedding Table Look-up 1 b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q sub_vocab𝑠𝑢𝑏_𝑣𝑜𝑐𝑎𝑏sub\_vocabitalic_s italic_u italic_b _ italic_v italic_o italic_c italic_a italic_b dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT
Layer Norm Element-Wise Mult N b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q 1 dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT
Query Projection (Qisubscript𝑄𝑖Q_{i}italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) GEMM N b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT h×dksubscript𝑑𝑘h\times d_{k}italic_h × italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT
Key Projection (Kisubscript𝐾𝑖K_{i}italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) GEMM N b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT h×dksubscript𝑑𝑘h\times d_{k}italic_h × italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT
Value Projection (Visubscript𝑉𝑖V_{i}italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) GEMM N b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT h×dvsubscript𝑑𝑣h\times d_{v}italic_h × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT
Ui=softmax(QiKiT/dk)subscript𝑈𝑖𝑠𝑜𝑓𝑡𝑚𝑎𝑥subscript𝑄𝑖subscriptsuperscript𝐾𝑇𝑖subscript𝑑𝑘U_{i}=softmax(Q_{i}K^{T}_{i}/\sqrt{d_{k}})italic_U start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_s italic_o italic_f italic_t italic_m italic_a italic_x ( italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / square-root start_ARG italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ) GEMM N b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q h×dksubscript𝑑𝑘h\times d_{k}italic_h × italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q
Yi=UiVisubscript𝑌𝑖subscript𝑈𝑖subscript𝑉𝑖Y_{i}=U_{i}V_{i}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_U start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT GEMM N b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q h×dvsubscript𝑑𝑣h\times d_{v}italic_h × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT
concat(Zi=YiBi,..,ZnZ_{i}=Y_{i}B_{i},..,Z_{n}italic_Z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_B start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , . . , italic_Z start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT) GEMM N b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q h×dvsubscript𝑑𝑣h\times d_{v}italic_h × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT
Residual Addition Element-Wise Add N b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q 1 dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT
Layer Norm Element-Wise Mult N b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q 1 dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT
Yi=GeLU(XAi+k)subscript𝑌𝑖𝐺𝑒𝐿𝑈𝑋subscript𝐴𝑖𝑘Y_{i}=GeLU(XA_{i}+k)italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_G italic_e italic_L italic_U ( italic_X italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_k ) GEMM N b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT sub_𝑓𝑓𝑠𝑢𝑏_𝑓𝑓sub\_{\textit{ff}}italic_s italic_u italic_b _ ff
Zi=YiBi+ksubscript𝑍𝑖subscript𝑌𝑖subscript𝐵𝑖𝑘Z_{i}=Y_{i}B_{i}+kitalic_Z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_B start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_k GEMM N b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q sub_𝑓𝑓𝑠𝑢𝑏_𝑓𝑓sub\_{\textit{ff}}italic_s italic_u italic_b _ ff h×dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙h\times d_{model}italic_h × italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT
Residual Addition Element-Wise Add N b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q 1 h×dvsubscript𝑑𝑣h\times d_{v}italic_h × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT
Output Embedding Table update 1 b×seq𝑏𝑠𝑒𝑞b\times seqitalic_b × italic_s italic_e italic_q dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT sub_vocab𝑠𝑢𝑏_𝑣𝑜𝑐𝑎𝑏sub\_vocabitalic_s italic_u italic_b _ italic_v italic_o italic_c italic_a italic_b
Legend - dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT: hidden dimension, hhitalic_h: # of attention heads, b𝑏bitalic_b: mini-batch size, seq𝑠𝑒𝑞seqitalic_s italic_e italic_q: sequence length,
dksubscript𝑑𝑘d_{k}italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT/dvsubscript𝑑𝑣d_{v}italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT: key/value tensor dimension per attention head, sub𝑓𝑓𝑠𝑢subscript𝑏𝑓𝑓sub_{\textit{ff}}italic_s italic_u italic_b start_POSTSUBSCRIPT ff end_POSTSUBSCRIPT: portion of MLP layer per MP node,
sub_vocab𝑠𝑢𝑏_𝑣𝑜𝑐𝑎𝑏sub\_vocabitalic_s italic_u italic_b _ italic_v italic_o italic_c italic_a italic_b: chunk of total vocabulary per MP node
Refer to caption
(a) Breakdown of comp./comm. time & per-node memory footprint.
Refer to caption
(b) Exposed communication to compute ratio.
Figure 8: Training runtime of Transformer-1T with varying MP/DP degree (norm. to best-performing MP8_DP128 config.).

V-A2 DLRM

DLRMs are among the largest deep learning models used widely for generating personalized content [11, 20, 45] (e.g., movie recommendations [17, 32], e-commerce catalogs [67, 75, 86]), comprising up to trillions of parameters. DLRM model contains large embedding tables that capture the latent space of user and product feature interaction. In each DLRM, inputs are classified into sparse and dense features which represent the categorical data and continuous features. While sparse features are used to look up the embedding tables, dense features are processed using a bottom stack of MLP layers. The result of embedding lookup and output of bottom MLP layers are combined and given as input to the top MLP layers which finally predict the probability of an event (such as click-rate) occurrence. We model the DLRM architecture and parallelization strategy as described by Rashidi et al. [56].

Unlike Transformer models that can be trained using a wide range of model versus data parallelism configuration points, the training structure for DLRMs is more rigid. DLRM follows a hybrid parallelization strategy where the bottom embedding layers are sharded across multiple nodes and performs an all-to-all communication during forward and backward propagation while the MLP layers are replicated on each node in a data parallel fashion and perform an all-reduce during backward propagation. DLRM therefore does not offer the same MP/DP configuration knob we sweep for Transformer models.

V-B Transformer Evaluation

A Transformer model can be trained via a wide range of parallelization strategies. We first evaluate the impact of different (MP, DP) parallelization strategies on that cluster’s performance and per-node memory requirements (§ V-B1). We then demonstrate the impact of individually scaling the cluster’s per-node memory system (§ V-B2), per-node compute capability (§ V-B3), and network (§ V-B4).

V-B1 Parallelization Strategy Impact on Performance and Memory Requirements

We begin by swee** (MP, DP) under the invariant MP×DP=1024𝑀𝑃𝐷𝑃1024MP\times DP=1024italic_M italic_P × italic_D italic_P = 1024. In this section, training time estimations ignore per-GPU memory capacity constraints, assuming infinite per-node memory capacity accessible at the baseline system’s peak memory bandwidth.

Fig. 8a shows the training time breakdown and corresponding per-node memory footprint for several (MP, DP) configurations, assuming a constant memory bandwidth of 2039GB/s, irrespective of capacity. The total training time is a combination of compute delays and exposed communication delays (cf. § III-C4). The compute and exposed communication time is broken down into three components to indicate the three main phases in a training iteration: forward pass (FP), input gradients (IG), and weight gradients (WG). As MP decreases in favor of more DP groups, the required memory footprint per node increases; therefore, fitting the model in our baseline GPU’s 80GB memory requires an MP degree of 64 or higher.

Memory capacity requirements aside, the best-performing configuration is MP8_DP128, where the exposed communication in FP and IG phases, and the compute delay exhibit their lowest values. Note that WG communication (WG_Exp_Comm) is fully overlapped by the WG compute (WG_Compute) in every configuration, hence not visible. The configurations left of MP8_DP128 (higher MP) are communication bound due to the exposed blocking communication patterns in FP and IG phases across the MP dimension. In contrast, for configurations right of MP8_DP128 (lower MP), the effective model footprint per node grows as the model is distributed across fewer nodes per DPU, and hence becomes more memory bound resulting in higher compute delays dominating runtime, while barely any communication is exposed.

Fig. 8b better highlights the changing balance between compute and exposed communication time for the same (MP, DP) range. Under high MP degrees (e.g., MP64_DP16), training time is dominated by exposed communication time. As MP decreases in favor of increasing DP, the fraction of runtime spent on communication becomes negligible from MP8 onwards. MP8_DP128 is the optimal configuration because it strikes the best balance, effectively overlap** communication delays with compute, without getting into a memory-bandwidth-bound region that causes drastic compute time increase.

V-B2 Effect of Memory System Design

Based on § V-B1’s results, the best-performing MP8_DP128 configuration requires similar-to\sim250GB of memory to fit the model, exceeding the 80GB the NVIDIA A100 GPU baseline’s per-node memory capacity by 3×3\times3 ×. The best-performing configuration achievable under the 80GB memory constraint is MP64_DP16. The required capacity for MP8_DP128 could be achieved with a hybrid memory system, complementing the GPU’s HBM with a secondary DRAM-based memory that offers additional capacity, albeit accessible at lower bandwidth. As mentioned in § III-C2, such additional capacity could be provided by allowing the GPU to access its host CPU’s memory, or by attaching additional memory to the GPU over CXL [12] or other technology.

Fig. 9 shows the performance results normalized to MP64_DP16, the best-performing configuration of in-memory distributed training that is feasible without memory expansion (cf. Fig. 8a). The heatmap’s x-axis shows the bandwidth to expanded memory, while the varying (MP, DP) degree on the y-axis is a proxy for the required capacity of that expanded memory (see memory requirements in Fig. 8a). MP64_DP16 and configurations with higher MP remain unaffected by the expanded memory’s bandwidth, as the dataset entirely fits in each node’s local memory, and configurations with MP higher than 256 are omitted, as they perform strictly worse.

Refer to caption
Figure 9: Effect of memory bandwidth availability to the expanded memory system. The check-mark indicates the baseline configuration to which all the other values are normalized.

The heatmap guides system architects in determining what memory expansion technology can be used to boost a cluster’s training performance, revealing the range of expanded memory characteristics that allow building a cluster with lower training time than the MP64_DP16 baseline. Conversely, the data can also be leveraged to derive the equally valuable information of memory technologies that would not be applicable.

We provide two illustrative examples derived from Fig. 9:

Ex. 1: The theoretically optimal MP8_DP128 configuration is achievable with a hybrid memory system offering at least 4.25×4.25\times4.25 × higher aggregate capacity than the baseline (cf. Fig. 8a), and outperforms the baseline if the expanded memory is accessible at a bandwidth of at least 500GB/s.

Ex. 2: A system architect considering CXL-attached memory can quickly deduce that, in order to benefit the given workload, the technology must be capable of delivering 500GB/s to 340GB of memory, at a minimum. In practice, that would require a memory device accessible over 32 lanes of CXL 3.0.

V-B3 Effect of Per-node Compute Capability

The tremendous demand for DL training drives rapid evolution of the hardware used in clusters deployed for such purpose. \TheNamecan be used to study the effect of per-node compute capability by scaling each node’s peak performance (perfmax𝑝𝑒𝑟subscript𝑓𝑚𝑎𝑥{perf}_{max}italic_p italic_e italic_r italic_f start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT) to model hardware of different generations.

Fig. 10 shows the effect of per-node compute scaling on the MP8_DP128 configuration, assuming a memory system of varying bwhybrid𝑏subscript𝑤𝑦𝑏𝑟𝑖𝑑bw_{hybrid}italic_b italic_w start_POSTSUBSCRIPT italic_h italic_y italic_b italic_r italic_i italic_d end_POSTSUBSCRIPT as a function of bwEM𝑏subscript𝑤𝐸𝑀bw_{EM}italic_b italic_w start_POSTSUBSCRIPT italic_E italic_M end_POSTSUBSCRIPT, and sufficient capacity to hold the model. At the highest bwEM𝑏subscript𝑤𝐸𝑀bw_{EM}italic_b italic_w start_POSTSUBSCRIPT italic_E italic_M end_POSTSUBSCRIPT of 2TB/s, halving compute capability—e.g., by replacing the baseline A100 with a lower-end GPU—increases the runtime by 50%percent5050\%50 %, while doubling it reduces the runtime by 25%. Scaling compute further has diminishing returns, as training becomes communication bound. For lower memory bandwidth availability, the impact of compute capability scaling further diminishes due to the additional bottleneck of memory bandwidth. System designers can employ such studies to predict the impact of a next-generation GPU on a cluster’s overall performance.

Refer to caption
Figure 10: Effect of node compute capability relative to baseline A100 GPU (MP8_DP128 configuration).

V-B4 Effect of Networking Capability

A cluster’s network bandwidth plays a critical role in the overall training time. Especially at higher MP configurations, training time is dominated by the exposed blocking communication patterns in forward and input gradient phases. In our modeled cluster, there is an intra-pod bandwidth to communicate with GPUs within a pod and a lower inter-pod bandwidth for communication across pods (see Table I). We use a Hierarchical Collective implementation [10, 58] in our evaluations, which first reduces data across GPUs of the same pod, followed by inter-pod reduction. Such local network bandwidth-aware collective optimization reduces the communication volume on the lower-bandwidth inter-pod links.

Refer to caption
(a) MP64_DP16
Refer to caption
(b) MP8_DP128
Figure 11: Effect of network capabilities on cluster-scale performance. The check-mark indicates the baseline configuration to which all the other values are normalized. 300/31.25 GB/s is currently a typical NVLink/InfiniBand configuration.

Fig. 11 shows the effect of both intra- and inter-pod network bandwidth scaling on training time for two different (MP, DP) configurations: the communication-bound MP64_DP16 and the compute-bound MP8_DP128. In each plot, we vary the two network bandwidths while kee** the compute and memory bandwidth constant to the baseline system’s values (Table I).

Unsurprisingly, MP64_DP16 is majorly affected by network bandwidth, as the MP dimension straddles several 8-node pods, which feature high intra-pod bandwidth. Halving intra-/inter-pod bandwidth results in a 48%/22% slowdown, respectively. On the contrary, doubling intra-/inter-pod bandwidth reduces runtime by 3%/18%, respectively, while doubling both reduces runtime by 27%. Evidently, the most effective network scaling scales bandwidth on both dimensions, while scaling only one dimension’s bandwidth provides reduced or marginal gains.

This effect is attributed to the map** of the workload on the underlying cluster. The performance-critical all-reduce communication in forward and backward propagation is bottlenecked by both intra- and inter-pod links. Thus, increasing only one dimension’s capability yields limited gains (as the other dimension remains a bottleneck), while boosting the capabilities of both dimensions has an amplificatory effect. In contrast, when MP8_DP128 training is feasible, the network’s role is much less critical. Reducing both intra- and inter-pod bandwidth to 50% only degrades performance by 11%, while even increasing both by 4×4\times4 × only boosts performance by 5%.

We conduct an additional experiment to focus on identifying the ideal balance of the bandwidth available at the two dimensions. Instead of varying the two values independently, as previously done in Fig. 11, Fig. 12 shows the performance trends when the available network bandwidth is re-distributed between intra-/inter-pod links, while its aggregate value always remains fixed. For the baseline MP64_DP16 configuration with an aggregate full duplex bandwidth of 331.25 GB/s per direction (300 GB/s intra-pod and 31.25GB/s inter-pod), we find an optimal bandwidth ratio of 1:6 between inter-/intra-pod network bandwidth (i.e., 284 GB/s intra-pod and 47.32 GB/s inter-pod). For lower/higher ratios, inter-/intra-pod network traffic becomes a bottleneck, respectively. In the case of compute-bound MP8_DP128, where the MP dimension is entirely contained within a pod, the performance-critical MP communication is entirely dictated by intra-pod bandwidth while the DP communication across the pods is affected by the inter-pod link bandwidth. Therefore, MP communication is not a bottleneck and performance is largely insensitive to the rebalanced bandwidth ratio. Performance starts drop** beyond 1:5 (i.e., with intra-pod bandwidth less than 276GB/s), as MP communication starts reappearing as a bottleneck. Overall, 1:6 inter-/intra-pod network bandwidth provisioning appears to be the ratio that best accommodates both training configurations, improving training time by up to 15%percent1515\%15 % compared to the default 1:9.6 ratio.

Refer to caption
Figure 12: Impact of relative inter-/intra-pod network bandwidth allocation on training runtime. The total bandwidth is kept constant at 331.25 GB/s across all the ratios. The highlighted 1:9.6 ratio represents the baseline cluster’s configuration.

System designers can use such analysis to determine the topology and interconnect technology to use (e.g., Ethernet, InfiniBand, NVLink), while balancing the resulting performance with the associated cost of the selected hardware resources.

V-C DLRM Evaluation

Refer to caption
(a) Breakdown of comp./comm. time & per-node memory footprint.
Refer to caption
(b) Effect of memory bandwidth availability to the extended memory system.
Figure 13: DLRM’s training performance normalized to 64 nodes with 2TB/s memory bandwidth.
TABLE III: Per-node details of various cluster configurations.
config compute memory network (topology : bandwidth per node)
node Peak TOPS local cap. (GB) local bw (GB/s) exp. cap. (GB) exp. bw (GB/s)
A0 V100 125 80*{}^{*}start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT 900 0 0

two-level switch :

A1 480 500 150 GB/s intra-pod, 6.25 GB/s inter-pod
A2 201 1000
B0 A100 625 80 2039 0 0

two-level switch :

B1 480 500 300 GB/s intra-pod, 31.25 GB/s inter-pod
B2 201 1000
C0 H100 1979 80 3350 0 0

two-level switch :

C1 480 500 450 GB/s intra-pod, 62.5 GB/s inter-pod
C2 201 1000
Dojo Tray 54,300 640 16000 0 0

one-level switch : 20×50GB/s2050𝐺𝐵𝑠20\times 50GB/s20 × 50 italic_G italic_B / italic_s per direction

TPU v4 TPU 275 32 1200 39 1200

3D torus : 6×48GB/s648𝐺𝐵𝑠6\times 48GB/s6 × 48 italic_G italic_B / italic_s per direction

*{}^{*}start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPTAlthough the V100 GPU features 32GB of memory, we model 80GB instead to keep the memory system configuration options of clusters A, B, and C aligned.

We now briefly cover evaluation highlights for the training of a 1.2 trillion parameter DLRM, modeled as described in Table V of Rashidi et al. [56]. Fig. 13a shows the training time breakdown for single DLRM instance and corresponding per-node memory footprint for different cluster sizes. Since the DLRM’s memory footprint is relatively smaller than § V-B’s Transformer-1T model, we start with a smaller cluster comprising only 8 pods of our baseline DGX cluster (64 GPUs in total). As the cluster’s size decreases, the exposed communication delay decreases at the cost of increased memory footprint per node required, resulting in a compute delay increase. However, the overall increase in training time is sublinear with the node count reduction, especially in the 64–16 range. Thus, memory expansion can not only be used to improve DLRM training efficiency, but also better performance for a training workload comprising several DLRM models. This is a common use case, as large corporations often need to train multiple DLRMs for different purposes [1, 42].

Fig. 13b evaluates the overall turnaround time of training 8 DLRMs on 64 GPU nodes as a function of available bandwidth to the expanded memory. While DLRM’s performance is more sensitive to memory bandwidth, results qualitatively match § V-B2’s takeaways. Performance improvement opportunities require memory expansion solutions delivering at least 75% additional memory capacity at 800GB/s; a 200GB expanded memory accessible at 1.5TB/s improves training time by 1.5×\times×.

V-D Comparative DL Training on Different Clusters

We conclude \TheName’s utility demonstration by comparing 11 different DL training clusters, summarized in Table III: nine GPU-based clusters, a Google TPU v4, and a Tesla Dojo cluster. The general structure of the three cluster types is depicted in Fig. 14. Our goal is not to declare the “best” system—as they drastically differ in cost, node count, definition of a “node”, etc.—but to demonstrate the different behavior across three very dissimilar clusters when training the same huge model.

Refer to caption
(a) Cluster of 1024 A100 GPUs, grouped in 128 8-GPU pods.
Refer to caption
(b) Cluster of 4096 TPU v4 chips connected in 3D Torus topology.
Refer to caption
(c) Cluster of 64 Dojo D1 Training Matrices connected by a switch.
Figure 14: Illustration of three cluster types evaluated in § V-D. All link bandwidths shown are per direction.
Refer to caption
Figure 15: Comparison of runtime across different cluster configurations, normalized to cluster A0.
DGX cluster variants

We model three base 1024-GPU cluster variants—A, B, and C—with different resource (compute, memory, network) provisioning. We base each design on a major GPU model (V100, A100 and H100). All GPU cluster variants are organized in 16-GPU pods and feature a two-dimensional network like the one shown in Fig. 7. For each base cluster variant, we evaluate three memory systems—0, 1 and 2—with different characteristics, for a total of nine GPU cluster variants. Memory system 0 consists of only local GPU memory without any capacity expansion. Memory systems 1 and 2 are hypothetical expanded memory systems accessible at 0.5TB/s and 1TB/s, respectively.

TPU v4 cluster

We model a TPU cluster of 4096 TPU-V4 chips connected in a 3D Torus topology with 48GB/s full duplex links. Each TPU features 32MB of on-chip SRAM, a 32GB HBM with a memory bandwidth of 1.2TB/s, and 275 TFLOPS peak performance [71].

Dojo cluster

We model a Dojo cluster of 64 nodes (“trays”, each comprising several training tiles and interface processors). Each node has 66GB of on-chip SRAM, 640GB of memory with a 16 TB/s bandwidth, and 54.3 PFLOPS peak performance. Given limited publicly available information, we model the network topology as a single logical switch delivering 1 TB/s of full duplex network bandwidth to each node [34].

Fig. 15 shows the speedup for DLRM and Transformer-1T across different cluster configurations. Every cluster except A0, B0, C0, and Dojo assumes per-node memory expansion with capacity and bandwidth characteristics listed in Table III. DLRM training is modeled as described in § V-C: Clusters A0, B0, and C0 use 64 nodes to run a single DLRM instance. A1, B1 and C1 leverage their expanded memory to train one DLRM per 16 nodes. Likewise, A2, B2 and C2 use 8 nodes per instance. The reported speedup for DLRM refers to training a total of 8 model instances. For Transformer-1T, speedup refers to training a single instance on the entire cluster. All reported speedups are normalized to cluster A0 as baseline.

Clusters A2, B2 and C2 benefit from their expanded memory at 1TB/s, delivering 1.8×1.8\times1.8 ×, 2.6×2.6\times2.6 ×, and 2.7×2.7\times2.7 × speedups, respectively, for DLRM. While clusters A1, B1 and C1 fare poorly for DLRM due to their lower bandwidth to expanded memory, B1 and C1 deliver a good speedup of 7.2×7.2\times7.2 × and 12.5×12.5\times12.5 ×, respectively, for Transformer-1T. Increasing expanded memory bandwidth to 1TB/s further improves speedup for Transformer-1T (e.g., to 14.3×14.3\times14.3 × for C2).

Cluster A2’s double bandwidth to expanded memory improves cluster A1’s DLRM performance by 1.64×1.64\times1.64 ×. Due to the memory-bound nature of DLRM, memory bandwidth improvements are more critical than compute capacity or network bandwidth. On the other hand, Transformer-1T is more sensitive to the compute capacity and network bandwidth, and therefore low compute and network bandwidth drastically reduces speedup opportunities for clusters A1 and A2. Transformer-1T benefits from TPU’s large compute capacity and network bandwidth, but the DLRM suffers from its low memory capacity and local memory bandwidth. In contrast, Dojo significantly benefits both workloads due to large on-chip SRAM, memory capacity, and high network bandwidth.

Overall, among GPU cluster variants, there is no single optimal configuration for both Transformer-1T and DLRM, as different workloads are impacted differently. Disregarding any cost considerations, the best GPU cluster on average is C0, delivering a 7.7×\times× speedup over the baseline A0 cluster. Memory expansion is an effective technique for all clusters when training Transformers, but only for the lowest-end cluster A on average, due to DLRM’s memory bandwidth sensitivity. This study demonstrates that, as one size does not fit all, system architects should evaluate a workload mix representative of the main use cases for the target cluster to determine the system configuration that best fits the ensemble.

V-E \TheNametakeaways: versatility and speed

The extensive evaluation in § V-B and V-C demonstrates \TheName’s versatility and the breadth of case studies it facilitates. We illustrated that \TheNameenables joint sensitivity analysis of the effect of node compute capability, memory system design, and network provisioning on cluster performance, facilitating balanced resource provisioning and identification of cost-reduction opportunities. \TheNamealso allows rapid exploration and evaluation of memory system design on a cluster’s performance as a function of its capacity and bandwidth characteristics, hel** system designers determine what existing technologies for memory expansion are viable to improve training performance, and gauge the impact of relevant future technologies.

While § V’s evaluation focused on demonstrating the utility and flexibility of the tool, an additional key strength of \TheNameis the short turnaround time of experiments, allowing researchers to rapidly glean performance trends. Once a target model is broken down into its layer-wise representation as specified in § III-A, exploring a broad design space is a matter of few hours. \TheName’s methodology allows modeling different compute nodes, network topologies, memory bandwidths and parallelization strategies in an embarrassingly parallel fashion on commodity processors. To provide some concrete data points, generating the two memory bandwidth sensitivity heatmaps (Fig. 9 for Transformer-1T and Fig. 13b for DLRM) takes about 5 hours / 45 minutes, respectively, on a single 24-core Intel Xeon Silver server. Other analyses presented in the paper require comparable runtimes. Such rapid exploration of a wide range of design choices is a valuable capability \TheNamecontributes.

VI Conclusion

We introduced \TheName, a holistic, end-to-end methodology and workflow for rapid design space exploration of key parameters affecting the performance and efficiency of large-scale distributed DL training: the model’s parallelization strategy and provisioning of each of the cluster’s key resources. We demonstrated \TheName’s utility with case studies on DLRM and Transformer models, deriving actionable cluster design hints for system designers, and identifying required bandwidth and capacity characteristics for emerging memory expansion techniques to have a positive impact on cluster performance on distributed DL training. \TheNameenables the evaluation of a wide spectrum of key cluster design parameters in a matter of few hours using very modest physical hardware resources.

Acknowledgements

We thank William Won, Geonhwa Jeong, Shreerang Dabade and Raveesh Garg for their help in the early stages of constructing the paper’s methodology, as well as Marina Vemmou and Hamed Seyedroudbari for their feedback on drafts of the paper. We also thank Sudarshan Srinivasan for technical discussions on this work. This work was generously supported by a research gift from Samsung Semiconductor’s Memory Solutions Lab.

References

  • [1] B. Acun, M. Murphy, X. Wang, J. Nie, C.-J. Wu, and K. Hazelwood, “Understanding training efficiency of deep learning recommendation models at scale,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021, pp. 802–814.
  • [2] N. Agarwal, T. Krishna, L.-S. Peh, and N. K. Jha, “Garnet: A detailed on-chip network model inside a full-system simulator,” in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on.   IEEE, 2009, pp. 33–42.
  • [3] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, J. Chen, J. Chen, Z. Chen, M. Chrzanowski, A. Coates, G. Diamos, K. Ding, N. Du, E. Elsen, J. Engel, W. Fang, L. Fan, C. Fougner, L. Gao, C. Gong, A. Hannun, T. Han, L. Johannes, B. Jiang, C. Ju, B. Jun, P. LeGresley, L. Lin, J. Liu, Y. Liu, W. Li, X. Li, D. Ma, S. Narang, A. Ng, S. Ozair, Y. Peng, R. Prenger, S. Qian, Z. Quan, J. Raiman, V. Rao, S. Satheesh, D. Seetapun, S. Sengupta, K. Srinet, A. Sriram, H. Tang, L. Tang, C. Wang, J. Wang, K. Wang, Y. Wang, Z. Wang, Z. Wang, S. Wu, L. Wei, B. Xiao, W. Xie, Y. Xie, D. Yogatama, B. Yuan, J. Zhan, and Z. Zhu, “Deep speech 2 : End-to-end speech recognition in english and mandarin,” in Proceedings of The 33rd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 48, New York, New York, USA, 20–22 Jun 2016, pp. 173–182. [Online]. Available: https://proceedings.mlr.press/v48/amodei16.html
  • [4] P. Belevich, Y. Zhao, S. Li, J. Choi, R. Varma, P. Damania, G. Chauhan, M. Yadav, P.-Y. Aquilanti, and S. Ranganatha, “Training a 1 Trillion Parameter Model With PyTorch Fully Sharded Data Parallel on AWS,” https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff, 2022, [Online; accessed 28-June-2022].
  • [5] A. Castelló, M. Catalán, M. F. Dolz, J. I. Mestre, E. S. Quintana-Ortí, and J. Duato, “Performance modeling for distributed training of convolutional neural networks,” in 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2021, pp. 99–108.
  • [6] F. Checconi, J. J. Tithi, and F. Petrini, “Ridgeline: A 2d roofline model for distributed systems,” 2022.
  • [7] J. Chen, S. Li, R. Guo, J. Yuan, and T. Hoefler, “Autoddl: Automatic distributed deep learning with asymptotically optimal communication,” CoRR, vol. abs/2301.06813, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2301.06813
  • [8] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’14, New York, NY, USA, 2014, p. 269–284. [Online]. Available: https://doi.org/10.1145/2541940.2541967
  • [9] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.
  • [10] M. Cho, U. Finkler, M. Serrano, D. Kung, and H. Hunter, “Blueconnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy,” IBM Journal of Research and Development, vol. 63, no. 6, pp. 1:1–1:11, 2019.
  • [11] P. Covington, J. Adams, and E. Sargin, “Deep neural networks for youtube recommendations,” in Proceedings of the 10th ACM Conference on Recommender Systems, ser. RecSys ’16, New York, NY, USA, 2016, p. 191–198. [Online]. Available: https://doi.org/10.1145/2959100.2959190
  • [12] CXL Consortium, “Compute Express Link (CXL) Specification, Revision 3.0, Version 1.0,” 2022, https://www.computeexpresslink.org/_files/ugd/0c1418_1798ce97c1e6438fba818d760905e43a.pdf.
  • [13] N. Ding and S. Williams, “An instruction roofline model for gpus,” in 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2019, pp. 7–18.
  • [14] J. Dong, Z. Cao, T. Zhang, J. Ye, S. Wang, F. Feng, L. Zhao, X. Liu, L. Song, L. Peng, Y. Guo, X. Jiang, L. Tang, Y. Du, Y. Zhang, P. Pan, and Y. Xie, “Eflops: Algorithm and system co-design for a high performance distributed training platform,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 610–622.
  • [15] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022. [Online]. Available: http://jmlr.org/papers/v23/21-0998.html
  • [16] P. Ganesh, Y. Chen, X. Lou, M. A. Khan, Y. Yang, H. Sajjad, P. Nakov, D. Chen, and M. Winslett, “Compressing Large-Scale Transformer-Based Models: A Case Study on BERT,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1061–1080, 09 2021. [Online]. Available: https://doi.org/10.1162/tacl_a_00413
  • [17] C. A. Gomez-Uribe and N. Hunt, “The netflix recommender system: Algorithms, business value, and innovation,” ACM Trans. Manage. Inf. Syst., vol. 6, no. 4, dec 2016. [Online]. Available: https://doi.org/10.1145/2843948
  • [18] D. Gouk, M. Kwon, H. Bae, S. Lee, and M. Jung, “Memory pooling with cxl,” IEEE Micro, vol. 43, no. 2, pp. 48–57, 2023.
  • [19] D. Grechishnikova, “Transformer neural network for protein-specific de novo drug generation as a machine translation problem,” Scientific Reports, vol. 11, no. 1, p. 321, Jan 2021. [Online]. Available: https://doi.org/10.1038/s41598-020-79682-4
  • [20] U. Gupta, C.-J. Wu, X. Wang, M. Naumov, B. Reagen, D. Brooks, B. Cottel, K. Hazelwood, M. Hempstead, B. Jia, H.-H. S. Lee, A. Malevich, D. Mudigere, M. Smelyanskiy, L. Xiong, and X. Zhang, “The architectural implications of facebook’s dnn-based personalized recommendation,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 488–501.
  • [21] Y. Huang, Y. Cheng, D. Chen, H. Lee, J. Ngiam, Q. V. Le, and Z. Chen, “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” CoRR, vol. abs/1811.06965, 2018. [Online]. Available: http://arxiv.longhoe.net/abs/1811.06965
  • [22] K. Z. Ibrahim, S. Williams, and L. Oliker, “Performance analysis of gpu programming models using the roofline scaling trajectories,” in Benchmarking, Measuring, and Optimizing, W. Gao, J. Zhan, G. Fox, X. Lu, and D. Stanzione, Eds.   Cham: Springer International Publishing, 2020, pp. 3–19.
  • [23] A. Jain, A. A. Awan, Q. Anthony, H. Subramoni, and D. K. D. Panda, “Performance characterization of dnn training using tensorflow and pytorch on modern clusters,” in 2019 IEEE International Conference on Cluster Computing (CLUSTER), 2019, pp. 1–11.
  • [24] P. Jain, A. Jain, A. Nrusimha, A. Gholami, P. Abbeel, J. Gonzalez, K. Keutzer, and I. Stoica, “Checkmate: Breaking the memory wall with optimal tensor rematerialization,” in Proceedings of Machine Learning and Systems, vol. 2, 2020, pp. 497–511. [Online]. Available: https://proceedings.mlsys.org/paper/2020/file/084b6fbb10729ed4da8c3d3f5a3ae7c9-Paper.pdf
  • [25] M. Jeon, S. Venkataraman, A. Phanishayee, J. Qian, W. Xiao, and F. Yang, “Analysis of Large-Scale Multi-Tenant GPU clusters for DNN training workloads,” in 2019 USENIX Annual Technical Conference (USENIX ATC 19), Renton, WA, Jul. 2019, pp. 947–960. [Online]. Available: https://www.usenix.org/conference/atc19/presentation/jeon
  • [26] Z. Jia, M. Zaharia, and A. Aiken, “Beyond data and model parallelism for deep neural networks,” in Proceedings of Machine Learning and Systems 2019, MLSys 2019, Stanford, CA, USA, March 31 - April 2, 2019, 2019.
  • [27] Y. Jiang, Y. Zhu, C. Lan, B. Yi, Y. Cui, and C. Guo, “A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters,” in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), Nov. 2020, pp. 463–479. [Online]. Available: https://www.usenix.org/conference/osdi20/presentation/jiang
  • [28] N. P. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles, C. Young, X. Zhou, Z. Zhou, and D. Patterson, “Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,” 2023.
  • [29] N. P. Jouppi, D. H. Yoon, G. Kurian, S. Li, N. Patil, J. Laudon, C. Young, and D. Patterson, “A domain-specific supercomputer for training deep neural networks,” Commun. ACM, vol. 63, no. 7, p. 67–78, jun 2020. [Online]. Available: https://doi.org/10.1145/3360307
  • [30] M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-sim: An extensible simulation framework for validated GPU modeling,” in 47th ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2020, Valencia, Spain, May 30 - June 3, 2020, 2020, pp. 473–486.
  • [31] T. Khan, S. Rashidi, S. Sridharan, P. Shurpali, A. Akella, and T. Krishna, “Impact of roce congestion control policies on distributed training of dnns,” in 2022 IEEE Symposium on High-Performance Interconnects (HOTI), 2022, pp. 39–48.
  • [32] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
  • [33] H. Kwon, A. Samajdar, and T. Krishna, “Maeri: Enabling flexible dataflow map** over dnn accelerators via reconfigurable interconnects,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’18.   New York, NY, USA: Association for Computing Machinery, 2018, p. 461–475. [Online]. Available: https://doi.org/10.1145/3173162.3173176
  • [34] F. Lambert, “Tesla releases new deep-dive presentations on its dojo ai supercomputer,” Aug 2022, https://electrek.co/2022/08/24/tesla-deep-dive-presentations-dojo-ai-supercomputer/.
  • [35] C. Li, A. Dakkak, J. Xiong, W. Wei, L. Xu, and W.-m. Hwu, “Xsp: Across-stack profiling and analysis of machine learning models on gpus,” in 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020, pp. 326–327.
  • [36] H. Li, D. S. Berger, S. Novakovic, L. Hsu, D. Ernst, P. Zardoshti, M. Shah, S. Rajadnya, S. Lee, I. Agarwal, M. D. Hill, M. Fontoura, and R. Bianchini, “Pond: Cxl-based memory pooling systems for cloud platforms,” in Proceedings of the 28th International Conference on Architectural Support for Programming Languages and Operating Systems, 2023.
  • [37] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania, and S. Chintala, “Pytorch distributed: Experiences on accelerating data parallel training,” CoRR, vol. abs/2006.15704, 2020. [Online]. Available: https://arxiv.longhoe.net/abs/2006.15704
  • [38] J. Lin, A. Yang, J. Bai, C. Zhou, L. Jiang, X. Jia, A. Wang, J. Zhang, Y. Li, W. Lin, J. Zhou, and H. Yang, “M6-10T: A sharing-delinking paradigm for efficient multi-trillion parameter pretraining,” CoRR, vol. abs/2110.03888, 2021. [Online]. Available: https://arxiv.longhoe.net/abs/2110.03888
  • [39] G. Lu, R. Chen, Y. Wang, Y. Zhou, R. Zhang, Z. Hu, Y. Miao, Z. Cai, L. Li, J. Leng, and M. Guo, “Distsim: A performance model of large-scale hybrid distributed dnn training,” 2023.
  • [40] H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen, M. Chowdhury, S. Kanaujia, and P. Chauhan, “Tpp: Transparent page placement for cxl-enabled tiered memory,” in Proceedings of the 28th International Conference on Architectural Support for Programming Languages and Operating Systems, 2023.
  • [41] T. Miao, Q. Wu, T. Liu, P. Cui, R. Ren, Z. Li, and G. Xie, “Md-roofline: A training performance analysis model for distributed deep learning,” in 2022 IEEE Symposium on Computers and Communications (ISCC), 2022, pp. 1–8.
  • [42] D. Mudigere, Y. Hao, J. Huang, Z. Jia, A. Tulloch, S. Sridharan, X. Liu, M. Ozdal, J. Nie, J. Park, L. Luo, J. A. Yang, L. Gao, D. Ivchenko, A. Basant, Y. Hu, J. Yang, E. K. Ardestani, X. Wang, R. Komuravelli, C.-H. Chu, S. Yilmaz, H. Li, J. Qian, Z. Feng, Y. Ma, J. Yang, E. Wen, H. Li, L. Yang, C. Sun, W. Zhao, D. Melts, K. Dhulipala, K. Kishore, T. Graf, A. Eisenman, K. K. Matam, A. Gangidi, G. J. Chen, M. Krishnan, A. Nayak, K. Nair, B. Muthiah, M. khorashadi, P. Bhattacharya, P. Lapukhov, M. Naumov, A. Mathews, L. Qiao, M. Smelyanskiy, B. Jia, and V. Rao, “Software-hardware co-design for fast and scalable training of deep learning recommendation models,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22, New York, NY, USA, 2022, p. 993–1011. [Online]. Available: https://doi.org/10.1145/3470496.3533727
  • [43] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro et al., “Efficient large-scale language model training on gpu clusters using megatron-lm,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15.
  • [44] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia, “Efficient large-scale language model training on GPU clusters using megatron-lm,” in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), B. R. de Supinski, M. W. Hall, and T. Gamblin, Eds.
  • [45] M. Naumov, J. Kim, D. Mudigere, S. Sridharan, X. Wang, W. Zhao, S. Yilmaz, C. Kim, H. Yuen, M. Ozdal, K. Nair, I. Gao, B. Su, J. Yang, and M. Smelyanskiy, “Deep learning training in facebook data centers: Design of scale-up and scale-out systems,” CoRR, vol. abs/2003.09518, 2020. [Online]. Available: https://arxiv.longhoe.net/abs/2003.09518
  • [46] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy, “Deep learning recommendation model for personalization and recommendation systems,” CoRR, vol. abs/1906.00091, 2019. [Online]. Available: http://arxiv.longhoe.net/abs/1906.00091
  • [47] NVIDIA, “NVIDIA Collective Communication Library (NCCL),” https://developer.nvidia.com/nccl.
  • [48] NVIDIA, “Nvidia dgx a100,” Cloud & Data Center DGX A100, Aug 2022, https://images.nvidia.com/aem-dam/Solutions/Data-Center/nvidia-dgx-a100-80gb-datasheet.pdf.
  • [49] G. Ofenbeck, R. Steinmann, V. Caparros, D. G. Spampinato, and M. Püschel, “Applying the roofline model,” in 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2014, pp. 76–85.
  • [50] M. Ott, S. Shleifer, M. Xu, P. Goyal, Q. Duval, and V. Caggiano, “Fully Sharded Data Parallel: faster AI training with fewer GPUs,” https://engineering.fb.com/2021/07/15/open-source/fsdp/, 2021, [Online; accessed 28-June-2022].
  • [51] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A systematic approach to dnn accelerator evaluation,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 304–315.
  • [52] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, “Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 58–70.
  • [53] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, pp. 1–16.
  • [54] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, “Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’21, New York, NY, USA, 2021. [Online]. Available: https://doi.org/10.1145/3458817.3476205
  • [55] S. Rashidi, M. Denton, S. Sridharan, S. Srinivasan, A. Suresh, J. Nie, and T. Krishna, “Enabling compute-communication overlap in distributed deep learning training platforms,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 540–553.
  • [56] S. Rashidi, P. Shurpali, S. Sridharan, N. Hassani, D. Mudigere, K. Nair, M. Smelyanski, and T. Krishna, “Scalable distributed training of recommendation models: An astra-sim + ns3 case-study with tcp/ip transport,” in 2020 IEEE Symposium on High-Performance Interconnects (HOTI), 2020, pp. 33–42.
  • [57] S. Rashidi, S. Sridharan, S. Srinivasan, and T. Krishna, “Astra-sim: Enabling sw/hw co-design exploration for distributed dl training platforms,” in 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).   IEEE, 2020, pp. 81–92.
  • [58] S. Rashidi, W. Won, S. Srinivasan, S. Sridharan, and T. Krishna, “Themis: A network bandwidth-aware collective scheduling policy for distributed training of dl models,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22, New York, NY, USA, 2022, p. 581–596. [Online]. Available: https://doi.org/10.1145/3470496.3527382
  • [59] J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He, “ZeRO-Offload: Democratizing Billion-Scale model training,” in 2021 USENIX Annual Technical Conference (USENIX ATC 21).   USENIX Association, Jul. 2021, pp. 551–564. [Online]. Available: https://www.usenix.org/conference/atc21/presentation/ren-jie
  • [60] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech: Fast, robust and controllable text to speech,” in Advances in Neural Information Processing Systems, vol. 32, 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/f63f65b503e22cb970527f23c9ad7db1-Paper.pdf
  • [61] Y. Ren, S. Yoo, and A. Hoisie, “Performance analysis of deep learning workloads on leading-edge systems,” in 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2019, pp. 103–113.
  • [62] G. F. Riley and T. R. Henderson, “The ns-3 network simulator,” in Modeling and Tools for Network Simulation, 2010, pp. 15–34. [Online]. Available: https://doi.org/10.1007/978-3-642-12331-3_2
  • [63] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, “A systematic methodology for characterizing scalability of dnn accelerators using scale-sim,” in 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2020, pp. 58–68.
  • [64] A. Shah, V. Chidambaram, M. Cowan, S. Maleki, M. Musuvathi, T. Mytkowicz, J. Nelson, O. Saarikivi, and R. Singh, “Synthesizing collective communication algorithms for heterogeneous networks with TACCL,” CoRR, vol. abs/2111.04867, 2021. [Online]. Available: https://arxiv.longhoe.net/abs/2111.04867
  • [65] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, R. Sepassi, and B. Hechtman, “Mesh-tensorflow: Deep learning for supercomputers,” in Advances in Neural Information Processing Systems, vol. 31, 2018. [Online]. Available: https://proceedings.neurips.cc/paper/2018/file/3a37abdeefe1dab1b30f7c5c7e581b93-Paper.pdf
  • [66] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019.
  • [67] B. Smith and G. Linden, “Two decades of recommender systems at amazon.com,” IEEE Internet Computing, vol. 21, no. 3, pp. 12–18, 2017.
  • [68] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti, E. Zheng, R. Child, R. Y. Aminabadi, J. Bernauer, X. Song, M. Shoeybi, Y. He, M. Houston, S. Tiwary, and B. Catanzaro, “Using deepspeed and megatron to train megatron-turing NLG 530b, A large-scale generative language model,” CoRR, vol. abs/2201.11990, 2022. [Online]. Available: https://arxiv.longhoe.net/abs/2201.11990
  • [69] P. Sun, Y. Wen, R. Han, W. Feng, and S. Yan, “Gradientflow: Optimizing network performance for large-scale distributed dnn training,” IEEE Transactions on Big Data, vol. 8, no. 2, pp. 495–507, 2022.
  • [70] R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H. Cheng, A. **, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, Y. Zhou, C. Chang, I. Krivokon, W. Rusch, M. Pickett, K. S. Meier-Hellstern, M. R. Morris, T. Doshi, R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen, V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Rajakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fenton, A. Cohen, R. Bernstein, R. Kurzweil, B. Aguera-Arcas, C. Cui, M. Croak, E. H. Chi, and Q. Le, “Lamda: Language models for dialog applications,” CoRR, vol. abs/2201.08239, 2022. [Online]. Available: https://arxiv.longhoe.net/abs/2201.08239
  • [71] “Tpu v4 system architecture,” Cloud TPU Documentation Guides, https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#tpu_v4.
  • [72] J. Vamathevan, D. Clark, P. Czodrowski, I. Dunham, E. Ferran, G. Lee, B. Li, A. Madabhushi, P. Shah, M. Spitzer, and S. Zhao, “Applications of machine learning in drug discovery and development,” Nature Reviews Drug Discovery, vol. 18, no. 6, pp. 463–477, Jun 2019. [Online]. Available: https://doi.org/10.1038/s41573-019-0024-5
  • [73] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  • [74] M. Wang, L. Dong, W. Dong, and Z. Lv, “A deep neural network optimization strategy based on roofline model,” in 2022 3rd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), 2022, pp. 103–109.
  • [75] M. Wang, C. Meng, G. Long, C. Wu, J. Yang, W. Lin, and Y. Jia, “Characterizing deep learning training workloads on alibaba-pai,” in 2019 IEEE International Symposium on Workload Characterization (IISWC), 2019, pp. 189–202.
  • [76] Y. Wang, C. Yang, S. Farrell, Y. Zhang, T. Kurth, and S. Williams, “Time-based roofline for deep learning performance analysis,” in 2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS), 2020, pp. 10–19.
  • [77] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Commun. ACM, vol. 52, no. 4, p. 65–76, apr 2009. [Online]. Available: https://doi.org/10.1145/1498765.1498785
  • [78] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew, “Huggingface’s transformers: State-of-the-art natural language processing,” CoRR, vol. abs/1910.03771, 2019. [Online]. Available: http://arxiv.longhoe.net/abs/1910.03771
  • [79] W. Won, T. Heo, S. Rashidi, S. Sridharan, S. Srinivasan, and T. Krishna, “Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale,” CoRR, vol. abs/2303.14006, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.14006
  • [80] C. Yang, “8 steps to 3.7 tflop/s on NVIDIA V100 GPU: roofline analysis and other tricks,” CoRR, vol. abs/2008.11326, 2020. [Online]. Available: https://arxiv.longhoe.net/abs/2008.11326
  • [81] C. Yang, Y. Wang, T. Kurth, S. Farrell, and S. Williams, “Hierarchical roofline performance analysis for deep learning applications,” in Intelligent Computing, K. Arai, Ed.   Cham: Springer International Publishing, 2021, pp. 473–491.
  • [82] S. Zhang, L. Diao, S. Wang, Z. Cao, Y. Gu, C. Si, Z. Shi, Z. Zheng, C. Wu, and W. Lin, “Auto-parallelizing large models with rhino: A systematic approach on production AI platform,” CoRR, vol. abs/2302.08141, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2302.08141
  • [83] S. Zhang, L. Yao, A. Sun, and Y. Tay, “Deep learning based recommender system: A survey and new perspectives,” ACM Comput. Surv., vol. 52, no. 1, feb 2019. [Online]. Available: https://doi.org/10.1145/3285029
  • [84] Y. Zhao, R. Varma, C.-C. Huang, S. Li, M. Xu, and A. Desmaison, “Introducing PyTorch Fully Sharded Data Parallel (FSDP) API,” https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/, 2022, [Online; accessed 28-June-2022].
  • [85] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing, J. E. Gonzalez, and I. Stoica, “Alpa: Automating inter- and Intra-Operator parallelism for distributed deep learning,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), Jul. 2022, pp. 559–578. [Online]. Available: https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin
  • [86] G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, and K. Gai, “Deep interest evolution network for click-through rate prediction,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 5941–5948, Jul. 2019. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/4545