License: arXiv.org perpetual non-exclusive license
arXiv:2401.16492v1 [cs.PF] 29 Jan 2024

GPU Cluster Scheduling for Network-Sensitive Deep Learning

Aakash Sharma, Vivek M. Bhasi, Sonali Singh, George Kesidis, Mahmut T. Kandemir, Chita R. Das
Computer Science and Engineering
The Pennsylvania State University
University Park, PA, USA
{abs5688,vmbhasi,sms821,gik2,mtk2,cxd12}@psu.edu
Abstract.

Deep learning (DL) constitutes a significant workload within public or private datacenters. Organizations commonly train multiple Deep Neural Networks (DNNs) in multi-tenant GPU clusters. The primary consideration in training such models is the associated "cost", which correlates directly with the GPU usage duration in the cluster. Previous studies have demonstrated that the communication overheads of interconnected GPU clusters can be a significant component of the overall training time; thus, reducing communication overhead by intelligently scheduling training jobs on physically close GPUs should shorten the long training times. However, state-of-the-art cluster schedulers are largely agnostic to proximity-based job consolidation. To compound the issue, different DNN models display varied communication overheads even with similar levels of consolidation. This results in prolonged training times.

In this context, we propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Using the simulation platform, we compare against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Our scheduler improves the end-to-end Makespan for training all jobs by up to 69% compared to the prevailing consolidation-based scheduling methods, while reducing the average job completion time by up to 83% and the communication overheads by up to 98% under congested networking conditions.


1. INTRODUCTION

Deep learning (DL) with large-scale cloud computing resources has facilitated the development of deep (very large) neural network (DNN) models, driving cutting-edge advancements in AI. These models, powering breakthroughs in domains such as autonomous vehicles (self-driving, ), multimedia processing (multimedia-DL, ) and generative AI (GAI, ), often require training hundreds of millions, if not billions or trillions, of parameters to achieve the desired accuracy. To expedite model convergence within a reasonable time-frame, parallel training paradigms based on data parallelism (pytorch_dist, ) and model parallelism (megatron, ) are employed, simultaneously utilizing multiple GPUs (or related ”neural processing” accelerators like TPUs (tpu, )) to train a single model. Furthermore, AI enterprises or groups may train hundreds or even thousands of such DNN models (philly_trace_analysis, ). In certain cases, a Machine Learning as a Service (MLaaS) (mlaas, ) provider could aggregate a substantial number of DL jobs from its diverse clientele into a single extensive GPU cluster. However, GPUs are costly, and the prolonged training times of these models lead to exorbitant training expenses.

To alleviate the financial burden of DL, DNN workloads are executed on "shared" GPU clusters, such as those offered by public cloud providers or within enterprise data centers (or private clouds). In the case of public cloud services, the cost is determined by the wallclock time spent utilizing cloud VMs, while for shared data centers, the cost is proportional to the time spent using GPU resources.

Recent studies (pipedream, ; alibaba_pai, ; stash, ) have shown that communication overhead is a major contributor to prolonged training times (and increased costs) in Distributed DL (DDL). This stems from the fact that GPUs within a cluster are interconnected through various types of networks. During DNN training, collective communication algorithms such as all-reduce, scatter-gather, and others are employed to exchange tensors among GPU workers. This communication latency can vary significantly depending on the underlying network type connecting the GPUs. Generally, the more proximal (physically close) the GPUs of a job are to each other within the network, i.e., the more "consolidated" they are, the lower the communication overhead.

A DDL job can be consolidated in three typical ways: (i) placing all worker GPUs on the same physical machine; (ii) locating all worker GPUs on the same rack interconnected via Infiniband (Infiniband, ) technology; or (iii) spreading the worker GPUs across inter-rack connections (hereafter referred to as the "network"). Each of these "network tiers" introduces significantly different communication delays, with the least overhead on the same machine, followed by the rack, and then the network. To mitigate communication overhead, a straightforward approach is to consolidate all GPUs of a job as much as possible. However, due to multi-tenancy, when cluster managers attempt to consolidate a job on physically proximal GPUs, they often must wait for resources to become available, leading to increased queueing times. Furthermore, previous research (tiresias, ; stash, ) has demonstrated that DNN models exhibit varying degrees of "network sensitivity," implying that each DNN model experiences a different increase in communication overhead as network conditions deteriorate. Thus, not all models benefit equally from the reduction in communication overhead that consolidation brings, yet they all incur the same queueing delay. Hence, it is crucial to consolidate DNN jobs based on their "network sensitivities", i.e., their relative communication overheads, in order to minimize queueing delays due to consolidation.

The prior cluster scheduler Tiresias (tiresias, ) implements a consolidation strategy for DDL jobs based on the model's "skew." In this strategy, jobs with high skew are consolidated into as few machines as possible, while others are left unconsolidated. This approach lacks the flexibility for "moderate consolidation," such as placing a job on the rack for a model that is only moderately sensitive to the network tier. Moderate consolidation is now more viable for communication, thanks to contemporary networking hardware that offers high quality-of-service (QoS).

The contemporary Infiniband based (Infiniband, ) rack switch offers remarkably high bandwidth, reaching up to 400 Gb/s per port with the utilization of GPU RDMA (gpudirect-RDMA, ), effectively reducing rack-level communication overhead. Moreover, even modern network (Ethernet) based switches provide 800 Gb/s bandwidth per port, albeit with higher latency (nvidia-spectrum, ). Yet, DDL cluster schedulers remain oblivious to such advancements and are unable to harness the consequent potential performance improvements. To address these limitations, schedulers should be equipped with the ability to dynamically adjust consolidation based on the network sensitivity of individual jobs. This adjustment should take into account the diverse performance characteristics of different network tiers, leveraging the capabilities of modern networking hardware essential for lowering communication overhead.

To achieve these goals, we propose Dally, a holistic cluster scheduler for DL job placement that accounts for the underlying system network topology, as well as the network sensitivity of each job to make intelligent placement decisions. Dally accomplishes this by employing the classical delay scheduling algorithm (delay_scheduling, ) in the context of DDL workloads, coupled with a novel network-sensitive pre-emption strategy for job placement and consolidation. Specifically, for every DDL job, Dally consults its dedicated ‘delay timers’ to decide whether to accept a resource offer at the offered network-tier or wait in anticipation of an offer with better network-tier until a delay timer has elapsed. Traditionally, the delay timers are hand-tuned and remain fixed for a given system. However, in this work, we introduce a novel ‘auto-tuner’ that dynamically evolves with the system to tune these delay timers for high-efficacy delay scheduling. With the help of these innovations, Dally is able to outperform state-of-the-art schedulers in terms of makespan, job completion time (JCT) and communication overhead.

For evaluation of Dally, we have developed a simulation platform – ArtISt-sim. The use of simulations is motivated by the fact that development of novel DL cluster schedulers is currently impeded by the exorbitant costs associated with the large-scale experimental setups needed for such endeavors. A single GPU for DL costs thousands of dollars, and replicating a real-world DL data center necessitates a substantial number of these GPUs for extended periods.

ArtISt-sim has been developed by extending prior work (themis-nsdi, ) and utilizing ASTRA-sim (rashidi2020astra, ), a highly accurate DDL job simulator. Our simulation platform possesses two key capabilities: (i) it can dynamically determine network slowdowns based on the specific placement of a DDL job (its assigned GPUs) within the simulated cluster, and (ii) it can simulate a datacenter equipped with modern networking hardware (e.g., NVIDIA Quantum (nvidia-quantum, ) and Spectrum switches (nvidia-spectrum, )), which is crucial for this work. This combination of features enables ArtISt-sim to provide a realistic and accurate representation of modern DDL workloads. Moreover, to the best of our knowledge, modern high-speed network switches such as the ones mentioned above are unavailable in any public cloud for use as a test bench. Hence, ArtISt-sim can enable researchers to develop novel DDL scheduling techniques with modern networking hardware.

To summarize, in this paper, our primary contributions are outlined as follows:

  • We present Dally, a novel DL cluster scheduler, crafted to reduce end-to-end Makespan, average JCT and communication overhead. Dally employs delay scheduling to achieve optimal job consolidation and placement and implements a network-sensitive preemption policy, facilitating prioritized placements of jobs with heightened sensitivity to network conditions.

  • Dally also incorporates an "auto-tuner" capable of dynamically adjusting the delay timers required for delay scheduling, based on network usage.

  • Further, we develop a high-fidelity (iteration-level) DDL cluster simulator using ASTRA-sim called ArtISt-sim, which effectively emulates a DDL data center equipped with contemporary networking hardware.

  • Finally, extensive evaluation of Dally using ArtISt-sim, encompassing both batched and randomly arriving workloads of six real-world DNN models, reveals significant improvements. Dally improves makespan by over 69% compared to a consolidation-based policy, reduces average JCT by more than 83%, and cuts average communication overhead by 98% under congested networking conditions.

2. BACKGROUND AND RELATED WORK

2.1. Delay scheduling based on data locality

Delay scheduling is a traditional scheduling algorithm used in big-data clusters that defers the scheduling of a job when a suitable/preferred job placement destination (hardware) is not immediately available. Instead of promptly accepting a resource offer from any worker machine, a job waits for a "delay time", anticipating that a worker node with better data locality will become available. Data locality refers to the location of the input data for the job, which could be on the same machine, the same rack, or somewhere accessible over the network. The farther the input data is from the job, the longer it takes for the data to be read. System administrators set the delay time, often keeping the default values established by cluster schedulers. Traditional big-data schedulers, like YARN (yarn, ) and Spark (spark, ), typically implement delay scheduling as a "default" strategy for better data locality. Note that deep learning jobs typically assume the availability of input data on the local machine or through fast remote storage.
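The accept-or-wait decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by YARN, Spark, or Dally: the names `DelayTimers` and `accept_offer` are hypothetical, and the choice to make the rack-level wait cumulative on top of the machine-level wait is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class DelayTimers:
    """Hypothetical per-level wait budgets, in seconds."""
    machine_wait: float  # time to hold out for a machine-local offer
    rack_wait: float     # additional time to hold out for a rack-local offer


def accept_offer(offer_level: str, waited: float, timers: DelayTimers) -> bool:
    """Decide whether a job that has waited `waited` seconds accepts an offer.

    Assumption: levels are "machine" (best), "rack", and "network" (worst),
    and the rack wait starts only after the machine wait has expired.
    """
    if offer_level == "machine":
        return True  # best locality: always accept
    if offer_level == "rack":
        return waited >= timers.machine_wait
    if offer_level == "network":
        return waited >= timers.machine_wait + timers.rack_wait
    raise ValueError(f"unknown locality level: {offer_level}")
```

With `DelayTimers(machine_wait=10, rack_wait=20)`, a job that has waited 5 seconds rejects a rack-level offer, while the same offer is accepted once the job has waited 10 seconds or more.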

2.2. Simulation Platforms

ASTRA-sim (Accelerator Scaling for TRAining Simulator) is a comprehensive simulation platform facilitating parameterized descriptions of DNN model, system, and network fabric for end-to-end simulation of a DDL loop. ASTRA-sim users can configure various DDL training setups, including DNN models (number and types of layers, GEneral Matrix Multiplication (GEMM) size (gemm, ), etc), DDL parallelism (data, model, hybrid, etc.), fabric design (number of links, latency/bandwidth per link, etc.), and fabric topology (2D/3D torus, switch, ring, etc). ASTRA-sim leverages network-layer simulators, e.g., Garnet (garnet, ), Analytical (analytical, ), and ns3 (rashidi2020scalable, ), for packet-accurate simulations. An improved version of ASTRA-sim (won2023astrasim20, ) with capabilities such as arbitrary model parallelism, improved hierarchical topology parameterization and enhanced memory system modeling is now available. Also, (Akella22, ) developed a simulator using ASTRA-sim and ns3 to model various congestion control schemes.

ASTRA-sim expects the following three input files to simulate a single loop of DDL training:
  • Workload file: This file contains the delay in cycles for the GEMM operations of each layer during the forward and backward pass along with layer-wise communication size.
  • System file: The system file provides parameters to the network scheduler such as topology, packet scheduling policy (e.g., LIFO, FIFO), all-reduce implementation, etc.
  • Network file: The network file encompasses details regarding the physical network topology, network dimensions, and link information (bandwidth, latency, etc.).

We want to emphasize however that ASTRA-sim functions as a single DDL job simulator and lacks the capability to simulate a large GPU cluster running multiple DL jobs.

2.3. Prior Work on DL cluster scheduling

A plethora of DL cluster schedulers have been developed recently with varying objectives such as JCT performance, fairness, and SLO compliance. In terms of performance, approaches like (lucid-asplos23, ; ant-man, ) focus on packing multiple DNN models onto a single GPU, boosting parallelism and decreasing overall execution time. Our work, centered around network sensitivity to enhance performance, is orthogonal to such efforts, as our proposed techniques can be readily applied to GPUs simultaneously running multiple models. However, the absence of communication characterization for such ”packed models” hinders the application of our approach. Network agnostic works such as (pollux, ; optimus, ; gandiva, ) improve performance through resource reallocation, while (horus, ) is an interference-aware scheduler which is designed for a heterogeneous GPU cluster. For fairness, (themis-nsdi, ; gandiva-fair, ) have been proposed, while (chronus, ) is an SLO compliant scheduler.

3. MOTIVATION

A typical DL cluster today comprises a large number of GPUs communicating via different types of network interconnects. These interconnects may be (i) intra-machine like NVSwitch (nvlink-nvswitch, ); (ii) intra-rack connected via Infiniband; or, (iii) inter-rack connected via modern network (Ethernet) switches. Such a hierarchical network structure is depicted in Fig. 1. In this figure, a job mapped to GPUs on the same machine is shown in red; a job running on the same rack (but on different machines) is marked in blue; and a job mapped to GPUs across the network is shown in grey. The communication overhead is lowest for the jobs highlighted in red, followed by those in blue and then grey.

Figure 1. A typical (hierarchical) datacenter network.

3.1. Communication overhead in DDL Clusters

DDL training throughput is heavily influenced by communication speed and latency due to the significant volume of data transfer among DNN workers during gradient synchronization. As highlighted in (kungfu, ), an 8-GPU server engaged in training a ResNet50 model produces 4 GB of gradients per second. Under layer-wise model synchronization, starting from the output layers under back-propagation: (i) gradients (of the loss training-objective based on the GPUs' assigned training-data batches) are computed with respect to parameters of the current layer, (ii) these gradients are then exchanged among the GPUs, and finally, (iii) the current layer's parameters are simultaneously updated by each GPU. Hence, multiple DDL models can quickly overwhelm the link bandwidth. This underscores the importance of high-speed and low-latency interconnects between GPUs for mitigating communication-related stalls. Previous studies such as (pipedream, ; stash, ) reveal that the communication overheads can be as high as 5× the actual compute time. To circumvent such substantial overheads, schedulers like Tiresias strive to consolidate DDL jobs on the most proximal GPUs possible.

Figure 2. Single iteration running time for models consolidated on the same machine, rack, and across the network

3.2. Job consolidation in DL cluster

Cluster schedulers like Tiresias (tiresias, ) aim to reduce communication overhead by consolidating job placement, and assigning GPUs to a job within the fewest machines possible. However, due to multi-tenancy, such consolidation can lead to lengthy queueing delays, depending on the cluster’s load. Individual jobs may reject resource offers generated by the cluster scheduler if the offered GPUs are not located on the same machine or rack, especially when GPU demand is high. As a result, jobs remain queued, increasing their job completion time (JCT).

However, not all jobs benefit equally from consolidation, as network overhead varies across DNN models. More specifically, jobs which are less sensitive to network quality (less ‘network-sensitive’ models) can tolerate less stringent network consolidation (consolidation on physically distant GPUs) compared to highly ‘network-sensitive’ jobs. To illustrate this, in Fig. 2, we simulate the single iteration DDL running times using ASTRA-sim for various models with different network placements. As expected, in general, consolidation on physically proximal GPUs yields lower run times. However, we observe significant variations in communication overheads across network tiers for different DNN models (also shown in Table 1). Thus, model network sensitivities could act as a potentially relevant metric in determining the priority of each job with respect to consolidation: the higher the network sensitivity of a model, the more it should be prioritized to get an allocation with physically proximal (consolidated) GPUs.

In this direction, Tiresias seeks to categorize network-sensitive jobs by employing model "skew," which is defined as the ratio of the largest tensor size to the overall model size. Tiresias consolidates jobs with high skew, while other jobs accept any resource offer they receive. However, using ASTRA-sim to analyze the communication overheads of various models under different network placements and high-speed networking hardware reveals only a weak correlation between skew and network sensitivity, as shown in Table 1. For example, ResNet18 has low skew but very high network-level communication overhead, whereas VGG11 has high skew but low communication overhead. This could be due to the rapid advancements in state-of-the-art (SOTA) network hardware.

Model                      Skew   Machine   Rack   Network
VGG11 (vgg, )              High   1%        6%     7%
AlexNet (alexnet, )        High   2%        13%    100%
MobileNetV3 (mobilenet, )  High   42%       940%   19592%
ResNet18 (resnet, )        Low    7%        116%   2749%
ResNet50 (resnet, )        Low    12%       12%    38%
BERT large (bert, )        Low    8%        23%    715%

Table 1. Communication overhead as a % of compute time, for placements on the same machine, the same rack, and across the network.
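The skew metric, defined in the text as the ratio of the largest tensor size to the overall model size, can be computed as in the sketch below. The function name `model_skew` and the tensor sizes are hypothetical and chosen only for illustration; they are not taken from any of the models in Table 1.

```python
def model_skew(tensor_sizes):
    """Skew as defined in the text: largest tensor size / overall model size."""
    total = sum(tensor_sizes)
    if total <= 0:
        raise ValueError("model must have a positive total size")
    return max(tensor_sizes) / total


# Hypothetical per-tensor parameter counts (illustrative only).
dominated = [1_000, 2_000, 97_000]            # one tensor holds most parameters
balanced = [25_000, 25_000, 25_000, 25_000]   # parameters spread evenly

print(model_skew(dominated))  # high skew
print(model_skew(balanced))   # low skew
```

Because skew depends only on how parameters are distributed across tensors, it carries no information about the network tier or hardware, which is one plausible reason it correlates weakly with the overheads in Table 1.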

3.3. Improvements in SOTA network hardware

Recent advancements in network hardware have introduced remarkably high network bandwidth with exceptionally low latency. Hardware manufacturers, such as NVIDIA, are now offering network switches that deliver several-fold improvements in QoS compared to previous-generation hardware. For intra-machine networking, NVSwitch (nvlink-nvswitch, ) provides a bandwidth of up to 900 Gb/s, surpassing PCIe's bandwidth of only up to tens of GB/s combined with high latency (li2019evaluating, ). Furthermore, for Infiniband-based intra-rack networking, NVIDIA Quantum switches now offer up to 400 Gb/s bandwidth per port with ultra-low latency using GPU direct RDMA. In contrast, standard Infiniband offers only up to 50 Gb/s bandwidth while experiencing relatively high latency (up to tens of microseconds) (Infiniband-eval, ). Finally, NVIDIA Spectrum Ethernet switches provide up to 800 Gb/s bandwidth for inter-rack networking. However, current schedulers lack awareness of such high-speed network hardware, which could be harnessed for improved performance. This underscores the demand for a novel approach to DL cluster scheduling.

3.4. Limitations of current SOTA DL cluster schedulers

Current SOTA schedulers like Tiresias and Gandiva (gandiva, ) have several shortcomings: (i) Limited knowledge of the data center network: lacking awareness of the various network tiers, e.g., the medium-bandwidth GPU direct RDMA that exists within a rack, these schedulers often make sub-optimal job placement decisions, in this case mapping the job across multiple racks. (ii) A strict consolidation policy: restricting job placement to the fewest machines possible (for highly skewed models) may lead to high queueing delays when such a placement is not available. This necessitates an intermediate approach, allowing for a gradual and graceful "relaxation" of consolidation, particularly for moderately sensitive jobs, based on queueing delay and network sensitivity. (iii) Sub-optimal job priority: Tiresias employs a generalized version of Least Attained Service (LAS) (las, ) called "Discretized 2D-LAS" (2DAS) (tiresias, ) to determine job priority. The 2DAS value for a job is calculated by multiplying the time for which it has been running by the number of GPUs it utilizes. Tiresias prioritizes jobs based on their 2DAS values, with lower values indicating higher priority. Consequently, the job with the highest 2DAS value is least likely to be scheduled for execution and most likely to be preempted. However, a job with unfavorable network placement and high network sensitivity may have been running for an extended period without making significant progress, due to the excessive network overhead associated with its placement. This leads to a high 2DAS value and makes the job less likely to receive consolidated resource offers, as jobs with lower 2DAS values receive the most favorable offers.
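To make shortcoming (iii) concrete, the sketch below computes the 2DAS value as described above (running time multiplied by GPU count) for two otherwise identical jobs with different placements. The overhead percentages are the ResNet18 machine-level and network-level values from Table 1; the running time, GPU count, and per-iteration compute time are hypothetical, as are the function names. The discretization of 2DAS into priority queues is omitted.

```python
def two_das(running_time_s: float, num_gpus: int) -> float:
    """2DAS value as described in the text: running time x number of GPUs."""
    return running_time_s * num_gpus


def iterations_done(running_time_s: float, compute_time_s: float,
                    comm_overhead_pct: float) -> float:
    """Iterations completed when each iteration costs compute plus overhead.

    comm_overhead_pct is the communication overhead as a % of compute time.
    """
    iteration_time = compute_time_s * (1 + comm_overhead_pct / 100)
    return running_time_s / iteration_time


# Hypothetical setup: both jobs use 4 GPUs and have run for one hour.
RUN_TIME_S, GPUS, COMPUTE_S = 3600.0, 4, 0.1

# ResNet18 overheads from Table 1: 7% on one machine, 2749% across the network.
consolidated = iterations_done(RUN_TIME_S, COMPUTE_S, 7)
spread = iterations_done(RUN_TIME_S, COMPUTE_S, 2749)

print(two_das(RUN_TIME_S, GPUS))  # identical for both jobs
print(consolidated / spread)      # progress ratio between the two placements
```

Both jobs receive the same 2DAS value and hence the same priority, even though the consolidated job has completed more than 26 times as many iterations as the job spread across the network.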

These limitations of existing DL cluster schedulers underscore the need for a new approach that incorporates both network awareness and network sensitivity-based job consolidation. However, development of such a scheduler for cutting-edge network technology is hindered by the high cost of the associated experimental setups and the limited availability of such hardware, necessitating a high-fidelity cluster simulator.

3.5. A high-fidelity simulator for DDL scheduling research

The exorbitant cost of deep learning GPUs, with several GPUs required to create a large deep-learning cluster, poses a significant challenge in conducting real-world experiments. This, combined with the need to run such experiments for extended periods, can escalate the cost to the hundreds of thousands of dollars. Even utilizing public cloud resources may prove uneconomical as deep learning machines like the AWS P4 instances incur hourly charges of more than $32/hour (aws_pricing, ). Further complicating matters, research based on cutting-edge network hardware necessitates a simulation platform due to its limited availability. To the best of our knowledge, no public cloud provider offers access to such hardware. To address these challenges, research efforts like (rashidi2022themis, ) have turned to ASTRA-sim as a viable tool for simulating the latest network hardware. However, ASTRA-sim is a single DDL job simulator.

Existing multi-job DDL cluster scheduler simulators, such as (themis-nsdi, ), suffer from low accuracy due to their unrealistic approach to network overhead modeling. For example, (themis-nsdi, ) implements static slowdown penalties of job placement on the rack and the network irrespective of the DNN model or network hardware. But to achieve high confidence in the simulation results, a DL cluster scheduler simulator must dynamically determine the accurate network overhead of a job based on specific placement within the cluster and the type of network hardware used to accurately reflect real-world conditions. In response to this requirement, we propose a multi-job DL cluster simulator that employs ASTRA-sim to accurately determine network overhead.

4. DALLY DESIGN

We now describe Dally, a novel network-placement sensitive DL cluster scheduler that leverages the principles of delay scheduling and a novel network-sensitive preemption policy. In the following sections, we delve into the details of our system design, including the simulation methodology employed to evaluate its effectiveness.

4.1. Overview

4.1.1. System goal

The system processes a stream of DL/DDL jobs that need to be scheduled for execution. These jobs exhibit heterogeneity in terms of the DNN model, the number of iterations required to attain the target accuracy, the compute time per iteration, and the GPU demand per job. The system may encounter jobs arriving in batches so that the aggregate GPU demand across all jobs would normally surpass the total number of GPUs available in the cluster. Alternatively, jobs may arrive according to a Poisson distribution. Our goal is to construct a DL cluster scheduler that aligns with the target objectives and assumptions elaborated upon in the subsequent sections. This scheduler will function within a cluster comprising homogeneous GPUs connected via latest-generation network hardware in a hierarchical structure, as discussed in Section 3.

4.1.2. Objective

The overarching goal of our scheduler is to minimize cloud expenditure for its users by reducing the time spent occupying cluster GPUs. This objective is achieved through two primary strategies.

  (1) Minimizing the end-to-end Makespan for training multiple DNN models when submitted as a batch.

  (2) Reducing job completion time (JCT) for individual jobs.

4.1.3. Constraints

Dally must achieve the aforementioned objectives subject to the following constraints:

  (1) Unknown JCT distributions: We assume that the prior running time of any job is unknown.

  (2) Absence of model architecture information: The scheduler lacks the knowledge of the specific DNN model architecture associated with each job.

  (3) Unknown job submission pattern: The scheduler lacks information about the exact order and frequency of job arrivals.

  (4) Unknown GPU demand: The GPU resource requirements for individual jobs and the overall GPU demand across all running jobs are unknown beforehand.

4.1.4. System workloads

We employ the cluster job trace made available by SenseTime (sensetime, ) through the work presented in (lucid-asplos23, ). The traces are more recent than those utilized in (philly_trace_analysis, ) and provide a more realistic representation of a real-world DL cluster with multiple jobs being submitted over an extended period. This trace encompasses individual job arrivals alongside pertinent metadata information for each job.

4.2. Simulation methodology

In this work, we develop a high-fidelity DDL cluster simulator – Artificial Intelligence System Simulator (ArtISt-sim) – capable of accurately replicating network slowdowns induced by specific job placements throughout a job’s execution.

4.2.1. Simulator design

ArtISt-sim builds upon ASTRA-sim and a prior work, Themis (themis-nsdi, ), which is a DL cluster scheduler simulator. The choice of Themis is based on its unique capability to simulate DL jobs at an iteration level, a feature not present in other simulators like (lucid-asplos23, ; chronus-socc21, ; tiresias, ). This granular simulation approach enables us to precisely track "job progression" for preemption and subsequent placement decisions throughout the job's lifetime. However, as previously discussed, the network slowdown in the original work is encoded statically. To address this limitation, in ArtISt-sim, we modify Themis to invoke ASTRA-sim for each new job placement (see Fig. 3).

For executing ASTRA-sim with specific network placements, ArtISt-sim dynamically generates the necessary input – workload, system, and network topology files and passes them to ASTRA-sim (blue arrow in Fig. 3). These generated files delineate the ”portion” of the data-center network utilized by the worker GPUs. For instance, in Fig. 1, the grey colored job occupies 8 GPUs across two machines connected via a network link. To simulate ASTRA-sim for such a placement, ArtISt-sim produces the 8-GPU configuration files with a two-tier hierarchy representing the placement. The configuration files specify that the intra-machine GPUs are connected via NVSwitch in a ring topology, and the two machines are linked via a network (Ethernet) switch using a switch topology. Other ASTRA-sim configurations, such as collectives, are also dynamically generated. ASTRA-sim then accurately simulates the job for a single DDL iteration based on the placement and relays the communication overhead back to ArtISt-sim (blue arrow in Fig. 3), which subsequently utilizes this communication overhead to simulate the job’s progression (iterations) until it is either interrupted due to preemption or successfully completes its execution. Note that the compute time per iteration is extracted from the workload trace, and ASTRA-sim serves solely to calculate the network overhead for the specific placement. Network overhead is calculated by first finding the percentage of exposed communication time to the compute time in the ASTRA-sim simulation. The percentage value is applied to the iteration time from the trace to calculate the network overhead and the actual iteration time in ArtISt-sim.
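The overhead calculation described in the last two sentences above can be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes the actual iteration time is the trace iteration time plus the network overhead; the inputs in the example are illustrative rather than measured.

```python
def placement_iteration_time(sim_compute: float, sim_exposed_comm: float,
                             trace_iteration_s: float):
    """Map an ASTRA-sim result for one placement onto a trace iteration time.

    sim_compute and sim_exposed_comm come from a single simulated iteration
    (any consistent unit, e.g. cycles); trace_iteration_s comes from the trace.
    Returns (network_overhead_s, actual_iteration_s).
    """
    if sim_compute <= 0:
        raise ValueError("simulated compute time must be positive")
    # Exposed communication as a fraction of compute in the simulation.
    overhead_fraction = sim_exposed_comm / sim_compute
    # Apply that fraction to the iteration time taken from the trace.
    network_overhead_s = overhead_fraction * trace_iteration_s
    # Assumption: overhead adds to the trace iteration time.
    return network_overhead_s, trace_iteration_s + network_overhead_s


# Illustrative inputs: exposed communication equal to 116% of simulated
# compute, applied to a hypothetical 0.5 s trace iteration.
overhead, actual = placement_iteration_time(1000, 1160, 0.5)
print(overhead, actual)
```

Because only the ratio from ASTRA-sim is used, the simulated compute time does not need to match the trace in absolute terms; the trace supplies the time scale and the simulation supplies the placement-dependent slowdown.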

Figure 3. Simulation design

4.2.2. Simulation calibration and accuracy

To ensure the fidelity of our simulations, we employ a calibration process that incorporates real-world measurements into ArtISt-sim. We execute each model listed in Table 1 on an 8-GPU machine connected via NVSwitch and calculate the associated network overhead. Along with this, we simulate the model with the same network configurations (NVSwitch) in ASTRA-sim. Subsequently, we scale the workload configuration file of ASTRA-sim (for each model) to eliminate the discrepancy between simulated and actual network overhead. These modified workload files are then used for all ASTRA-sim calls. The ”calibrated” simulator is now ready to simulate overhead arising from rack and network-level placements.

For the verification of the various baseline scheduler implementations in ArtISt-sim, we rely on the fact that Themis has already been validated in (themis-nsdi). Furthermore, we faithfully reproduce the design principles of the Tiresias and Gandiva schedulers as outlined in their respective publications.

4.2.3. Implementation

We have implemented ArtISt-sim in about 2000 lines of Java. ArtISt-sim requires the network configuration parameters for intra-machine, intra-rack, and inter-rack interconnects. These parameters encompass the network topology, bandwidth, latency, collective communication algorithm, and other relevant details. Additionally, the user needs to specify the number of machines per rack and the total number of racks in the cluster. (Footnote 2: Upon acceptance of this work, we will freely provide comprehensive details of all configurations along with the open-source code.)

4.3. Design of the Dally scheduler

Dally is designed to minimize cloud expenditures in both public and private cloud environments. To achieve this objective, Dally uses delay scheduling for job placement and employs a novel preemption strategy that incorporates network sensitivity. When a job is submitted to the cluster, the global scheduler (refer to Fig. 4) initially places it into a wait queue. The scheduler then periodically generates “resource offers” to jobs in the wait queue based on the availability of cluster resources. If a job’s local scheduler (Fig. 4) decides to accept the offered resources, the job is moved to the run queue, where it executes until it is either preempted or completes its execution. Upon preemption, the job is reintroduced into the wait queue and remains there until it accepts another resource offer. Dally’s mechanisms governing preemption and job placement are described in the following sections.

Figure 4. Scheduling scheme.

4.3.1. Preemption priority

As already discussed in Section 3, Tiresias’s 2DAS-based priority mechanism is oblivious to network slowdowns. In a large networked cluster, an effective priority mechanism should consider the relative network sensitivity of jobs and the slowdowns they have experienced. Therefore, we propose a network-sensitive priority metric to address this shortcoming. We define the number of iterations completed as $I_{compl}$ and the total expected iterations as $I_{total-expected}$. Our approach necessitates awareness or prediction of the overall expected iteration count. This information can be readily obtained from openly accessible benchmarks like MLPerf (mlperf_mlsys20). Alternatively, users can estimate the total iteration count needed for convergence through established extrapolation techniques such as (optimus). It is important to emphasize that only an approximate value for the total expected iterations suffices.

Next, we define $T_{run}$ as the time spent by the job in the run queue and the total ideal job running time $T_{total-ideal-run}$ as the compute time per iteration of a job multiplied by the total number of expected iterations. The compute time per iteration can be measured by running a single training iteration of the model on a single GPU, where the iteration comprises the forward pass and backward pass (we assume no communication overhead for ideal JCT calculations). We now define work completed $W_{compl}$ and normalized running time $T_{norm}$ as:

$$W_{compl} = \frac{I_{compl}}{I_{total-expected}} \qquad T_{norm} = \frac{T_{run}}{T_{total-ideal-run}}.$$

Finally, we define our network-sensitive priority $Nw_{sens}$ as:

$$Nw_{sens} = \frac{W_{compl}}{T_{norm}}.$$

A lower value of $Nw_{sens}$ indicates that the job has endured greater network-induced slowdowns, and, consequently, its priority should be increased. Resource offers are hence made in increasing order of $Nw_{sens}$. This allows jobs with lower $Nw_{sens}$ to have a higher probability of receiving more consolidated resource offers.
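The priority metric above can be sketched in a few lines of Python. The `JobProgress` class, its field names, and the handling of a job that has not yet run are illustrative assumptions and not part of the published design.

```python
from dataclasses import dataclass


@dataclass
class JobProgress:
    """Hypothetical per-job bookkeeping; field names are illustrative."""
    name: str
    iters_completed: int            # I_compl
    iters_total_expected: int       # I_total-expected
    time_in_run_queue_s: float      # T_run
    compute_time_per_iter_s: float  # measured on a single GPU


def nw_sens(job: JobProgress) -> float:
    """Network-sensitive priority: W_compl / T_norm (lower = more slowed down)."""
    w_compl = job.iters_completed / job.iters_total_expected
    t_total_ideal_run = job.compute_time_per_iter_s * job.iters_total_expected
    t_norm = job.time_in_run_queue_s / t_total_ideal_run
    if t_norm == 0:
        # Assumption: a job that has never run has seen no slowdown yet.
        return 1.0
    return w_compl / t_norm


def offer_order(jobs: list[JobProgress]) -> list[JobProgress]:
    """Jobs in increasing order of Nw_sens, i.e. the order of resource offers."""
    return sorted(jobs, key=nw_sens)


jobs = [
    JobProgress("A", 500, 1000, 500.0, 1.0),   # ran at the ideal rate
    JobProgress("B", 500, 1000, 1000.0, 1.0),  # took twice the ideal time
]
print([j.name for j in offer_order(jobs)])
```

Algebraically, the total expected iteration count cancels out: $Nw_{sens}$ equals the ideal compute time of the completed iterations divided by $T_{run}$. This is consistent with the remark that only an approximate value for the total expected iterations is needed.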

4.3.2. Job placement

In a DL cluster, jobs may receive resource offers with varying levels of consolidation, meaning the GPUs offered may be located on a single machine, within a single rack, or spread across the network. Dally employs the well-established delay scheduling strategy to handle these offers. Under this scheme, a job has the option to decline offers that do not meet its consolidation preference, as described in Algo. 1. Initially, a job may reject all offers that do not place its GPUs on the same machine until a machine-level delay timer has elapsed while waiting (starving) for resources (line 10 in Algo. 1). Once the machine-level delay timer elapses, the job starts accepting offers that place all its GPUs within a single rack until the rack-level delay timer has also elapsed (line 18 in Algo. 1). From that point on, the job accepts any resource offer without considering consolidation (line 21 in Algo. 1). Note that a job that cannot fit on a single machine will have a machine-level delay timer set to 0, and similarly, a job that cannot fit on a single rack will have both delay timers set to 0.
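The accept/reject rule described above can be sketched in Python as follows. The names `Offer`, `Decision`, and `on_resource_offer` are hypothetical, the rack-level timer is treated as a cumulative starvation threshold (one reading of the text), and the actual GPU allocation and timer tuning are left out.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACCEPT_MACHINE = "accept on a single machine"
    ACCEPT_RACK = "accept on a single rack"
    ACCEPT_NETWORK = "accept across the network"
    REJECT = "reject offer"


@dataclass
class Offer:
    """Hypothetical summary of a resource offer."""
    max_free_gpus_on_one_machine: int
    max_free_gpus_on_one_rack: int


def on_resource_offer(gpu_demand: int, starvation_s: float,
                      t_machine_s: float, t_rack_s: float,
                      offer: Offer) -> Decision:
    """Delay-scheduling decision for one job and one offer."""
    # Best case: the whole job fits on one machine.
    if gpu_demand <= offer.max_free_gpus_on_one_machine:
        return Decision.ACCEPT_MACHINE
    # Keep waiting for a machine-level placement until the timer elapses.
    if starvation_s < t_machine_s:
        return Decision.REJECT
    # Next best: the whole job fits within one rack.
    if gpu_demand <= offer.max_free_gpus_on_one_rack:
        return Decision.ACCEPT_RACK
    # Keep waiting for a rack-level placement until the timer elapses.
    if starvation_s < t_rack_s:
        return Decision.REJECT
    # Both timers elapsed: take whatever is offered.
    return Decision.ACCEPT_NETWORK


HOUR = 3600
offer = Offer(max_free_gpus_on_one_machine=4, max_free_gpus_on_one_rack=16)
# An 8-GPU job that has starved for 13 hours, with 12 h and 24 h thresholds.
print(on_resource_offer(8, 13 * HOUR, 12 * HOUR, 24 * HOUR, offer))
```

In this sketch a job never rejects a placement at its preferred level, so the timers only control how long it holds out before settling for a less consolidated one.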

The underlying principle is to delay the placement of jobs in anticipation of better consolidation options becoming available in the future. This delay-based approach strikes a balance between the high queuing times associated with consolidation and the high communication overhead of non-consolidation. In traditional big data schedulers, these delay timers are typically configured by system administrators and often left at their default values of a few seconds. Based on our empirical observations, we establish a default value of 12 hours for machine-level consolidation and another 12 hours (or 24 hours in total) for rack-level consolidation. For optimal delay scheduling, delay timers need to be adjusted based on resource contention in the cluster. Generally, system administrators adjust timers based on historical contention data of the cluster. To eliminate the need for manual timer adjustments, we introduce an “auto-tuner” that automatically optimizes these delay timers. We elaborate on our auto-tuner in the following section.

4.3.3. Auto-tuner

We introduce an auto-tuner module in Dally aimed at automatically fine-tuning delay timers. This tuner relies on “moving averages” derived from historical job waiting (starvation) times associated with specific consolidation levels and GPU demands. The core concept revolves around leveraging historical data to determine an optimal waiting (delay) duration for securing a favorable job consolidation. When a job accepts a resource offer, it updates a “global list” pertaining to the consolidation level (machine or rack) and the job’s GPU demand with the time it waited before accepting the offer (lines 7 and 15 in Algo. 1). Dally maintains comprehensive lists, tracking job waiting times for each combination of consolidation level and GPU demand. In a DL cluster, the GPU demands usually align with multiples of 2, typically with 5 to 10 distinct demand sizes. As a result, this approach leads to a manageable 2-20 lists for machine-level and rack-level consolidation paired with GPU demands.

To calculate the moving averages (described in Algo. 2), Dally requires users to define the size of the wait-time list, denoted as the “history limit” (sliding window size) for the auto-tuner. The wait-time lists are maintained by discarding values (utilized in the average calculation) that exceed the specified history time limit (lines 4 and 10 in Algo. 2). Whenever a job receives a resource offer, the delay timer is calculated by averaging the wait-time list corresponding to the consolidation level (machine or rack) and GPU demand of the job, and then adding two sample standard deviations (estimated from the same wait-time list). This value is then used as the duration for which the job waits for a favorable consolidation (lines 13 and 14 in Algo. 2). We present the timeline depicting the adjustment of delay times (rack level) for a typical >P95 DDL job in Fig. 5. We observe that the auto-tuner increases the delay timers when there are fewer racks (and consequently fewer GPUs) in the cluster, which leads to more resource contention. Over time, the auto-tuner effectively “learns” the optimal duration to wait as demand delays are updated.

Figure 5. Auto tuning timeline

Note that it is essential to tailor the history limit according to the cluster size. Specifically, in larger clusters, a smaller history time limit is advised. This is simply because more jobs get placed over time in a large cluster compared to a smaller cluster.
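A minimal Python sketch of the auto-tuner idea follows. The class and method names are hypothetical. The sketch interprets the history limit as a maximum age of a recorded wait time, and it falls back to a fixed default timer when no history exists; both are assumptions, since the text does not spell out these details.

```python
import statistics
from collections import defaultdict


class AutoTuner:
    """Sliding-window estimate of delay timers per (level, GPU demand)."""

    def __init__(self, history_limit_s: float, default_timer_s: float):
        self.history_limit_s = history_limit_s
        self.default_timer_s = default_timer_s  # assumed fallback value
        # (level, gpu_demand) -> list of (recorded_at_s, wait_s)
        self._waits = defaultdict(list)

    def update_demand_delay(self, level: str, gpu_demand: int,
                            wait_s: float, now_s: float) -> None:
        """Record how long a job waited before accepting an offer."""
        self._waits[(level, gpu_demand)].append((now_s, wait_s))

    def tuned_timer(self, level: str, gpu_demand: int, now_s: float) -> float:
        """Mean wait time plus two sample standard deviations."""
        key = (level, gpu_demand)
        # Drop entries older than the history limit (assumed interpretation).
        recent = [(t, w) for t, w in self._waits[key]
                  if now_s - t <= self.history_limit_s]
        self._waits[key] = recent
        waits = [w for _, w in recent]
        if not waits:
            return self.default_timer_s
        mean = statistics.fmean(waits)
        # statistics.stdev is the sample standard deviation; it needs >= 2 points.
        std = statistics.stdev(waits) if len(waits) > 1 else 0.0
        return mean + 2 * std


tuner = AutoTuner(history_limit_s=100.0, default_timer_s=50.0)
for recorded_at, wait in [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]:
    tuner.update_demand_delay("rack", 8, wait, recorded_at)
print(tuner.tuned_timer("rack", 8, now_s=50.0))
```

Adding two standard deviations to the mean makes the timer cover most of the recently observed wait times rather than only the typical one, so a job gives up on consolidation only after waiting unusually long by recent standards.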

Algorithm 1 On Resource Offer
Input: Resource offer $R$
Output: Accept or reject offer
1: $T_{last\_assignment}$ = get_last_resource_assignment_time()
2: $T_{starvation} = T_{curr\_time} - T_{last\_assignment}$
3: $g_d$ = GPUs demanded by the job
4: $T_{Mc}$, $T_{Rk}$ = get_tuned_timers($g_d$)
5: if $g_d \leq$ GPUs available on a single machine in $R$ then
6:     Allocate $g_d$ GPUs on machine
7:     update_demand_delay(machine, $T_{starvation}$, $g_d$)
8:     return accept resource
9: end if
10: if $T_{starvation} < T_{Mc}$ then
11:     return reject offer
12: end if
13: if $g_d \leq$ GPUs available on a single rack in $R$ then
14:     Allocate $g_d$ GPUs on rack
15:     update_demand_delay(rack, $T_{starvation}$, $g_d$)
16:     return accept resource
17: end if
18: if $T_{starvation} < T_{Rk}$ then
19:     return reject offer
20: end if
21: Allocate $g_d$ GPUs on network
22: return accept offer

Algorithm 2 Get Tuned Timers
Input: GPU demand $G_d$
Output: Tuned delay timers
1: mc_starvation_times = get_delay_list(machine, $G_d$)
2: for time in mc_starvation_times do
3:     if time > HISTORY_TIME_LIMIT then
4:         remove time from mc_starvation_times
5:     end if
6: end for
7: rack_starvation_times = get_delay_list(rack, $G_d$)
8: for time in rack_starvation_times do
9:     if time > HISTORY_TIME_LIMIT then
10:         remove time from rack_starvation_times
11:     end if
12: end for
13: return average(mc_starvation_times) + 2 × standard_deviation(mc_starvation_times),
14:        average(rack_starvation_times) + 2 × standard_deviation(rack_starvation_times)

5. EVALUATION

In this section, we assess Dally through extensive simulations driven by large-scale traces, following the previously outlined methodology. It is important to emphasize that our evaluations leverage the most recent generation of network hardware, necessitating the use of simulations as the exclusive option.

5.1. Experimental setup

5.1.1. Workloads

In our experiments, we employ real-world production traces obtained from SenseTime. We randomly select 500 DL/DDL jobs for batch and about 400 DL/DDL jobs for Poisson workloads from the trace, which utilize the models specified in Table 1. (Footnote 3: From the original SenseTime trace, we replace EfficientNet with AlexNet for a high skew workload.) The table encompasses a diverse range of network overheads, providing ample coverage for testing a DDL cluster executing various network-sensitive DNN models. The cluster trace furnishes the compute requirements for individual jobs, including the time per iteration, the necessary number of iterations, and the quantity of GPUs demanded. We assume a data-parallel distributed training paradigm.

5.1.2. Cluster setup

In our experiments, we adopt the configuration of a three-tier network data center. Each machine is equipped with eight GPUs interconnected through NVIDIA NVSwitch. Within each rack, eight machines are connected via the NVIDIA Quantum switch, and the racks across the cluster are interlinked through the NVIDIA Spectrum network switch. To set up our network simulations in ASTRA-sim, we rely on the QoS data advertised by NVIDIA (nvidia-quantum; nvidia-spectrum). Our experimentation involves varying the number of racks, specifically using 2, 4, 8, and 16 racks, to observe the impact of varying compute and network resources.

Figure 6. Batch arrival results: (a) Makespan; (b) Tail queueing delay; (c) Average communication overhead

5.1.3. Baselines

We use the following three baselines to evaluate our scheme:

  (1) Gandiva: One of the earliest DL cluster schedulers. Being network-agnostic, Gandiva exhibits sub-optimal performance. It addresses this limitation by migrating jobs to more favorable GPU consolidation whenever resources become available.

  (2) Tiresias: A state-of-the-art DL cluster scheduler that consolidates job placement based on its concept of skew in DNN models. This scheduler possesses partial knowledge of the network, and its consolidation is stringent for skewed DNN models.

  (3) Dally-manual: A modification of our proposed scheme in which the delay timers for delay scheduling are manually configured. This aligns with the approach utilized by contemporary state-of-the-art big-data cluster schedulers, such as YARN. Note that the preemption mechanism remains the same as in our proposed scheme, Dally with auto-tuning. We use this baseline to demonstrate the effectiveness of our delay timer auto-tuner.

5.1.4. Performance metrics

We define the following metrics to measure performance benefits.

  (1) Makespan: The time elapsed from the arrival of the first job to the completion of the final job in the cluster.

  (2) JCT: The job completion time (JCT) is the elapsed time between job arrival and completion in the cluster.

  (3) Queueing delay: Time spent by a job in the wait queue.

  (4) Communication overhead: Exposed communication time of a job.
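The first three metrics can be computed from per-job timestamps, as in the following minimal Python sketch. The `JobRecord` fields are hypothetical and are not taken from the simulator's actual output format.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-job log entry (all times in seconds)."""
    arrival_s: float
    completion_s: float
    queued_s: float  # total time spent in the wait queue


def makespan(jobs: list[JobRecord]) -> float:
    """Time from the first arrival to the last completion."""
    return max(j.completion_s for j in jobs) - min(j.arrival_s for j in jobs)


def jct(job: JobRecord) -> float:
    """Job completion time: completion minus arrival."""
    return job.completion_s - job.arrival_s


def average_jct(jobs: list[JobRecord]) -> float:
    return sum(jct(j) for j in jobs) / len(jobs)


def average_queueing_delay(jobs: list[JobRecord]) -> float:
    return sum(j.queued_s for j in jobs) / len(jobs)


jobs = [JobRecord(0.0, 100.0, 10.0), JobRecord(20.0, 250.0, 50.0)]
print(makespan(jobs), average_jct(jobs), average_queueing_delay(jobs))
```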

5.2. Results

Figure 7. JCT CDF - Batch arrival: (a) 2 racks; (b) 4 racks; (c) 8 racks; (d) 16 racks

Figure 8. JCT histogram (P95) - Poisson arrival: (a) 2 racks; (b) 4 racks; (c) 8 racks; (d) 16 racks

5.2.1. Makespan

Fig. 6 (a) illustrates the improvement in Makespan. Notably, Dally exhibits an improvement of up to 69% compared to Tiresias and up to 92% compared to Gandiva. This improvement is attributed to delay scheduling, which relaxes stringent consolidation, thereby decreasing queueing delay, and optimizes JCT by judiciously awaiting resource offers with favorable consolidation. Relaxing consolidation essentially implies that a job may be positioned in less favorable network tiers, a scenario which may not be feasible in Tiresias. The manual configuration of delay timers in Dally-manual results in sub-optimal waiting, as it cannot dynamically “tune” its delay timers based on contention. The static nature of the delay time fails to consider GPU demand and resource contention in the cluster, leading to inferior performance compared to Dally, while still outperforming Tiresias and Gandiva due to reduced queuing delays.

5.2.2. Queueing delay

The primary contributor to the significant improvement in Makespan is the reduction in queueing delays experienced within the cluster. Although Dally reports an average queueing delay improvement ranging from 18% to 31%, depending on the cluster size, the cluster’s Makespan is more notably influenced by the >P95 and >P99 JCTs (i.e., the 95th and 99th complementary percentiles of the JCT distribution). Dally demonstrates a remarkable improvement of 58% to 71% in >P95 and 67% to 78% in >P99 queueing delay compared to Tiresias (Fig. 6 (b)), which results in the substantial improvement in Makespan. The reduction in queueing delay can be attributed to the less stringent consolidation policy of Dally compared to Tiresias, although it is not as relaxed as Gandiva’s policy. Yet, Dally improves upon Gandiva in tail (>P95 and >P99) queueing delay by as much as 81%. In fact, Gandiva’s network-agnostic scheduling policy leads to very high communication overhead.

5.2.3. Communication overhead

Dally demonstrates the lowest communication overhead, surpassing the nearest network-aware baseline, Tiresias, by a notable margin ranging from 53% to 83% (Fig. 6 (c)). This is due to Dally’s preemption policy, which prioritizes providing better-consolidated placements to jobs suffering from sub-optimal placements or network sensitivity, thereby mitigating communication overhead. Gandiva exhibits a slight performance difference in Makespan compared to Dally in the initial 2-rack experiment but experiences a substantial degradation as the number of racks increases. This discrepancy arises because the 2-rack experiment involves fewer network connections and consequently fewer network placements. However, with an increase in the number of racks, the cluster experiences a rise in network connections and leads to jobs being placed on the network, resulting in elevated communication overhead, as depicted in Fig. 6 (c). Additionally, this phenomenon results in a long tail in the number of job completions within the cluster, subsequently leading to low cluster (GPU) utilization as depicted in Fig. 9.

Figure 9. Cluster GPU utilization (8 racks): (a) Batch; (b) Poisson

Figure 10. Number of jobs remaining (8 racks): (a) Batch; (b) Poisson

5.2.4. Cluster utilization

We discuss cluster utilization using results from the 8-rack experiment; the other experiments yield similar results. Fig. 10 illustrates the number of remaining jobs in the cluster, highlighting Gandiva’s notable tailing effect. Gandiva jobs (including the >P95 and >P99 jobs) experience substantial communication overhead, resulting in diminished cluster utilization, as depicted in Fig. 9. Dally’s efficacy in minimizing communication overhead and queuing delays contributes to the highest job completion rates among all tested schemes, as shown in Fig. 10 (a proxy for cluster utilization), along with the lowest JCTs.

5.2.5. Job completion time

To assess the JCT, we utilize traces with both batch and Poisson job arrivals, recognizing that the average JCT can vary significantly based on the job arrival pattern. In Fig. 11(a), we present the average JCTs with batch arrivals, revealing that Dally achieves an improvement in average JCT ranging from 19% to 36% compared to Tiresias and 23% to 51% compared to Gandiva. Examination of the JCT Cumulative Distribution Functions (CDFs) in Fig. 7 indicates that Dally reduces JCT for a majority of the job sizes. Notably, in the 2 racks experiment, Dally experiences a 39% degradation against Gandiva, which performs better with fewer aggregate network placements. However, as the number of racks (and network connections) increases, Gandiva’s performance degrades rapidly due to increased communication overhead. Additional statistics for the JCT in the 8-rack experiment are presented in Table 2 showcasing Dally’s enhancements in tail JCT.

To assess Dally under Poisson arrival conditions, we conduct experiments analogous to those performed with batch arrivals, but with jobs exhibiting typical Poisson arrival times. Dally demonstrates improvements over Tiresias ranging from 16% to 34% and over Gandiva by 23% to 51% as shown in Fig. 11(b). However, in the 4-rack scenario, Dally experiences degradation compared to Tiresias due to the random arrival pattern inherent in a Poisson process. A drawback worth noting is that Tiresias exhibits superior performance in identifying optimal placements for this particular arrival pattern and network size (racks), improving upon its own average JCT in the 8-rack and 16-rack experiments. This indicates that the specific arrival pattern was well suited for Tiresias in the 4-racks case. Also, recall from Sec. 4.3.1 that Dally does not emphasize JCT when prioritizing jobs for GPU consolidation. However, Fig. 8 presents histograms of JCTs (P95) with Poisson arrival times, illustrating Dally’s improvement in the JCTs of the majority of the jobs. Detailed JCT statistics for the 8-rack Poisson experiment are provided in Table 3. We find Dally outperforming Tiresias in both average and median values by 35% and 38% respectively while being within 10% of the tail. Dally also outperforms Gandiva in average and median JCT by 83% and 88% respectively and tail JCT by 76-92%.

Figure 11. Average JCT: (a) Batch arrival; (b) Poisson arrival

Metric     Gandiva      Tiresias     Dally-manual   Dally
Average    5828719      4461998      3128566        2828955
Median     183717       2547080      2552189        2401390
>P95       17485823     17841606     7097646        5983042
>P99       39214710     25696135     8868401        8395214
Table 2. JCT batch statistics in seconds (8 racks)

Metric     Gandiva      Tiresias     Dally-manual   Dally
Average    17371514     4329941      4960727        2831880
Median     22722917     4212795      2787855        2595529
>P95       71362014     15059080     39298828       16469069
>P99       208939883    15390860     39817762       16681241
Table 3. JCT Poisson statistics in seconds (8 racks)
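
As a worked check, the relative improvements quoted for the 8-rack Poisson experiment can be recomputed from the values in Table 3. The helper below is only an illustration; `improvement_pct` is a hypothetical name, and the formula (baseline minus Dally, divided by baseline) is the usual definition of relative reduction rather than one stated explicitly in the text.

```python
def improvement_pct(baseline_s: float, ours_s: float) -> float:
    """Relative JCT reduction over a baseline, in percent (negative = worse)."""
    return 100.0 * (baseline_s - ours_s) / baseline_s


# JCT statistics in seconds, copied from Table 3 (8 racks, Poisson arrivals).
TABLE3 = {
    "Gandiva":  {"Average": 17371514, "Median": 22722917,
                 ">P95": 71362014, ">P99": 208939883},
    "Tiresias": {"Average": 4329941, "Median": 4212795,
                 ">P95": 15059080, ">P99": 15390860},
    "Dally":    {"Average": 2831880, "Median": 2595529,
                 ">P95": 16469069, ">P99": 16681241},
}

for baseline in ("Tiresias", "Gandiva"):
    for metric, dally_value in TABLE3["Dally"].items():
        pct = improvement_pct(TABLE3[baseline][metric], dally_value)
        print(f"Dally vs {baseline:8s} {metric:8s} {pct:6.1f}%")
```

The results are roughly 34.6% (average) and 38.4% (median) over Tiresias, with the tail about 8% to 9% worse, and roughly 83.7% (average), 88.6% (median), and 77% to 92% (tail) over Gandiva. These agree with the figures quoted in the text up to rounding.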

6. CONCLUSIONS

Training DDL jobs on GPU clusters is expensive in both public and private cloud environments. A significant part of the training time is due to the communication overhead across the hierarchical organization of GPUs in a datacenter. This communication overhead and the resultant overall training time can be minimized by a prudent allocation of GPUs, where a distributed training job is consolidated on physically proximal GPUs in proportion to its network sensitivity.

In this context, we propose Dally, a cost-effective DL cluster scheduling scheme which employs delay scheduling for job placement and consolidation. Dally incorporates a novel network-sensitive preemption technique to prioritize jobs with high network sensitivity, ensuring they receive more consolidated resource offers. Additionally, we propose and integrate an auto-tuner into Dally to optimize delay timers according to varying GPU contention/demand for delay scheduling. To assess the performance of Dally, we introduce ArtISt-sim, a cost-effective DL simulation platform tailored for simulating large DL clusters. Through extensive trace-driven simulations, we demonstrate that Dally can reduce end-to-end makespan by up to 69% compared to consolidation-based policies, average JCT by up to 83%, and communication overhead by up to 98% under congested networking conditions.

Acknowledgements

This research was supported in part by NSF grants #2116962, #2122155, #2028929.

References

  • [1] Niket Agarwal, Tushar Krishna, Li-Shiuan Peh, and Niraj K. Jha. Garnet: A detailed on-chip network model inside a full-system simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software, pages 33–42, 2009.
  • [2] Alibaba PAI. https://github.com/AlibabaPAI, Accessed: 2022.06.15.
  • [3] Analytical network backend. https://github.com/astra-sim/astra-network-analytical, Accessed: 2023.
  • [4] AWS EC2 pricing. https://aws.amazon.com/ec2/pricing/on-demand/, Accessed: 2023.
  • [5] Shubham Chaudhary, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, and Srinidhi Viswanatha. Balancing efficiency and fairness in heterogeneous gpu clusters for deep learning. In Proceedings of the Fifteenth European Conference on Computer Systems, EuroSys ’20, New York, NY, USA, 2020. Association for Computing Machinery.
  • [6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
  • [7] Wei Gao, Zhisheng Ye, Peng Sun, Yonggang Wen, and Tianwei Zhang. Chronus: A novel deadline-aware scheduler for deep learning training jobs. In Proceedings of the ACM Symposium on Cloud Computing, SoCC ’21, page 609–623, New York, NY, USA, 2021. Association for Computing Machinery.
  • [8] Wei Gao, Zhisheng Ye, Peng Sun, Yonggang Wen, and Tianwei Zhang. Chronus: A novel deadline-aware scheduler for deep learning training jobs. In Proceedings of the ACM Symposium on Cloud Computing, SoCC ’21, page 609–623, New York, NY, USA, 2021. Association for Computing Machinery.
  • [9] GEneral Matrix Multiplication (GEMM). https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html, Accessed: 2023.
  • [10] GPU direct RDMA. https://docs.nvidia.com/cuda/gpudirect-rdma/index.html, Accessed: 2023.
  • [11] Sorin Mihai Grigorescu, Bogdan Trasnea, Tiberiu T. Cocias, and Gigel Macesanu. A survey of deep learning techniques for autonomous driving. CoRR, abs/1910.07738, 2019.
  • [12] Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. Tiresias: A GPU cluster manager for distributed deep learning. In NSDI, pages 485–500. USENIX Association, 2019.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, 2016.
  • [14] Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, and Tianwei Zhang. Characterization and prediction of deep learning workloads in large-scale GPU datacenters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’21, New York, NY, USA, 2021. Association for Computing Machinery.
  • [15] Qinghao Hu, Meng Zhang, Peng Sun, Yonggang Wen, and Tianwei Zhang. Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, pages 457–472, New York, NY, USA, 2023. Association for Computing Machinery.
  • [16] InfiniBand. https://en.wikipedia.org/wiki/InfiniBand, Accessed: 2022.06.08.
  • [17] Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In ATC, pages 947–960. USENIX Association, 2019.
  • [18] Mladan Jovanovic and Mark Campbell. Generative artificial intelligence: Trends and prospects. Computer, 55(10):107–112, 2022.
  • [19] M. R. Siavash Katebzadeh, Paolo Costa, and Boris Grot. Evaluation of an InfiniBand switch: Choose latency or bandwidth, but not both. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 180–191, 2020.
  • [20] T. Khan, S. Rashidi, S. Sridharan, P. Shurpali, A. Akella, and T. Krishna. Impact of RoCE Congestion Control Policies on Distributed Training of DNNs. In Proc. IEEE Symp. on High-Performance Interconnects, July 2022.
  • [21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, volume 25, 2012.
  • [22] Ang Li, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan R Tallent, and Kevin J Barker. Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. IEEE TPDS, 31(1):94–110, 2019.
  • [23] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. Proc. VLDB Endow., 13(12):3005–3018, August 2020.
  • [24] Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, Aditya Akella, Amar Phanishayee, and Shuchi Chawla. Themis: Fair and Efficient GPU Cluster Scheduling. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 289–304, Santa Clara, CA, February 2020. USENIX Association.
  • [25] Luo Mai, Guo Li, Marcel Wagenländer, Konstantinos Fertakis, Andrei-Octavian Brabete, and Peter Pietzuch. KungFu: Making training in distributed machine learning adaptive. In OSDI, pages 937–954. USENIX Association, 2020.
  • [26] Peter Mattson, Christine Cheng, Gregory Diamos, Cody Coleman, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, et al. MLPerf training benchmark. MLSys, 2:336–349, 2020.
  • [27] Machine Learning as a Service (MLaaS). https://levity.ai/blog/mlaas-platforms-comparative-guide, Accessed: 2023.
  • [28] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proc. ACM SOSP, 2019.
  • [29] Misja Nuyens and Adam Wierman. The foreground–background queue: a survey. Performance Evaluation, 65(3-4):286–307, 2008.
  • [30] NVIDIA Quantum switch. https://nvdam.widen.net/s/k8sqcr6gzb/infiniband-quantum-2-qm9700-series-datasheet-us-nvidia-1751454-r8-web, Accessed: 2023.
  • [31] NVIDIA Spectrum switch. https://nvdam.widen.net/s/mmvbnpk8qk/networking-ethernet-switches-sn5000-datasheet-us, Accessed: 2023.
  • [32] NVLink. https://www.nvidia.com/en-us/data-center/nvlink/, Accessed: 2023.
  • [33] Kaoru Ota, Minh Son Dao, Vasileios Mezaris, and Francesco GB De Natale. Deep learning for mobile multimedia: A survey. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 13(3s):1–22, 2017.
  • [34] Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, and Chuanxiong Guo. Optimus: An efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference, EuroSys ’18, New York, NY, USA, 2018. Association for Computing Machinery.
  • [35] Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing. Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pages 1–18. USENIX Association, July 2021.
  • [36] Saeed Rashidi, Pallavi Shurpali, Srinivas Sridharan, Naader Hassani, Dheevatsa Mudigere, Krishnakumar Nair, Misha Smelyanski, and Tushar Krishna. Scalable distributed training of recommendation models: An ASTRA-sim+ ns3 case-study with TCP/IP transport. In Proc. IEEE HOTI, pages 33–42, 2020.
  • [37] Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. ASTRA-sim: Enabling SW/HW co-design exploration for distributed DL training platforms. In Proc. IEEE ISPASS, pages 81–92, 2020.
  • [38] Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, and Tushar Krishna. Themis: A network bandwidth-aware collective scheduling policy for distributed training of DL models. In Proc. ACM/IEEE ISCA, pages 581–596, 2022.
  • [39] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  • [40] Aakash Sharma, Vivek M. Bhasi, Sonali Singh, Rishabh Jain, Jashwant Raj Gunasekaran, Subrata Mitra, Mahmut Taylan Kandemir, George Kesidis, and Chita R. Das. Stash: A comprehensive stall-centric characterization of public cloud VMs for distributed deep learning. In 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS), pages 1–12, 2023.
  • [41] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
  • [42] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, San Diego, CA, May 7-9, 2015.
  • [43] Apache Spark. https://spark.apache.org/.
  • [44] TPU. https://en.wikipedia.org/wiki/Tensor_Processing_Unit, Accessed: 2022.06.15.
  • [45] William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale. https://arxiv.org/abs/2303.14006, 2023.
  • [46] Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. Gandiva: Introspective cluster scheduling for deep learning. In OSDI, pages 595–610. USENIX Association, 2018.
  • [47] Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. AntMan: Dynamic scaling on GPU clusters for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 533–548. USENIX Association, November 2020.
  • [48] Apache Hadoop YARN. https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html.
  • [49] Gingfung Yeung, Damian Borowiec, Renyu Yang, Adrian Friday, Richard Harper, and Peter Garraghan. Horus: Interference-aware and prediction-based scheduling in deep learning systems. IEEE Transactions on Parallel and Distributed Systems, 33(1):88–100, 2022.
  • [50] Matei Zaharia, Dhruba Borthakur, Joydeep Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys ’10: Proceedings of the EuroSys 2010 Conference, pages 265–278, 2010.