-
How to Rent GPUs on a Budget
Authors:
Zhouzi Li,
Benjamin Berg,
Arpan Mukhopadhyay,
Mor Harchol-Balter
Abstract:
The explosion in Machine Learning (ML) over the past ten years has led to a dramatic increase in demand for GPUs to train ML models. Because it is prohibitively expensive for most users to build and maintain a large GPU cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have seen explosive growth in demand for renting cloud-based GPUs. In this cloud-computing paradigm, a user must specify their demand for GPUs at every moment in time, and will pay for every GPU-hour they use. ML training jobs are known to be parallelizable to different degrees. Given a stream of ML training jobs, a user typically wants to minimize the mean response time across all jobs. Here, the response time of a job denotes the time from when a job arrives until it is complete. Additionally, the user is constrained by some operating budget. Specifically, in this paper the user is constrained to use no more than $b$ GPUs per hour, over a long-run time average. The question is how to minimize mean response time while meeting the budget constraint. Because training jobs receive a diminishing marginal benefit from running on additional GPUs, allocating too many GPUs to a single training job can dramatically increase the overall cost paid by the user. Hence, an optimal rental policy must balance a tradeoff between training cost and mean response time. This paper derives the optimal rental policy for a stream of training jobs where the jobs have different levels of parallelizability (specified by a speedup function) and different job sizes (amounts of inherent work). We make almost no assumptions about the arrival process and about the job size distribution. Our optimal policy specifies how many GPUs to rent at every moment in time and how to allocate these GPUs.
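The tradeoff between training cost and response time described above can be illustrated with a small sketch. It assumes a hypothetical power-law speedup function s(k) = k^p with 0 < p < 1; the abstract does not state the form of the speedup function, so this choice and all names below are illustrative only.

```python
def speedup(k, p=0.5):
    """Hypothetical sublinear speedup: a job runs k**p times faster on k GPUs."""
    return k ** p


def run_time_and_cost(size, k, p=0.5):
    """Return (hours to finish, GPU-hours paid) for a job of `size` hours of work on k GPUs."""
    hours = size / speedup(k, p)
    gpu_hours = k * hours  # the user pays for every GPU-hour rented
    return hours, gpu_hours


# More GPUs shorten the job but raise the total GPU-hours paid.
for k in (1, 4, 16):
    hours, gpu_hours = run_time_and_cost(size=8.0, k=k)
    print(f"k={k:2d}  time={hours:.2f} h  cost={gpu_hours:.2f} GPU-hours")
```

Under this assumed speedup, quadrupling the GPUs halves the running time but doubles the cost, which is the tension a budget-constrained rental policy has to resolve.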
Submitted 21 June, 2024;
originally announced June 2024.
-
Analysis of Markovian Arrivals and Service with Applications to Intermittent Overload
Authors:
Isaac Grosof,
Yige Hong,
Mor Harchol-Balter
Abstract:
Almost all queueing analysis assumes i.i.d. arrivals and service. In reality, arrival and service rates fluctuate over time. In particular, it is common for real systems to intermittently experience overload, where the arrival rate temporarily exceeds the service rate, which an i.i.d. model cannot capture. We consider the MAMS system, where the arrival and service rates each vary according to an arbitrary finite-state Markov chain, allowing intermittent overload to be modeled.
We derive the first explicit characterization of mean queue length in the MAMS system, with explicit bounds for all arrival and service chains at all loads. Our bounds are tight in heavy traffic. We prove even stronger bounds for the important special case of two-level arrivals with intermittent overload.
Our key contribution is an extension to the drift method, based on the novel concepts of relative arrivals and relative completions. These quantities allow us to tractably capture the transient correlational effect of the arrival and service processes on the mean queue length.
Submitted 7 May, 2024;
originally announced May 2024.
-
Can Increasing the Hit Ratio Hurt Cache Throughput?
Authors:
Ziyue Qiu,
Juncheng Yang,
Mor Harchol-Balter
Abstract:
Software caches are an intrinsic component of almost every computer system. Consequently, caching algorithms, particularly eviction policies, are the topic of many papers. Almost all these prior papers evaluate the caching algorithm based on its hit ratio, namely the fraction of requests that are found in the cache, as opposed to disk. The hit ratio is viewed as a proxy for traditional performance metrics like system throughput or response time. Intuitively it makes sense that higher hit ratio should lead to higher throughput (and lower response time), since more requests are found in the cache (low access time) as opposed to the disk (high access time).
This paper challenges this intuition. We show that increasing the hit ratio can actually hurt the throughput (and response time) for many caching algorithms. Our investigation follows a three-pronged approach involving (i) queueing modeling and analysis, (ii) implementation and measurement, and (iii) simulation to validate the accuracy of the queueing model. We also show that the phenomenon of throughput decreasing at higher hit ratios is likely to be more pronounced in future systems, where the trend is towards faster disks and higher numbers of cores per CPU.
Submitted 24 April, 2024;
originally announced April 2024.
-
Asymptotically Optimal Scheduling of Multiple Parallelizable Job Classes
Authors:
Benjamin Berg,
Benjamin Moseley,
Weina Wang,
Mor Harchol-Balter
Abstract:
Many modern computing workloads are composed of parallelizable jobs. A single parallelizable job can be completed more quickly if it is run on additional servers; however, each job is typically limited in the number of servers it can run on (its parallelizability level). A job's parallelizability level is determined by the type of computation the job performs and how it was implemented. As a result, a single workload of parallelizable jobs generally consists of multiple $\textit{job classes}$, where jobs from different classes may have different parallelizability levels. The inherent sizes of jobs from different classes may also be vastly different.
This paper considers the important, practical problem of how to schedule an arbitrary number of classes of parallelizable jobs. Here, each class of jobs has an associated job size distribution and parallelizability level. Given a limited number of servers, $k$, we ask how to allocate the $k$ servers across a stream of arriving jobs in order to minimize the $\textit{mean response time}$ -- the average time from when a job arrives to the system until it is completed.
The problem of optimal scheduling in multiserver systems is known to be difficult, even when jobs are not parallelizable. To solve the harder problem of scheduling multiple classes of parallelizable jobs, we turn to asymptotic scaling regimes. We find that in lighter-load regimes (i.e., Sub-Halfin-Whitt), the optimal allocation algorithm is Least-Parallelizable-First (LPF), a policy that prioritizes jobs from the least parallelizable job classes. By contrast, we also find that in the heavier-load regimes (i.e., Super-NDS), the optimal allocation algorithm prioritizes the jobs with the Shortest Expected Remaining Processing Time (SERPT). We also develop scheduling policies that perform optimally when the scaling regime is not known to the system a priori.
Submitted 30 March, 2024;
originally announced April 2024.
-
The RESET and MARC Techniques, with Application to Multiserver-Job Analysis
Authors:
Isaac Grosof,
Yige Hong,
Mor Harchol-Balter,
Alan Scheller-Wolf
Abstract:
Multiserver-job (MSJ) systems, where jobs need to run concurrently across many servers, are increasingly common in practice. The default service ordering in many settings is First-Come First-Served (FCFS) service. Virtually all theoretical work on MSJ FCFS models focuses on characterizing the stability region, with almost nothing known about mean response time.
We derive the first explicit characterization of mean response time in the MSJ FCFS system. Our formula characterizes mean response time up to an additive constant, which becomes negligible as arrival rate approaches throughput, and allows for general phase-type job durations.
We derive our result by utilizing two key techniques: REduction to Saturated for Expected Time (RESET) and MArkovian Relative Completions (MARC).
Using our novel RESET technique, we reduce the problem of characterizing mean response time in the MSJ FCFS system to an M/M/1 with Markovian service rate (MMSR). The Markov chain controlling the service rate is based on the saturated system, a simpler closed system which is far more analytically tractable.
Unfortunately, the MMSR has no explicit characterization of mean response time. We therefore use our novel MARC technique to give the first explicit characterization of mean response time in the MMSR, again up to constant additive error. We specifically introduce the concept of "relative completions," which is the cornerstone of our MARC technique.
Submitted 2 October, 2023;
originally announced October 2023.
-
Optimal Scheduling in the Multiserver-job Model under Heavy Traffic
Authors:
Isaac Grosof,
Ziv Scully,
Mor Harchol-Balter,
Alan Scheller-Wolf
Abstract:
Multiserver-job systems, where jobs require concurrent service at many servers, occur widely in practice. Essentially all of the theoretical work on multiserver-job systems focuses on maximizing utilization, with almost nothing known about mean response time. In simpler settings, such as various known-size single-server-job settings, minimizing mean response time is merely a matter of prioritizing small jobs. However, for the multiserver-job system, prioritizing small jobs is not enough, because we must also ensure servers are not unnecessarily left idle. Thus, minimizing mean response time requires prioritizing small jobs while simultaneously maximizing throughput. Our question is how to achieve these joint objectives.
We devise the ServerFilling-SRPT scheduling policy, which is the first policy to minimize mean response time in the multiserver-job model in the heavy traffic limit. In addition to proving this heavy-traffic result, we present empirical evidence that ServerFilling-SRPT outperforms all existing scheduling policies for all loads, with improvements by orders of magnitude at higher loads.
Because ServerFilling-SRPT requires knowing job sizes, we also define the ServerFilling-Gittins policy, which is optimal when sizes are unknown or partially known.
Submitted 4 November, 2022;
originally announced November 2022.
-
The Gittins Policy in the M/G/1 Queue
Authors:
Ziv Scully,
Mor Harchol-Balter
Abstract:
The Gittins policy is a highly general scheduling policy that minimizes a wide variety of mean holding cost metrics in the M/G/1 queue. Perhaps most famously, Gittins minimizes mean response time in the M/G/1 when jobs' service times are unknown to the scheduler. Gittins also minimizes weighted versions of mean response time. For example, the well-known "$cμ$ rule", which minimizes class-weighted mean response time in the multiclass M/M/1, is a special case of Gittins.
However, despite the extensive literature on Gittins in the M/G/1, it contains no fully general proof of Gittins's optimality. This is because Gittins was originally developed for the multi-armed bandit problem. Translating arguments from the multi-armed bandit to the M/G/1 is technically demanding, so it has only been done rigorously in some special cases. The extent of Gittins's optimality in the M/G/1 is thus not entirely clear.
In this work we provide the first fully general proof of Gittins's optimality in the M/G/1. The optimality result we obtain is even more general than was previously known. For example, we show that Gittins minimizes mean slowdown in the M/G/1 with unknown or partially known service times, and we show that Gittins's optimality holds under batch arrivals. Our proof uses a novel approach that works directly with the M/G/1, avoiding the difficulties of translating from the multi-armed bandit problem.
Submitted 20 November, 2021;
originally announced November 2021.
-
How to Schedule Near-Optimally under Real-World Constraints
Authors:
Ziv Scully,
Mor Harchol-Balter
Abstract:
Scheduling is a critical part of practical computer systems, and scheduling has also been extensively studied from a theoretical perspective. Unfortunately, there is a gap between theory and practice, as the optimal scheduling policies presented by theory can be difficult or impossible to perfectly implement in practice. In this work, we use recent breakthroughs in queueing theory to begin to bridge this gap. We show how to translate theoretically optimal policies -- which provably minimize mean response time (a.k.a. latency) -- into near-optimal policies that are easily implemented in practical settings. Specifically, we handle the following real-world constraints:
- We show how to schedule in systems where job sizes (a.k.a. running time) are unknown, or only partially known. We do so using simple policies that achieve performance very close to the much more complicated theoretically optimal policies.
- We show how to schedule in systems that have only a limited number of priority levels available. We show how to adapt theoretically optimal policies to this constrained setting and determine how many levels we need for near-optimal performance.
- We show how to schedule in systems where job preemption can only happen at specific checkpoints. Adding checkpoints allows for smarter scheduling, but each checkpoint incurs time overhead. We give a rule of thumb that near-optimally balances this tradeoff.
Submitted 22 October, 2021;
originally announced October 2021.
-
Computing the Death Rate of COVID-19
Authors:
Naveen Pai,
Sean Zhang,
Mor Harchol-Balter
Abstract:
The Infection Fatality Rate (IFR) of COVID-19 is difficult to estimate because the number of infections is unknown and there is a lag between each infection and the potentially subsequent death. We introduce a new approach for estimating the IFR by first estimating the entire sequence of daily infections. Unlike prior approaches, we incorporate existing data on the number of daily COVID-19 tests into our estimation; knowing the test rates helps us estimate the ratio between the number of cases and the number of infections. Also unlike prior approaches, rather than determining a constant lag from studying a group of patients, we treat the lag as a random variable, whose parameters we determine empirically by fitting our infections sequence to the sequence of deaths. Our approach allows us to narrow our estimation to smaller time intervals in order to observe how the IFR changes over time. We analyze a 250 day period starting on March 1, 2020. We estimate that the IFR in the U.S. decreases from a high of $0.68\%$ down to $0.24\%$ over the course of this time period. We also provide IFR and lag estimates for Italy, Denmark, and the Netherlands, all of which also exhibit decreasing IFRs but to different degrees.
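The idea of treating the infection-to-death lag as a random variable can be sketched as a forward model: expected deaths on a given day are a lag-weighted sum of earlier infections, scaled by the IFR. This is a minimal illustration only. The lag distribution, infection counts, and IFR value below are made-up inputs, not the paper's fitted values, and the sketch does not perform the fitting step the abstract describes.

```python
def expected_deaths(infections, lag_pmf, ifr):
    """Expected deaths per day when each infection is fatal with probability `ifr`
    after a random lag, where lag_pmf[d] is P(lag = d days)."""
    deaths = [0.0] * len(infections)
    for t in range(len(infections)):
        for d, prob in enumerate(lag_pmf):
            if t - d >= 0:
                deaths[t] += ifr * infections[t - d] * prob
    return deaths


# Hypothetical inputs: five days of infections, lag of 1 or 2 days with equal probability.
infections = [1000, 2000, 4000, 4000, 2000]
lag_pmf = [0.0, 0.5, 0.5]
print(expected_deaths(infections, lag_pmf, ifr=0.005))
```

Fitting would then amount to choosing the lag parameters and IFR so that this predicted sequence matches the observed death sequence.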
Submitted 9 September, 2021;
originally announced September 2021.
-
WCFS: A new framework for analyzing multiserver systems
Authors:
Isaac Grosof,
Mor Harchol-Balter,
Alan Scheller-Wolf
Abstract:
Multiserver queueing systems are found at the core of a wide variety of practical systems. Many important multiserver models have a previously-unexplained similarity: identical mean response time behavior is empirically observed in the heavy traffic limit. We explain this similarity for the first time.
We do so by introducing the work-conserving finite-skip (WCFS) framework, which encompasses a broad class of important models. This class includes the heterogeneous M/G/k, the limited processor sharing policy for the M/G/1, the threshold parallelism model, and the multiserver-job model under a novel scheduling algorithm.
We prove that for all WCFS models, scaled mean response time $E[T](1-ρ)$ converges to the same value, $E[S^2]/(2E[S])$, in the heavy-traffic limit, which is also the heavy traffic limit for the M/G/1/FCFS. Moreover, we prove additively tight bounds on mean response time for the WCFS class, which hold for all load $ρ$. For each of the four models mentioned above, our bounds are the first known bounds on mean response time.
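The heavy-traffic limit quoted above can be checked numerically for the M/G/1/FCFS baseline using the standard Pollaczek-Khinchine formula, $E[T] = \frac{\lambda E[S^2]}{2(1-\rho)} + E[S]$ with $\rho = \lambda E[S]$. The sketch below covers only that single-server baseline, not the WCFS models themselves; the function name and the deterministic job size example are illustrative choices.

```python
def mg1_fcfs_mean_response_time(lam, es, es2):
    """Pollaczek-Khinchine mean response time for the M/G/1/FCFS queue.

    lam: arrival rate, es: E[S], es2: E[S^2].
    """
    rho = lam * es
    if not 0 <= rho < 1:
        raise ValueError("load rho must satisfy 0 <= rho < 1")
    return lam * es2 / (2 * (1 - rho)) + es


# Deterministic job sizes of length 1: E[S] = 1 and E[S^2] = 1.
es, es2 = 1.0, 1.0
limit = es2 / (2 * es)
for rho in (0.9, 0.99, 0.999):
    lam = rho / es
    scaled = mg1_fcfs_mean_response_time(lam, es, es2) * (1 - rho)
    print(f"rho={rho}: E[T](1-rho)={scaled:.4f}  (limit {limit:.4f})")
```

As the load approaches 1, the scaled mean response time approaches $E[S^2]/(2E[S])$, the value the abstract states for all WCFS models.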
Submitted 12 June, 2022; v1 submitted 26 September, 2021;
originally announced September 2021.
-
Nudge: Stochastically Improving upon FCFS
Authors:
Isaac Grosof,
Kunhe Yang,
Ziv Scully,
Mor Harchol-Balter
Abstract:
The First-Come First-Served (FCFS) scheduling policy is the most popular scheduling algorithm used in practice. Furthermore, its usage is theoretically validated: for light-tailed job size distributions, FCFS has weakly optimal asymptotic tail of response time. But what if we don't just care about the asymptotic tail? What if we also care about the 99th percentile of response time, or the fraction of jobs that complete in under one second? Is FCFS still best? Outside of the asymptotic regime, only loose bounds on the tail of FCFS are known, and optimality is completely open.
In this paper, we introduce a new policy, Nudge, which is the first policy to provably stochastically improve upon FCFS. We prove that Nudge simultaneously improves upon FCFS at every point along the tail, for light-tailed job size distributions. As a result, Nudge outperforms FCFS for every moment and every percentile of response time. Moreover, Nudge provides a multiplicative improvement over FCFS in the asymptotic tail. This resolves a long-standing open problem by showing that, counter to previous conjecture, FCFS is not strongly asymptotically optimal.
Submitted 2 June, 2021;
originally announced June 2021.
-
Zero Queueing for Multi-Server Jobs
Authors:
Weina Wang,
Qiaomin Xie,
Mor Harchol-Balter
Abstract:
Cloud computing today is dominated by multi-server jobs. These are jobs that request multiple servers simultaneously and hold onto all of these servers for the duration of the job. Multi-server jobs add a lot of complexity to the traditional one-job-per-server model: an arrival might not "fit" into the available servers and might have to queue, blocking later arrivals and leaving servers idle. From a queueing perspective, almost nothing is understood about multi-server job queueing systems; even understanding the exact stability region is a very hard problem.
In this paper, we investigate a multi-server job queueing model under scaling regimes where the number of servers in the system grows. Specifically, we consider a system with multiple classes of jobs, where jobs from different classes can request different numbers of servers and have different service time distributions, and jobs are served in first-come-first-served order. The multi-server job model opens up new scaling regimes where both the number of servers that a job needs and the system load scale with the total number of servers. Within these scaling regimes, we derive the first results on stability, queueing probability, and the transient analysis of the number of jobs in the system for each class. In particular we derive sufficient conditions for zero queueing. Our analysis introduces a novel way of extracting information from the Lyapunov drift, which can be applicable to a broader scope of problems in queueing systems.
Submitted 4 February, 2021; v1 submitted 20 November, 2020;
originally announced November 2020.
-
heSRPT: Parallel Scheduling to Minimize Mean Slowdown
Authors:
Benjamin Berg,
Rein Vesilo,
Mor Harchol-Balter
Abstract:
Modern data centers serve workloads which are capable of exploiting parallelism. When a job parallelizes across multiple servers it will complete more quickly, but jobs receive diminishing returns from being allocated additional servers. Because allocating multiple servers to a single job is inefficient, it is unclear how best to allocate a fixed number of servers between many parallelizable jobs. This paper provides the first optimal allocation policy for minimizing the mean slowdown of parallelizable jobs of known size when all jobs are present at time 0. Our policy provides a simple closed form formula for the optimal allocations at every moment in time. Minimizing mean slowdown usually requires favoring short jobs over long ones (as in the SRPT policy). However, because parallelizable jobs have sublinear speedup functions, system efficiency is also an issue. System efficiency is maximized by giving equal allocations to all jobs and thus competes with the goal of prioritizing small jobs. Our optimal policy, high-efficiency SRPT (heSRPT), balances these competing goals. heSRPT completes jobs according to their size order, but maintains overall system efficiency by allocating some servers to each job at every moment in time. Our results generalize to also provide the optimal allocation policy with respect to mean flow time. Finally, we consider the online case where jobs arrive to the system over time. While optimizing mean slowdown in the online setting is even more difficult, we find that heSRPT provides an excellent heuristic policy for the online setting. In fact, our simulations show that heSRPT significantly outperforms state-of-the-art allocation policies for parallelizable jobs.
Submitted 18 November, 2020;
originally announced November 2020.
-
Stability for Two-class Multiserver-job Systems
Authors:
Isaac Grosof,
Mor Harchol-Balter,
Alan Scheller-Wolf
Abstract:
Multiserver-job systems, where jobs require concurrent service at many servers, occur widely in practice. Much is known in the dropping setting, where jobs are immediately discarded if they require more servers than are currently available. However, very little is known in the more practical setting where jobs queue instead.
In this paper, we derive a closed-form analytical expression for the stability region of a two-class (non-dropping) multiserver-job system where each class of jobs requires a distinct number of servers and requires a distinct exponential distribution of service time, and jobs are served in first-come-first-served (FCFS) order. This is the first result of any kind for an FCFS multiserver-job system where the classes have distinct service distributions. Our work is based on a technique that leverages the idea of a "saturated" system, in which an unlimited number of jobs are always available.
Our analytical formula provides insight into the behavior of FCFS multiserver-job systems, highlighting the huge wastage (idle servers while jobs are in the queue) that can occur, as well as the nonmonotonic effects of the service rates on wastage.
Submitted 1 October, 2020;
originally announced October 2020.
-
Optimal Resource Allocation for Elastic and Inelastic Jobs
Authors:
Benjamin Berg,
Mor Harchol-Balter,
Benjamin Moseley,
Weina Wang,
Justin Whitehouse
Abstract:
Modern data centers are tasked with processing heterogeneous workloads consisting of various classes of jobs. These classes differ in their arrival rates, size distributions, and job parallelizability. With respect to parallelizability, some jobs are elastic, meaning they can parallelize linearly across many servers. Other jobs are inelastic, meaning they can only run on a single server. Although job classes can differ drastically, they are typically forced to share a single cluster. When sharing a cluster among heterogeneous jobs, one must decide how to allocate servers to each job at every moment in time. In this paper, we design and analyze allocation policies which aim to minimize the mean response time across jobs, where a job's response time is the time from when it arrives until it completes.
We model this problem in a stochastic setting where each job may be elastic or inelastic. Job sizes are drawn from exponential distributions, but are unknown to the system. We show that, in the common case where elastic jobs are larger on average than inelastic jobs, the optimal allocation policy is Inelastic-First, giving inelastic jobs preemptive priority over elastic jobs. We obtain this result by introducing a novel sample path argument. We also show that there exist cases where Elastic-First (giving priority to elastic jobs) performs better than Inelastic-First. We then provide the first analysis of mean response time under both Elastic-First and Inelastic-First by leveraging recent techniques for solving high-dimensional Markov chains.
Submitted 19 May, 2020;
originally announced May 2020.
-
Optimal Multiserver Scheduling with Unknown Job Sizes in Heavy Traffic
Authors:
Ziv Scully,
Isaac Grosof,
Mor Harchol-Balter
Abstract:
We consider scheduling to minimize mean response time of the M/G/k queue with unknown job sizes. In the single-server case, the optimal policy is the Gittins policy, but it is not known whether Gittins or any other policy is optimal in the multiserver case. Exactly analyzing the M/G/k under any scheduling policy is intractable, and Gittins is a particularly complicated policy that is hard to analyze even in the single-server case.
In this work we introduce monotonic Gittins (M-Gittins), a new variation of the Gittins policy, and show that it minimizes mean response time in the heavy-traffic M/G/k for a wide class of finite-variance job size distributions. We also show that the monotonic shortest expected remaining processing time (M-SERPT) policy, which is simpler than M-Gittins, is a 2-approximation for mean response time in the heavy traffic M/G/k under similar conditions. These results constitute the most general optimality results to date for the M/G/k with unknown job sizes. Our techniques build upon work by Grosof et al., who study simple policies, such as SRPT, in the M/G/k; Bansal et al., Kamphorst and Zwart, and Lin et al., who analyze mean response time scaling of simple policies in the heavy-traffic M/G/1; and Aalto et al. and Scully et al., who characterize and analyze the Gittins policy in the M/G/1.
Submitted 26 October, 2020; v1 submitted 30 March, 2020;
originally announced March 2020.
-
Simple Near-Optimal Scheduling for the M/G/1
Authors:
Ziv Scully,
Mor Harchol-Balter,
Alan Scheller-Wolf
Abstract:
We consider the problem of preemptively scheduling jobs to minimize mean response time of an M/G/1 queue. When we know each job's size, the shortest remaining processing time (SRPT) policy is optimal. Unfortunately, in many settings we do not have access to each job's size. Instead, we know only the job size distribution. In this setting the Gittins policy is known to minimize mean response time, but its complex priority structure can be computationally intractable. A much simpler alternative to Gittins is the shortest expected remaining processing time (SERPT) policy. While SERPT is a natural extension of SRPT to unknown job sizes, it is unknown whether or not SERPT is close to optimal for mean response time.
We present a new variant of SERPT called monotonic SERPT (M-SERPT) which is as simple as SERPT but has provably near-optimal mean response time at all loads for any job size distribution. Specifically, we prove the mean response time ratio between M-SERPT and Gittins is at most 3 for load $\rho \leq 8/9$ and at most 5 for any load. This makes M-SERPT the only non-Gittins scheduling policy known to have a constant-factor approximation ratio for mean response time.
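To make the SERPT and M-SERPT priorities concrete, here is a minimal sketch for a discrete job size distribution. It rests on assumptions not stated in the abstract: the SERPT rank of a job at age a is taken to be its expected remaining size, E[X - a | X > a], with lower rank meaning higher priority, and the M-SERPT rank is taken to be the running maximum of the SERPT rank over the ages evaluated so far. The function names and the example distribution are hypothetical.

```python
def serpt_rank(age, sizes, probs):
    """Expected remaining size E[X - age | X > age] for a discrete distribution.

    Assumption: lower rank means higher priority.
    """
    tail = [(x, p) for x, p in zip(sizes, probs) if x > age]
    mass = sum(p for _, p in tail)
    if mass == 0:
        return 0.0  # no job survives past this age
    return sum(x * p for x, p in tail) / mass - age


def m_serpt_ranks(ages, sizes, probs):
    """Running maximum of the SERPT rank over a grid of ages.

    Assumption: this monotone envelope is how "monotonic" SERPT is read here;
    the grid only approximates a supremum over all continuous ages.
    """
    ranks = []
    best = 0.0
    for a in sorted(ages):
        best = max(best, serpt_rank(a, sizes, probs))
        ranks.append(best)
    return ranks


# Hypothetical distribution: a job has size 1 or 10, each with probability 1/2.
sizes, probs = [1, 10], [0.5, 0.5]
print([serpt_rank(a, sizes, probs) for a in [0, 0.5, 1, 5]])
print(m_serpt_ranks([0, 0.5, 1, 5], sizes, probs))
```

Under SERPT a job's rank can fall again as it ages (here it drops back to 5 at age 5), while the monotone version never lets a job's priority improve once it has worsened.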
Submitted 22 January, 2020; v1 submitted 24 July, 2019;
originally announced July 2019.
-
Load Balancing Guardrails: Keeping Your Heavy Traffic on the Road to Low Response Times
Authors:
Isaac Grosof,
Ziv Scully,
Mor Harchol-Balter
Abstract:
Load balancing systems, comprising a central dispatcher and a scheduling policy at each server, are widely used in practice, and their response time has been extensively studied in the theoretical literature. While much is known about the scenario where the scheduling at the servers is First-Come-First-Served (FCFS), to minimize mean response time we must use Shortest-Remaining-Processing-Time (SRPT) scheduling at the servers. Much less is known about dispatching policies when SRPT scheduling is used. Unfortunately, traditional dispatching policies that are used in practice in systems with FCFS servers often have poor performance in systems with SRPT servers. In this paper, we devise a simple fix that can be applied to any dispatching policy. This fix, called guardrails, ensures that the dispatching policy yields optimal mean response time under heavy traffic when used in a system with SRPT servers. Any dispatching policy, when augmented with guardrails, becomes heavy-traffic optimal. Our results yield the first analytical bounds on mean response time for load balancing systems with SRPT scheduling at the servers.
Submitted 9 May, 2019;
originally announced May 2019.
-
heSRPT: Optimal Parallel Scheduling of Jobs With Known Sizes
Authors:
Benjamin Berg,
Rein Vesilo,
Mor Harchol-Balter
Abstract:
When parallelizing a set of jobs across many servers, one must balance a trade-off between granting priority to short jobs and maintaining the overall efficiency of the system. When the goal is to minimize the mean flow time of a set of jobs, it is usually the case that one wants to complete short jobs before long jobs. However, since jobs usually cannot be parallelized with perfect efficiency, granting strict priority to the short jobs can result in very low system efficiency which in turn hurts the mean flow time across jobs. In this paper, we derive the optimal policy for allocating servers to jobs at every moment in time in order to minimize mean flow time across jobs. We assume that jobs follow a sublinear, concave speedup function, and hence jobs experience diminishing returns from being allocated additional servers. We show that the optimal policy, heSRPT, will complete jobs according to their size order, but maintains overall system efficiency by allocating some servers to each job at every moment in time. We compare heSRPT with state-of-the-art allocation policies from the literature and show that heSRPT outperforms its competitors by at least 30%, and often by much more.
Submitted 20 November, 2020; v1 submitted 21 March, 2019;
originally announced March 2019.
-
SRPT for Multiserver Systems
Authors:
Isaac Grosof,
Ziv Scully,
Mor Harchol-Balter
Abstract:
The Shortest Remaining Processing Time (SRPT) scheduling policy and its variants have been extensively studied in both theoretical and practical settings. While beautiful results are known for single-server SRPT, much less is known for multiserver SRPT. In particular, stochastic analysis of the M/G/k under multiserver SRPT is entirely open. Intuition suggests that multiserver SRPT should be optimal or near-optimal for minimizing mean response time. However, the only known analysis of multiserver SRPT is in the worst-case adversarial setting, where SRPT can be far from optimal. In this paper, we give the first stochastic analysis bounding mean response time of the M/G/k under multiserver SRPT. Using our response time bound, we show that multiserver SRPT has asymptotically optimal mean response time in the heavy-traffic limit. The key to our bounds is a strategic combination of stochastic and worst-case techniques. Beyond SRPT, we prove similar response time bounds and optimality results for several other multiserver scheduling policies.
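As a small illustration of the multiserver SRPT rule itself (not of the paper's analysis), the sketch below picks which jobs are in service at a given instant, assuming the usual definition that the k servers work on the k jobs with the smallest remaining sizes. The function name `srpt_k_select` is hypothetical.

```python
import heapq


def srpt_k_select(remaining, k):
    """Return indices of the (at most) k jobs with the smallest remaining size.

    remaining: remaining processing time of each job currently in the system.
    k: number of servers; each serves one job at a time.
    """
    return heapq.nsmallest(k, range(len(remaining)), key=lambda i: remaining[i])


# Four jobs in the system, two servers: the jobs with 1.0 and 2.0 units left run.
print(srpt_k_select([5.0, 1.0, 3.0, 2.0], 2))
```

When fewer than k jobs are present, some servers sit idle, which is one reason multiserver SRPT is harder to analyze than its single-server counterpart.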
Submitted 19 May, 2018;
originally announced May 2018.
-
Optimal Scheduling and Exact Response Time Analysis for Multistage Jobs
Authors:
Ziv Scully,
Mor Harchol-Balter,
Alan Scheller-Wolf
Abstract:
Scheduling to minimize mean response time in an M/G/1 queue is a classic problem. The problem is usually addressed in one of two scenarios. In the perfect-information scenario, the scheduler knows each job's exact size, or service requirement. In the zero-information scenario, the scheduler knows only each job's size distribution. The well-known shortest remaining processing time (SRPT) policy is optimal in the perfect-information scenario, and the more complex Gittins policy is optimal in the zero-information scenario.
In real systems the scheduler often has partial but incomplete information about each job's size. We introduce a new job model, that of multistage jobs, to capture this partial-information scenario. A multistage job consists of a sequence of stages, where both the sequence of stages and stage sizes are unknown, but the scheduler always knows which stage of a job is in progress. We give an optimal algorithm for scheduling multistage jobs in an M/G/1 queue and an exact response time analysis of our algorithm.
Submitted 12 November, 2018; v1 submitted 17 May, 2018;
originally announced May 2018.
-
SOAP: One Clean Analysis of All Age-Based Scheduling Policies
Authors:
Ziv Scully,
Mor Harchol-Balter,
Alan Scheller-Wolf
Abstract:
We consider an extremely broad class of M/G/1 scheduling policies called SOAP: Schedule Ordered by Age-based Priority. The SOAP policies include almost all scheduling policies in the literature as well as an infinite number of variants which have never been analyzed, or maybe not even conceived. SOAP policies range from classic policies, like first-come, first-serve (FCFS), foreground-background (FB), class-based priority, and shortest remaining processing time (SRPT); to much more complicated scheduling rules, such as the famously complex Gittins index policy and other policies in which a job's priority changes arbitrarily with its age. While the response time of policies in the former category is well understood, policies in the latter category have resisted response time analysis. We present a universal analysis of all SOAP policies, deriving the mean and Laplace-Stieltjes transform of response time.
Submitted 17 February, 2018; v1 submitted 3 December, 2017;
originally announced December 2017.
-
Practical Bounds on Optimal Caching with Variable Object Sizes
Authors:
Daniel S. Berger,
Nathan Beckmann,
Mor Harchol-Balter
Abstract:
Many recent caching systems aim to improve miss ratios, but there is no good sense among practitioners of how much further miss ratios can be improved. In other words, should the systems community continue working on this problem? Currently, there is no principled answer to this question. In practice, object sizes often vary by several orders of magnitude, where computing the optimal miss ratio (OPT) is known to be NP-hard. The few known results on caching with variable object sizes provide very weak bounds and are impractical to compute on traces of realistic length.
We propose a new method to compute upper and lower bounds on OPT. Our key insight is to represent caching as a min-cost flow problem, hence we call our method the flow-based offline optimal (FOO). We prove that, under simple independence assumptions, FOO's bounds become tight as the number of objects goes to infinity. Indeed, FOO's error over 10M requests of production CDN and storage traces is negligible: at most 0.3%. FOO thus reveals, for the first time, the limits of caching with variable object sizes. While FOO is very accurate, it is computationally impractical on traces with hundreds of millions of requests. We therefore extend FOO to obtain more efficient bounds on OPT, which we call practical flow-based offline optimal (PFOO). We evaluate PFOO on several full production traces and use it to compare OPT to prior online policies. This analysis shows that current caching systems are in fact still far from optimal, suffering 11-43% more cache misses than OPT, whereas the best prior offline bounds suggest that there is essentially no room for improvement.
Submitted 5 July, 2018; v1 submitted 10 November, 2017;
originally announced November 2017.
-
Delay Asymptotics and Bounds for Multi-Task Parallel Jobs
Authors:
Weina Wang,
Mor Harchol-Balter,
Haotian Jiang,
Alan Scheller-Wolf,
R. Srikant
Abstract:
We study delay of jobs that consist of multiple parallel tasks, which is a critical performance metric in a wide range of applications such as data file retrieval in coded storage systems and parallel computing. In this problem, each job is completed only when all of its tasks are completed, so the delay of a job is the maximum of the delays of its tasks. Despite the wide attention this problem has received, tight analysis is still largely unknown since analyzing job delay requires characterizing the complicated correlation among task delays, which is hard to do.
We first consider an asymptotic regime where the number of servers, $n$, goes to infinity, and the number of tasks in a job, $k^{(n)}$, is allowed to increase with $n$. We establish the asymptotic independence of any $k^{(n)}$ queues under the condition $k^{(n)} = o(n^{1/4})$. This greatly generalizes the asymptotic-independence type of results in the literature where asymptotic independence is shown only for a fixed constant number of queues. As a consequence of our independence result, the job delay converges to the maximum of independent task delays.
We next consider the non-asymptotic regime. Here we prove that independence yields a stochastic upper bound on job delay for any $n$ and any $k^{(n)}$ with $k^{(n)}\le n$. The key component of our proof is a new technique we develop, called "Poisson oversampling". Our approach converts the job delay problem into a corresponding balls-and-bins problem. However, in contrast with typical balls-and-bins problems where there is a negative correlation among bins, we prove that our variant exhibits positive correlation.
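The statement that job delay is the maximum of its task delays can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that the k task delays are independent and exponentially distributed with a common rate; in that special case the expected maximum is the k-th harmonic number divided by the rate. The exponential assumption and all function names are hypothetical and are not taken from the paper.

```python
import random


def job_delay(task_delays):
    """A job finishes only when its slowest task finishes."""
    return max(task_delays)


def expected_max_exponential(k, rate=1.0):
    """E[max of k i.i.d. Exp(rate) delays] = (1 + 1/2 + ... + 1/k) / rate."""
    return sum(1.0 / i for i in range(1, k + 1)) / rate


def simulate_mean_job_delay(k, rate=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of mean job delay under independent task delays."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += job_delay([rng.expovariate(rate) for _ in range(k)])
    return total / trials


# Compare the closed form with simulation for jobs of 4 tasks.
print(expected_max_exponential(4), simulate_mean_job_delay(4))
```

The mean job delay grows with the number of tasks even though each task's own delay distribution is unchanged.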
Submitted 15 September, 2018; v1 submitted 1 October, 2017;
originally announced October 2017.
-
Towards Optimality in Parallel Scheduling
Authors:
Benjamin Berg,
Jan-Pieter Dorsman,
Mor Harchol-Balter
Abstract:
To keep pace with Moore's law, chip designers have focused on increasing the number of cores per chip rather than single core performance. In turn, modern jobs are often designed to run on any number of cores. However, to effectively leverage these multi-core chips, one must address the question of how many cores to assign to each job. Given that jobs receive sublinear speedups from additional cores, there is an obvious tradeoff: allocating more cores to an individual job reduces the job's runtime, but in turn decreases the efficiency of the overall system. We ask how the system should schedule jobs across cores so as to minimize the mean response time over a stream of incoming jobs.
To answer this question, we develop an analytical model of jobs running on a multi-core machine. We prove that EQUI, a policy which continuously divides cores evenly across jobs, is optimal when all jobs follow a single speedup curve and have exponentially distributed sizes. EQUI requires jobs to change their level of parallelization while they run. Since this is not possible for all workloads, we consider a class of "fixed-width" policies, which choose a single level of parallelization, k, to use for all jobs. We prove that, surprisingly, it is possible to achieve EQUI's performance without requiring jobs to change their levels of parallelization by using the optimal fixed level of parallelization, k*. We also show how to analytically derive the optimal k* as a function of the system load, the speedup curve, and the job size distribution.
In the case where jobs may follow different speedup curves, finding a good scheduling policy is even more challenging. We find that policies like EQUI which performed well in the case of a single speedup function now perform poorly. We propose a very simple policy, GREEDY*, which performs near-optimally when compared to the numerically-derived optimal policy.
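The tradeoff between per-job runtime and overall efficiency described above can be sketched numerically. The example assumes a hypothetical sublinear speedup function s(k) = k^p with p = 0.5, which is not specified in the abstract, and shows an EQUI-style allocation that divides cores evenly across the jobs present. All names are illustrative.

```python
def speedup(k, p=0.5):
    """Hypothetical sublinear speedup: service rate of a job running on k cores."""
    return k ** p


def equi_allocation(n_cores, n_jobs):
    """EQUI: divide the cores evenly (possibly fractionally) across all jobs."""
    if n_jobs == 0:
        return []
    return [n_cores / n_jobs] * n_jobs


def total_service_rate(allocation, p=0.5):
    """Aggregate rate at which work is completed under a given allocation."""
    return sum(speedup(k, p) for k in allocation)


# 16 cores given to one job versus split evenly across four jobs.
print(total_service_rate([16]))                    # one job runs fast
print(total_service_rate(equi_allocation(16, 4)))  # system completes more work per unit time
```

Under this assumed speedup curve, giving all 16 cores to one job yields an aggregate rate of 4, while splitting them across four jobs yields 8, at the cost of each individual job running at half the speed.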
Submitted 31 October, 2017; v1 submitted 21 July, 2017;
originally announced July 2017.