Scheduling Multi-Server Jobs with Sublinear Regrets via Online Learning

Zhao, Hailiang; Deng, Shuiguang; Xiang, Zhengzhe; Yan, Xueqiang; Yin, Jianwei; Dustdar, Schahram; Zomaya, Albert Y.

arXiv:2305.06572v1 (cs)

[Submitted on 11 May 2023 (this version), latest version 5 Aug 2023 (v2)]

Abstract: Multi-server jobs, which request multiple computing devices and hold onto them throughout their execution, dominate modern computing clusters. When allocating computing devices to such jobs, it is difficult to balance the parallel computation gain against the internal communication overhead, for two reasons. First, the computation gain does not increase linearly with the number of computing devices. Second, the device type that dominates the communication overhead varies across job types. To achieve a better gain-overhead tradeoff, we formulate a cumulative reward maximization program and design an online algorithm, OGASched, to schedule multi-server jobs. The reward of a job is the parallel computation gain aggregated over the allocated computing devices minus a penalty on the dominant communication overhead. OGASched allocates computing devices to each arriving job along the ascent direction of the reward gradients. With concave rewards, OGASched has a best-so-far regret, which grows sublinearly with the number of job types and the time slot length. OGASched also runs several sub-procedures in parallel to accelerate its computation, which greatly reduces its complexity. We conduct extensive trace-driven simulations to validate the performance of OGASched. The results demonstrate that OGASched outperforms widely used heuristics by $11.33\%$, $7.75\%$, $13.89\%$, and $13.44\%$, respectively.
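The gradient-ascent idea described in the abstract can be illustrated with a small sketch of projected online gradient ascent. This is not the authors' OGASched implementation: the abstract does not give the reward's functional form, so the logarithmic gain, the linear overhead terms, the coefficients, the box-shaped capacity constraint, and the step size `eta0 / sqrt(t)` are all assumptions chosen only so that the reward is concave. Every function name below is hypothetical.

```python
import math

def reward(x, gain, overhead, beta):
    """Toy concave reward (assumed form): log-shaped gain summed over
    device types minus beta times the largest overhead term."""
    total_gain = sum(a * math.log1p(v) for a, v in zip(gain, x))
    dominant = max(b * v for b, v in zip(overhead, x))
    return total_gain - beta * dominant

def supergradient(x, gain, overhead, beta):
    """One supergradient of the toy reward at allocation x."""
    terms = [b * v for b, v in zip(overhead, x)]
    k_dom = terms.index(max(terms))  # device type with the dominant overhead
    grad = [a / (1.0 + v) for a, v in zip(gain, x)]
    grad[k_dom] -= beta * overhead[k_dom]  # the penalty acts only on that type
    return grad

def oga_step(x, grad, eta, capacity):
    """Ascent step followed by projection onto the box [0, capacity]."""
    return [min(max(v + eta * g, 0.0), c) for v, g, c in zip(x, grad, capacity)]

def run(rounds=200, eta0=0.5):
    # Illustrative numbers only; none of them come from the paper.
    gain = [4.0, 2.0, 1.0]
    overhead = [1.0, 0.5, 0.2]
    capacity = [8.0, 8.0, 8.0]
    beta = 0.5
    x = [0.0, 0.0, 0.0]
    total = 0.0
    for t in range(1, rounds + 1):
        total += reward(x, gain, overhead, beta)  # reward earned in slot t
        g = supergradient(x, gain, overhead, beta)
        x = oga_step(x, g, eta0 / math.sqrt(t), capacity)
    return x, total

if __name__ == "__main__":
    allocation, cumulative = run()
    print(allocation, cumulative)
```

The projection here is a simple clip because the assumed feasible set is a box; a capacity constraint shared across several job types would need a different projection step.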

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2305.06572 [cs.DC]
	(or arXiv:2305.06572v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2305.06572

Submission history

From: Hailiang Zhao
[v1] Thu, 11 May 2023 05:17:02 UTC (8,173 KB)
[v2] Sat, 5 Aug 2023 09:13:33 UTC (8,557 KB)
