Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads
Authors:
Dharma Shukla,
Muthian Sivathanu,
Srinidhi Viswanatha,
Bhargav Gulavani,
Rimma Nehme,
Amey Agrawal,
Chen Chen,
Nipun Kwatra,
Ramachandran Ramjee,
Pankaj Sharma,
Atul Katiyar,
Vipul Modi,
Vaibhav Sharma,
Abhishek Singh,
Shreshth Singhal,
Kaustubh Welankar,
Lu Xun,
Ravi Anupindi,
Karthik Elangovan,
Hasibur Rahman,
Zhou Lin,
Rahul Seetharaman,
Cheng Xu,
Eddie Ailijiang,
Suresh Krishnappa, et al. (1 additional author not shown)
Abstract:
Lowering costs by driving high utilization across deep learning workloads is a crucial lever for cloud providers. We present Singularity, Microsoft's globally distributed scheduling service for highly efficient and reliable execution of deep learning training and inference workloads. At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads to drive high utilization without impacting their correctness or performance, across a global fleet of AI accelerators (e.g., GPUs, FPGAs).
All jobs in Singularity are preemptable, migratable, and dynamically resizable (elastic) by default: a live job can be dynamically and transparently (a) preempted and migrated to a different set of nodes, cluster, data center, or region and resumed exactly from the point where execution was preempted, and (b) resized (i.e., elastically scaled up or down) on a varying set of accelerators of a given type. Our mechanisms are transparent in that they do not require users to change their code or to use custom libraries that may limit flexibility. Additionally, our approach significantly improves the reliability of deep learning workloads. We show that the resulting efficiency and reliability gains with Singularity are achieved with negligible impact on steady-state performance. Finally, our design approach is agnostic to DNN architectures and handles a variety of parallelism strategies (e.g., data/pipeline/model parallelism).
Submitted 21 February, 2022; v1 submitted 15 February, 2022;
originally announced February 2022.
Kinetic theory of Hyaluronan cleavage by Bovine Testicular Hyaluronidase in Standard and Crowded Environments
Authors:
Reine Nehmé,
Rouba Nasreddine,
Lucija Orlic,
Chrystel Lopin-Bon,
Josef Hamacek,
Francesco Piazza
Abstract:
${\bf Background}$ Details of the kinetic pathways governing enzymatic cleavage of hyaluronic acid (HA) by hyaluronidase are still largely uncharted. Capillary electrophoresis-based assays were used for accurate quantification of enzymatic products. A crowding agent was also employed to mimic excluded-volume constraints typical of in-vivo conditions.
${\bf Scope}$ Introduce a comprehensive kinetic model describing the late-stage degradation of HA by hyaluronidase and identify the relevant kinetic pathways and the associated rates.
${\bf Major Conclusions}$ All relevant fragmentation and transglycosylation pathways and rates were identified. Two dimers forming a tetramer is the dominant recombination pathway. Macromolecular and self-crowding slow down the kinetics but do not alter the underlying mechanisms.
${\bf General Significance}$ Our results bring a novel and comprehensive quantitative insight into enzymatic HA degradation. Rationalizing the effect of crowding brings the intricate conditions of in-vivo settings a little closer, and also stands as a powerful tool to pinpoint relevant kinetic pathways in complex systems.
Submitted 29 November, 2020;
originally announced November 2020.