Showing 1–1 of 1 results for author: Warraich, E

  1. arXiv:2310.06993

    cs.DC cs.NI

    Ultima: Robust and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud

    Authors: Ertza Warraich, Omer Shabtai, Khalid Manaa, Shay Vargaftik, Yonatan Piasetzky, Matty Kadosh, Lalith Suresh, Muhammad Shahbaz

    Abstract: We present Ultima, a new collective-communication system for the cloud with bounded, predictable completion times for deep-learning jobs in the presence of varying computation (stragglers) and communication (congestion and gradient drops) variabilities. Ultima exploits the inherent resiliency and the stochastic nature of distributed deep-learning (DDL) training to work with approximated gradients,…

    Submitted 10 October, 2023; originally announced October 2023.

    Comments: 12 pages