SMaRTT-REPS: Sender-based Marked Rapidly-adapting Trimmed & Timed Transport with Recycled Entropies
Authors:
Tommaso Bonato,
Abdul Kabbani,
Daniele De Sensi,
Rong Pan,
Yanfang Le,
Costin Raiciu,
Mark Handley,
Timo Schneider,
Nils Blach,
Ahmad Ghalayini,
Daniel Alves,
Michael Papamichael,
Adrian Caulfield,
Torsten Hoefler
Abstract:
With the rapid growth of machine learning (ML) workloads in datacenters, existing congestion control (CC) algorithms fail to deliver the required performance at scale. ML traffic is bursty and bulk-synchronous and thus requires quick reaction and strong fairness. We show that existing CC algorithms that use delay as a main signal react too slowly and are not always fair. We design SMaRTT, a simple sender-based CC algorithm that combines delay, ECN, and optional packet trimming for fast and precise window adjustments. At the core of SMaRTT lies the novel QuickAdapt algorithm that accurately estimates the bandwidth at the receiver. We show how to combine SMaRTT with a new per-packet traffic load-balancing algorithm called REPS to effectively reroute packets around congested hotspots as well as flaky or failing links. Our evaluation shows that SMaRTT alone outperforms EQDS, Swift, BBR, and MPRDMA by up to 50% on modern datacenter networks.
Submitted 27 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
Reliable Low-Delay Routing In Space with Routing-Oblivious LEO Satellites
Authors:
Stefano Vissicchio,
Mark Handley
Abstract:
Large networks of Low Earth Orbit (LEO) satellites are being built using inter-satellite lasers. These networks promise to offer low-latency wide-area connectivity, but reliably routing such traffic is difficult, as satellites are very resource-constrained and paths change constantly.
We present STARGLIDER, a new routing system where path computation is delegated to ground stations, while satellites are routing-oblivious and exchange no information at runtime. Yet, STARGLIDER satellites effectively support reliability primitives: they fast reroute packets over near-optimal paths when links fail, and validate that packets sent by potentially malicious ground stations follow reasonable paths.
Submitted 21 January, 2024;
originally announced January 2024.