-
SMaRTT-REPS: Sender-based Marked Rapidly-adapting Trimmed & Timed Transport with Recycled Entropies
Authors:
Tommaso Bonato,
Abdul Kabbani,
Daniele De Sensi,
Rong Pan,
Yanfang Le,
Costin Raiciu,
Mark Handley,
Timo Schneider,
Nils Blach,
Ahmad Ghalayini,
Daniel Alves,
Michael Papamichael,
Adrian Caulfield,
Torsten Hoefler
Abstract:
With the rapid growth of machine learning (ML) workloads in datacenters, existing congestion control (CC) algorithms fail to deliver the required performance at scale. ML traffic is bursty and bulk-synchronous and thus requires quick reaction and strong fairness. We show that existing CC algorithms that use delay as a main signal react too slowly and are not always fair. We design SMaRTT, a simple sender-based CC algorithm that combines delay, ECN, and optional packet trimming for fast and precise window adjustments. At the core of SMaRTT lies the novel QuickAdapt algorithm that accurately estimates the bandwidth at the receiver. We show how to combine SMaRTT with a new per-packet traffic load-balancing algorithm called REPS to effectively reroute packets around congested hotspots as well as flaky or failing links. Our evaluation shows that SMaRTT alone outperforms EQDS, Swift, BBR, and MPRDMA by up to 50% on modern datacenter networks.
Submitted 27 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Swing: Short-cutting Rings for Higher Bandwidth Allreduce
Authors:
Daniele De Sensi,
Tommaso Bonato,
David Saam,
Torsten Hoefler
Abstract:
The allreduce collective operation accounts for a significant fraction of the runtime of workloads running on distributed systems. One factor determining its performance is the distance between communicating nodes, especially on networks like torus, where a higher distance implies multiple messages being forwarded on the same link, thus reducing the allreduce bandwidth. Torus networks are widely used on systems optimized for machine learning workloads (e.g., Google TPUs and Amazon Trainium devices), as well as on some of the Top500 supercomputers. To improve allreduce performance on torus networks we introduce Swing, a new algorithm that keeps a low distance between communicating nodes by swinging between torus directions. Our analysis and experimental evaluation show that Swing outperforms by up to 3x existing allreduce algorithms for vectors ranging from 32B to 128MiB, on different types of torus and torus-like topologies, regardless of their shape and size.
Submitted 4 March, 2024; v1 submitted 17 January, 2024;
originally announced January 2024.
-
HammingMesh: A Network Topology for Large-Scale Deep Learning
Authors:
Torsten Hoefler,
Tommaso Bonato,
Daniele De Sensi,
Salvatore Di Girolamo,
Shigang Li,
Marco Heddes,
Jon Belk,
Deepak Goel,
Miguel Castro,
Steve Scott
Abstract:
Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job scheduling flexibility. Specifically, HammingMesh can support full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep learning systems with extreme bandwidth requirements.
Submitted 21 October, 2022; v1 submitted 3 September, 2022;
originally announced September 2022.
-
Spectra and ground states of one- and two-dimensional laser-driven lattices of ultracold Rydberg atoms
Authors:
Wolfgang Zeller,
Michael Mayle,
Thorsten Bonato,
Gerhard Reinelt,
Peter Schmelcher
Abstract:
We investigate static properties of laser-driven, ultracold Rydberg atoms confined to one- and two-dimensional uniform lattices in the limit of vanishing laser coupling. The spectral structure of square lattices is compared to those of linear chains and similarities as well as differences are pointed out. Furthermore, we employ a method based on elements of graph theory to numerically determine the laser detuning-dependent ground states of various lattice geometries. Ground states for chains as well as square and rectangular lattices are provided and discussed.
Submitted 1 June, 2012; v1 submitted 24 February, 2012;
originally announced February 2012.