-
Challenging the Need for Packet Spraying in Large-Scale Distributed Training
Authors:
Vamsi Addanki,
Prateesh Goyal,
Ilias Marinos
Abstract:
Large-scale distributed training in production datacenters constitutes a challenging workload bottlenecked by network communication. In response, both major industry players (e.g., Ultra Ethernet Consortium) and parts of academia have surprisingly, and almost unanimously, agreed that packet spraying is necessary to improve the performance of large-scale distributed training workloads.
In this pa…
▽ More
Large-scale distributed training in production datacenters constitutes a challenging workload bottlenecked by network communication. In response, both major industry players (e.g., Ultra Ethernet Consortium) and parts of academia have surprisingly, and almost unanimously, agreed that packet spraying is necessary to improve the performance of large-scale distributed training workloads.
In this paper, we challenge this prevailing belief and pose the question: How close can a singlepath transport approach an optimal multipath transport? We demonstrate that singlepath transport (from a NIC's perspective) is sufficient and can perform nearly as well as an ideal multipath transport with packet spraying, particularly in the context of distributed training in leaf-spine topologies. Our assertion is based on four key observations about workloads driven by collective communication patterns: (i) flows within a collective start almost simultaneously, (ii) flow sizes are nearly equal, (iii) the completion time of a collective is more crucial than individual flow completion times, and (iv) flows can be split upon arrival. We analytically prove that singlepath transport, using minimal flow splitting (at the application layer), is equivalent to an ideal multipath transport with packet spraying in terms of maximum congestion. Our preliminary evaluations support our claims. This paper suggests an alternative agenda for develo** next-generation transport protocols tailored for large-scale distributed training.
△ Less
Submitted 29 June, 2024;
originally announced July 2024.
-
Understanding the Throughput Bounds of Reconfigurable Datacenter Networks
Authors:
Vamsi Addanki,
Chen Avin,
Stefan Schmid
Abstract:
The increasing gap between the growth of datacenter traffic volume and the capacity of electrical switches led to the emergence of reconfigurable datacenter network designs based on optical circuit switching. A multitude of research works, ranging from demand-oblivious (e.g., RotorNet, Sirius) to demand-aware (e.g., Helios, ProjecToR) reconfigurable networks, demonstrate significant performance be…
▽ More
The increasing gap between the growth of datacenter traffic volume and the capacity of electrical switches led to the emergence of reconfigurable datacenter network designs based on optical circuit switching. A multitude of research works, ranging from demand-oblivious (e.g., RotorNet, Sirius) to demand-aware (e.g., Helios, ProjecToR) reconfigurable networks, demonstrate significant performance benefits. Unfortunately, little is formally known about the achievable throughput of such networks. Only recently have the throughput bounds of demand-oblivious networks been studied. In this paper, we tackle a fundamental question: Whether and to what extent can demand-aware reconfigurable networks improve the throughput of datacenters?
This paper attempts to understand the landscape of the throughput bounds of reconfigurable datacenter networks. Given the rise of machine learning workloads and collective communication in modern datacenters, we specifically focus on their typical communication patterns, namely uniform-residual demand matrices. We formally establish a separation bound of demand-aware networks over demand-oblivious networks, proving analytically that the former can provide at least $16\%$ higher throughput. Our analysis further uncovers new design opportunities based on periodic, fixed-duration reconfigurations that can harness the throughput benefits of demand-aware networks while inheriting the simplicity and low reconfiguration overheads of demand-oblivious networks. Finally, our evaluations corroborate the theoretical results of this paper, demonstrating that demand-aware networks significantly outperform oblivious networks in terms of throughput. This work barely scratches the surface and unveils several intriguing open questions, which we discuss at the end of this paper.
△ Less
Submitted 31 May, 2024;
originally announced May 2024.
-
Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions
Authors:
Vamsi Addanki,
Maciej Pacut,
Stefan Schmid
Abstract:
Packet buffers in datacenter switches are shared across all the switch ports in order to improve the overall throughput. The trend of shrinking buffer sizes in datacenter switches makes buffer sharing extremely challenging and a critical performance issue. Literature suggests that push-out buffer sharing algorithms have significantly better performance guarantees compared to drop-tail algorithms.…
▽ More
Packet buffers in datacenter switches are shared across all the switch ports in order to improve the overall throughput. The trend of shrinking buffer sizes in datacenter switches makes buffer sharing extremely challenging and a critical performance issue. Literature suggests that push-out buffer sharing algorithms have significantly better performance guarantees compared to drop-tail algorithms. Unfortunately, switches are unable to benefit from these algorithms due to lack of support for push-out operations in hardware. Our key observation is that drop-tail buffers can emulate push-out buffers if the future packet arrivals are known ahead of time. This suggests that augmenting drop-tail algorithms with predictions about the future arrivals has the potential to significantly improve performance.
This paper is the first research attempt in this direction. We propose Credence, a drop-tail buffer sharing algorithm augmented with machine-learned predictions. Credence can unlock the performance only attainable by push-out algorithms so far. Its performance hinges on the accuracy of predictions. Specifically, Credence achieves near-optimal performance of the best known push-out algorithm LQD (Longest Queue Drop) with perfect predictions, but gracefully degrades to the performance of the simplest drop-tail algorithm Complete Sharing when the prediction error gets arbitrarily worse. Our evaluations show that Credence improves throughput by $1.5$x compared to traditional approaches. In terms of flow completion times, we show that Credence improves upon the state-of-the-art approaches by up to $95\%$ using off-the-shelf machine learning techniques that are also practical in today's hardware. We believe this work opens several interesting future work opportunities both in systems and theory that we discuss at the end of this paper.
△ Less
Submitted 5 January, 2024;
originally announced January 2024.
-
Mars: Near-Optimal Throughput with Shallow Buffers in Reconfigurable Datacenter Networks
Authors:
Vamsi Addanki,
Chen Avin,
Stefan Schmid
Abstract:
The performance of large-scale computing systems often critically depends on high-performance communication networks. Dynamically reconfigurable topologies, e.g., based on optical circuit switches, are emerging as an innovative new technology to deal with the explosive growth of datacenter traffic. Specifically, \emph{periodic} reconfigurable datacenter networks (RDCNs) such as RotorNet (SIGCOMM 2…
▽ More
The performance of large-scale computing systems often critically depends on high-performance communication networks. Dynamically reconfigurable topologies, e.g., based on optical circuit switches, are emerging as an innovative new technology to deal with the explosive growth of datacenter traffic. Specifically, \emph{periodic} reconfigurable datacenter networks (RDCNs) such as RotorNet (SIGCOMM 2017), Opera (NSDI 2020) and Sirius (SIGCOMM 2020) have been shown to provide high throughput, by emulating a \emph{complete graph} through fast periodic circuit switch scheduling.
However, to achieve such a high throughput, existing reconfigurable network designs pay a high price: in terms of potentially high delays, but also, as we show as a first contribution in this paper, in terms of the high buffer requirements. In particular, we show that under buffer constraints, emulating the high-throughput complete graph is infeasible at scale, and we uncover a spectrum of unvisited and attractive alternative RDCNs, which emulate regular graphs, but with lower node degree than the complete graph.
We present Mars, a periodic reconfigurable topology which emulates a $d$-regular graph with near-optimal throughput. In particular, we systematically analyze how the degree~$d$ can be optimized for throughput given the available buffer and delay tolerance of the datacenter. We further show empirically that Mars achieves higher throughput compared to existing systems when buffer sizes are bounded.
△ Less
Submitted 28 December, 2022; v1 submitted 5 April, 2022;
originally announced April 2022.
-
PowerTCP: Pushing the Performance Limits of Datacenter Networks
Authors:
Vamsi Addanki,
Oliver Michel,
Stefan Schmid
Abstract:
Increasingly stringent throughput and latency requirements in datacenter networks demand fast and accurate congestion control. We observe that the reaction time and accuracy of existing datacenter congestion control schemes are inherently limited. They either rely only on explicit feedback about the network state (e.g., queue lengths in DCTCP) or only on variations of state (e.g., RTT gradient in…
▽ More
Increasingly stringent throughput and latency requirements in datacenter networks demand fast and accurate congestion control. We observe that the reaction time and accuracy of existing datacenter congestion control schemes are inherently limited. They either rely only on explicit feedback about the network state (e.g., queue lengths in DCTCP) or only on variations of state (e.g., RTT gradient in TIMELY). To overcome these limitations, we propose a novel congestion control algorithm, PowerTCP, which achieves much more fine-grained congestion control by adapting to the bandwidth-window product (henceforth called power). PowerTCP leverages in-band network telemetry to react to changes in the network instantaneously without loss of throughput and while kee** queues short. Due to its fast reaction time, our algorithm is particularly well-suited for dynamic network environments and bursty traffic patterns. We show analytically and empirically that PowerTCP can significantly outperform the state-of-the-art in both traditional datacenter topologies and emerging reconfigurable datacenters where frequent bandwidth changes make congestion control challenging. In traditional datacenter networks, PowerTCP reduces tail flow completion times of short flows by 80% compared to DCQCN and TIMELY, and by 33% compared to HPCC even at 60% network load. In reconfigurable datacenters, PowerTCP achieves 85% circuit utilization without incurring additional latency and cuts tail latency by at least 2x compared to existing approaches.
△ Less
Submitted 28 December, 2021;
originally announced December 2021.
-
Self-Adjusting Packet Classification
Authors:
Maciej Pacut,
Juan Vanerio,
Vamsi Addanki,
Arash Pourdamghani,
Gabor Retvari,
Stefan Schmid
Abstract:
This paper is motivated by the vision of more efficient packet classification mechanisms that self-optimize in a demand-aware manner. At the heart of our approach lies a self-adjusting linear list data structure, where unlike in the classic data structure, there are dependencies, and some items must be in front of the others; for example, to correctly classify packets by rules arranged in a linked…
▽ More
This paper is motivated by the vision of more efficient packet classification mechanisms that self-optimize in a demand-aware manner. At the heart of our approach lies a self-adjusting linear list data structure, where unlike in the classic data structure, there are dependencies, and some items must be in front of the others; for example, to correctly classify packets by rules arranged in a linked list, each rule must be in front of lower priority rules that overlap with it. After each access we can rearrange the list, similarly to Move-To-Front, but dependencies need to be respected.
We present a 4-competitive online rearrangement algorithm, whose cost is at most four times worse than the optimal offline algorithm; no deterministic algorithm can be better than 3-competitive. The algorithm is simple and attractive, especially for memory-limited systems, as it does not require any additional memory (e.g., neither timestamps nor frequency statistics). Our approach can simply be deployed as a drop-in replacement for a static datastructure, potentially benefitting many existing networks.
We evaluate our self-adjusting list packet classifier on realistic ruleset and traffic instances. We find that our classifier performs similarly to a static list for low-locality traffic, but significantly outperforms Efficuts (by a factor 7x), CutSplit (3.6x), and the static list (14x) for high locality and small rulesets. Memory consumption is 10x lower on average compared to Efficuts and CutSplit.
△ Less
Submitted 30 September, 2021;
originally announced September 2021.
-
FB: A Flexible Buffer Management Scheme for Data Center Switches
Authors:
Maria Apostolaki,
Vamsi Addanki,
Manya Ghobadi,
Laurent Vanbever
Abstract:
Today, network devices share buffer across priority queues to avoid drops during transient congestion. While cost-effective most of the time, this sharing can cause undesired interference among seemingly independent traffic. As a result, low-priority traffic can cause increased packet loss to high-priority traffic. Similarly, long flows can prevent the buffer from absorbing incoming bursts even if…
▽ More
Today, network devices share buffer across priority queues to avoid drops during transient congestion. While cost-effective most of the time, this sharing can cause undesired interference among seemingly independent traffic. As a result, low-priority traffic can cause increased packet loss to high-priority traffic. Similarly, long flows can prevent the buffer from absorbing incoming bursts even if they do not share the same queue. The cause of this perhaps unintuitive outcome is that today's buffer sharing techniques are unable to guarantee isolation across (priority) queues without statically allocating buffer space. To address this issue, we designed FB, a novel buffer sharing scheme that offers strict isolation guarantees to high-priority traffic without sacrificing link utilizations. Thus, FB outperforms conventional buffer sharing algorithms in absorbing bursts while achieving on-par throughput. We show that FB is practical and runs at line-rate on existing hardware (Barefoot Tofino). Significantly, FB's operations can be approximated in non-programmable devices.
△ Less
Submitted 21 May, 2021;
originally announced May 2021.
-
Online List Access with Precedence Constraints
Authors:
Maciej Pacut,
Juan Vanerio,
Vamsi Addanki,
Arash Pourdamghani,
Gabor Retvari,
Stefan Schmid
Abstract:
This paper considers a natural generalization of the online list access problem in the paid exchange model, where additionally there can be precedence constraints ("dependencies") among the nodes in the list. For example, this generalization is motivated by applications in the context of packet classification. Our main contributions are constant-competitive deterministic and randomized online algo…
▽ More
This paper considers a natural generalization of the online list access problem in the paid exchange model, where additionally there can be precedence constraints ("dependencies") among the nodes in the list. For example, this generalization is motivated by applications in the context of packet classification. Our main contributions are constant-competitive deterministic and randomized online algorithms, designed around a procedure Move-Recursively-Forward, a generalization of Move-To-Front tailored to handle node dependencies. Parts of the analysis build upon ideas of the classic online algorithms Move-To-Front and BIT, and address the challenges of the extended model. We further discuss the challenges related to insertions and deletions.
△ Less
Submitted 18 April, 2021;
originally announced April 2021.
-
Alias Resolution Based on ICMP Rate Limiting
Authors:
Kevin Vermeulen,
Burim Ljuma,
Vamsi Addanki,
Matthieu Gouel,
Olivier Fourmaux,
Timur Friedman,
Reza Rejaie
Abstract:
Alias resolution techniques (e.g., Midar) associate, mostly through active measurement, a set of IP addresses as belonging to a common router. These techniques rely on distinct router features that can serve as a signature. Their applicability is affected by router support of the features and the robustness of the signature. This paper presents a new alias resolution tool called Limited Ltd. that…
▽ More
Alias resolution techniques (e.g., Midar) associate, mostly through active measurement, a set of IP addresses as belonging to a common router. These techniques rely on distinct router features that can serve as a signature. Their applicability is affected by router support of the features and the robustness of the signature. This paper presents a new alias resolution tool called Limited Ltd. that exploits ICMP rate limiting, a feature that is increasingly supported by modern routers that has not previously been used for alias resolution. It sends ICMP probes toward target interfaces in order to trigger rate limiting, extracting features from the probe reply loss traces. It uses a machine learning classifier to designate pairs of interfaces as aliases. We describe the details of the algorithm used by Limited Ltd. and illustrate its feasibility and accuracy. Limited Ltd. not only is the first tool that can perform alias resolution on IPv6 routers that do not generate monotonically increasing fragmentation IDs (e.g., Juniper routers) but it also complements the state-of-the-art techniques for IPv4 alias resolution. All of our code and the collected dataset are publicly available.
△ Less
Submitted 1 February, 2020;
originally announced February 2020.