Showing 1–13 of 13 results for author: De Sensi, D

Searching in archive cs.
  1. arXiv:2404.15888  [pdf, other]

    cs.DC cs.PF

    Near-Optimal Wafer-Scale Reduce

    Authors: Piotr Luczynski, Lukas Gianinazzi, Patrick Iff, Leighton Wilson, Daniele De Sensi, Torsten Hoefler

    Abstract: Efficient Reduce and AllReduce communication collectives are a critical cornerstone of high-performance computing (HPC) applications. We present the first systematic investigation of Reduce and AllReduce on the Cerebras Wafer-Scale Engine (WSE). This architecture has been shown to achieve unprecedented performance both for machine learning workloads and other computational problems like FFT. We in…

    Submitted 16 May, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

    Comments: To appear at HPDC 2024

    ACM Class: F.2.2

  2. arXiv:2404.01630  [pdf, other]

    cs.NI

    SMaRTT-REPS: Sender-based Marked Rapidly-adapting Trimmed & Timed Transport with Recycled Entropies

    Authors: Tommaso Bonato, Abdul Kabbani, Daniele De Sensi, Rong Pan, Yanfang Le, Costin Raiciu, Mark Handley, Timo Schneider, Nils Blach, Ahmad Ghalayini, Daniel Alves, Michael Papamichael, Adrian Caulfield, Torsten Hoefler

    Abstract: With the rapid growth of machine learning (ML) workloads in datacenters, existing congestion control (CC) algorithms fail to deliver the required performance at scale. ML traffic is bursty and bulk-synchronous and thus requires quick reaction and strong fairness. We show that existing CC algorithms that use delay as a main signal react too slowly and are not always fair. We design SMaRTT, a simple…

    Submitted 27 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

    Comments: Fixed typo and wrong y axis of one plot

  3. arXiv:2401.09356  [pdf, other]

    cs.DC cs.LG cs.NI cs.PF

    Swing: Short-cutting Rings for Higher Bandwidth Allreduce

    Authors: Daniele De Sensi, Tommaso Bonato, David Saam, Torsten Hoefler

    Abstract: The allreduce collective operation accounts for a significant fraction of the runtime of workloads running on distributed systems. One factor determining its performance is the distance between communicating nodes, especially on networks like torus, where a higher distance implies multiple messages being forwarded on the same link, thus reducing the allreduce bandwidth. Torus networks are widely u…

    Submitted 4 March, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

    ACM Class: C.2.4; C.2.2

    Journal ref: NSDI 2024

  4. arXiv:2310.03742  [pdf, other]

    cs.NI

    A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network

    Authors: Nils Blach, Maciej Besta, Daniele De Sensi, Jens Domke, Hussein Harake, Shigang Li, Patrick Iff, Marek Konieczny, Kartik Lakhotia, Ales Kubicek, Marcel Ferrari, Fabrizio Petrini, Torsten Hoefler

    Abstract: Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, or Dragonfly. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on deployment, management, and operational aspects of our test cluster with 200 servers and carefully a…

    Submitted 21 April, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

    Journal ref: Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI '24) Santa Clara, CA, USA April 16-18, 2024

  5. arXiv:2309.16214  [pdf, other]

    cs.DC cs.NI

    Canary: Congestion-Aware In-Network Allreduce Using Dynamic Trees

    Authors: Daniele De Sensi, Edgar Costa Molero, Salvatore Di Girolamo, Laurent Vanbever, Torsten Hoefler

    Abstract: The allreduce operation is an essential building block for many distributed applications, ranging from the training of deep learning models to scientific computing. In an allreduce operation, data from multiple hosts is aggregated together and then broadcasted to each host participating in the operation. Allreduce performance can be improved by a factor of two by aggregating the data directly in t…

    Submitted 28 September, 2023; originally announced September 2023.

    ACM Class: C.2.1; C.2.2; C.2.4; C.5.1

  6. arXiv:2309.03628  [pdf, other]

    cs.NI cs.DC cs.OS eess.SY

    OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs

    Authors: Mikhail Khalilov, Marcin Chrapek, Siyuan Shen, Alessandro Vezzu, Thomas Benz, Salvatore Di Girolamo, Timo Schneider, Daniele De Sensi, Luca Benini, Torsten Hoefler

    Abstract: Multi-tenancy is essential for unleashing SmartNIC's potential in datacenters. Our systematic analysis in this work shows that existing on-path SmartNICs have resource multiplexing limitations. For example, existing solutions lack multi-tenancy capabilities such as performance isolation and QoS provisioning for compute and IO resources. Compared to standard NIC data paths with a well-defined set o…

    Submitted 13 March, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

    Comments: 12 pages, 14 figures, 103 references

  7. arXiv:2210.15315  [pdf, other]

    cs.DC cs.NI cs.PF

    Noise in the Clouds: Influence of Network Performance Variability on Application Scalability

    Authors: Daniele De Sensi, Tiziano De Matteis, Konstantin Taranov, Salvatore Di Girolamo, Tobias Rahn, Torsten Hoefler

    Abstract: Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introd…

    Submitted 1 November, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: To appear in SIGMETRICS 2023

    ACM Class: C.2; C.4

  8. arXiv:2209.01346  [pdf, other]

    cs.DC cs.AI cs.AR cs.NI cs.PF

    HammingMesh: A Network Topology for Large-Scale Deep Learning

    Authors: Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Girolamo, Shigang Li, Marco Heddes, Jon Belk, Deepak Goel, Miguel Castro, Steve Scott

    Abstract: Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale t…

    Submitted 21 October, 2022; v1 submitted 3 September, 2022; originally announced September 2022.

    Comments: published at ACM/IEEE Supercomputing (SC22)

  9. arXiv:2206.10007  [pdf, other]

    cs.NI

    Building Blocks for Network-Accelerated Distributed File Systems

    Authors: Salvatore Di Girolamo, Daniele De Sensi, Konstantin Taranov, Milos Malesevic, Maciej Besta, Timo Schneider, Severin Kistler, Torsten Hoefler

    Abstract: High-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, applications are doomed to become I/O bound and waste compute cycles. To accelerate the data path to remote storage nodes, remote direct memory access (RDMA) has been embraced by storage systems to let data flow from the network to storage targets, reduci…

    Submitted 20 June, 2022; originally announced June 2022.

  10. arXiv:2202.08080  [pdf, other]

    cs.CR cs.DC

    NeVerMore: Exploiting RDMA Mistakes in NVMe-oF Storage Applications

    Authors: Konstantin Taranov, Benjamin Rothenberger, Daniele De Sensi, Adrian Perrig, Torsten Hoefler

    Abstract: This paper presents a security analysis of the InfiniBand architecture, a prevalent RDMA standard, and NVMe-over-Fabrics (NVMe-oF), a prominent protocol for industrial disaggregated storage that exploits RDMA protocols to achieve low-latency and high-bandwidth access to remote solid-state devices. Our work, NeVerMore, discovers new vulnerabilities in RDMA protocols that unveils several attack vect…

    Submitted 16 February, 2022; originally announced February 2022.

  11. arXiv:2106.15565  [pdf, other]

    cs.DC cs.AR cs.NI

    Flare: Flexible In-Network Allreduce

    Authors: Daniele De Sensi, Salvatore Di Girolamo, Saleh Ashkboos, Shigang Li, Torsten Hoefler

    Abstract: The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, that aggregate the data received from the hosts, and send them back the aggregated result. However, existing solutions provide limited customization opportunities…

    Submitted 29 June, 2021; originally announced June 2021.

    ACM Class: C.2.4; C.2.1; B.4.3

    Journal ref: Published in Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '21) (2021)

  12. An In-Depth Analysis of the Slingshot Interconnect

    Authors: Daniele De Sensi, Salvatore Di Girolamo, Kim H. McMahon, Duncan Roweth, Torsten Hoefler

    Abstract: The interconnect is one of the most critical components in large scale computing systems, and its impact on the performance of applications is going to increase with the system size. In this paper, we will describe Slingshot, an interconnection network for large scale computing systems. Slingshot is based on high-radix switches, which allow building exascale and hyperscale datacenters networks wit…

    Submitted 20 August, 2020; originally announced August 2020.

    Comments: To be published in Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '20) (2020)

    ACM Class: C.2.1; C.2.2; C.2.4; C.5.1

    Journal ref: Published in Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '20) (2020)

  13. arXiv:1909.07865  [pdf, other]

    cs.DC cs.NI cs.PF

    Mitigating Network Noise on Dragonfly Networks through Application-Aware Routing

    Authors: Daniele De Sensi, Salvatore Di Girolamo, Torsten Hoefler

    Abstract: System noise can negatively impact the performance of HPC systems, and the interconnection network is one of the main factors contributing to this problem. To mitigate this effect, adaptive routing sends packets on non-minimal paths if they are less congested. However, while this may mitigate interference caused by congestion, it also generates more traffic since packets traverse additional hops,…

    Submitted 17 September, 2019; originally announced September 2019.

    Comments: Accepted at The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '19)

    ACM Class: C.2.2; C.2.1; C.2.4; C.5.1

    Journal ref: Published in Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '19) (2019)