Showing 1–19 of 19 results for author: Di Girolamo, S

  1. arXiv:2309.16214

    cs.DC cs.NI

    Canary: Congestion-Aware In-Network Allreduce Using Dynamic Trees

    Authors: Daniele De Sensi, Edgar Costa Molero, Salvatore Di Girolamo, Laurent Vanbever, Torsten Hoefler

    Abstract: The allreduce operation is an essential building block for many distributed applications, ranging from the training of deep learning models to scientific computing. In an allreduce operation, data from multiple hosts is aggregated together and then broadcasted to each host participating in the operation. Allreduce performance can be improved by a factor of two by aggregating the data directly in t…

    Submitted 28 September, 2023; originally announced September 2023.

    ACM Class: C.2.1; C.2.2; C.2.4; C.5.1

  2. arXiv:2309.03628

    cs.NI cs.DC cs.OS eess.SY

    OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs

    Authors: Mikhail Khalilov, Marcin Chrapek, Siyuan Shen, Alessandro Vezzu, Thomas Benz, Salvatore Di Girolamo, Timo Schneider, Daniele De Sensi, Luca Benini, Torsten Hoefler

    Abstract: Multi-tenancy is essential for unleashing SmartNIC's potential in datacenters. Our systematic analysis in this work shows that existing on-path SmartNICs have resource multiplexing limitations. For example, existing solutions lack multi-tenancy capabilities such as performance isolation and QoS provisioning for compute and IO resources. Compared to standard NIC data paths with a well-defined set o…

    Submitted 13 March, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

    Comments: 12 pages, 14 figures, 103 references

  3. arXiv:2210.15315

    cs.DC cs.NI cs.PF

    Noise in the Clouds: Influence of Network Performance Variability on Application Scalability

    Authors: Daniele De Sensi, Tiziano De Matteis, Konstantin Taranov, Salvatore Di Girolamo, Tobias Rahn, Torsten Hoefler

    Abstract: Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introd…

    Submitted 1 November, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: To appear in SIGMETRICS 2023

    ACM Class: C.2; C.4

  4. arXiv:2209.01346

    cs.DC cs.AI cs.AR cs.NI cs.PF

    HammingMesh: A Network Topology for Large-Scale Deep Learning

    Authors: Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Girolamo, Shigang Li, Marco Heddes, Jon Belk, Deepak Goel, Miguel Castro, Steve Scott

    Abstract: Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale t…

    Submitted 21 October, 2022; v1 submitted 3 September, 2022; originally announced September 2022.

    Comments: published at ACM/IEEE Supercomputing (SC22)

  5. arXiv:2206.10007

    cs.NI

    Building Blocks for Network-Accelerated Distributed File Systems

    Authors: Salvatore Di Girolamo, Daniele De Sensi, Konstantin Taranov, Milos Malesevic, Maciej Besta, Timo Schneider, Severin Kistler, Torsten Hoefler

    Abstract: High-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, applications are doomed to become I/O bound and waste compute cycles. To accelerate the data path to remote storage nodes, remote direct memory access (RDMA) has been embraced by storage systems to let data flow from the network to storage targets, reduci…

    Submitted 20 June, 2022; originally announced June 2022.

  6. arXiv:2202.13976

    cs.DC

    Asynchronous Distributed-Memory Triangle Counting and LCC with RMA Caching

    Authors: András Strausz, Flavio Vella, Salvatore Di Girolamo, Maciej Besta, Torsten Hoefler

    Abstract: Triangle count and local clustering coefficient are two core metrics for graph analysis. They find broad application in analyses such as community detection and link recommendation. Current state-of-the-art solutions suffer from synchronization overheads or expensive pre-computations needed to distribute the graph, achieving limited scaling capabilities. We propose a fully asynchronous implementat…

    Submitted 1 March, 2022; v1 submitted 28 February, 2022; originally announced February 2022.

    Comments: 11 pages, 10 figures, to be published at IPDPS'22

  7. arXiv:2106.15565

    cs.DC cs.AR cs.NI

    Flare: Flexible In-Network Allreduce

    Authors: Daniele De Sensi, Salvatore Di Girolamo, Saleh Ashkboos, Shigang Li, Torsten Hoefler

    Abstract: The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, which aggregate the data received from the hosts and send the aggregated result back to them. However, existing solutions provide limited customization opportunities…

    Submitted 29 June, 2021; originally announced June 2021.

    ACM Class: C.2.4; C.2.1; B.4.3

    Journal ref: Published in Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '21) (2021)

  8. arXiv:2105.12663

    cs.NI cs.DC cs.PF

    Towards Million-Server Network Simulations on Just a Laptop

    Authors: Maciej Besta, Marcel Schneider, Salvatore Di Girolamo, Ankit Singla, Torsten Hoefler

    Abstract: The growing size of data center and HPC networks poses unprecedented requirements on the scalability of simulation infrastructure. The ability to simulate such large-scale interconnects on a simple PC would facilitate research efforts. Unfortunately, as we first show in this work, existing shared-memory packet-level simulators do not scale to the sizes of the largest networks considered today. We t…

    Submitted 26 May, 2021; originally announced May 2021.

  9. arXiv:2104.07582

    cs.AR cs.DC cs.DS cs.PF

    SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems

    Authors: Maciej Besta, Raghavendra Kanakagiri, Grzegorz Kwasniewski, Rachata Ausavarungnirun, Jakub Beránek, Konstantinos Kanellopoulos, Kacper Janda, Zur Vonarburg-Shmaria, Lukas Gianinazzi, Ioana Stefan, Juan Gómez Luna, Marcin Copik, Lukas Kapp-Schwoerer, Salvatore Di Girolamo, Marek Konieczny, Nils Blach, Onur Mutlu, Torsten Hoefler

    Abstract: Simple graph algorithms such as PageRank have been the target of numerous hardware accelerators. Yet, there also exist much more complex graph mining algorithms for problems such as clustering or maximal clique listing. These algorithms are memory-bound and thus could be accelerated by hardware techniques such as Processing-in-Memory (PIM). However, they also come with nonstraightforward paralleli…

    Submitted 25 October, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

    Comments: Proceedings of the 54th IEEE/ACM International Symposium on Microarchitecture (MICRO'21), 2021

  10. arXiv:2010.03536

    cs.NI cs.DC

    PsPIN: A high-performance low-power architecture for flexible in-network compute

    Authors: Salvatore Di Girolamo, Andreas Kurth, Alexandru Calotoiu, Thomas Benz, Timo Schneider, Jakub Beránek, Luca Benini, Torsten Hoefler

    Abstract: The capacity of offloading data and control tasks to the network is becoming increasingly important, especially if we consider the faster growth of network speed when compared to CPU frequencies. In-network compute alleviates the host CPU load by running tasks directly in the network, enabling additional computation/communication overlap and potentially improving overall application performance. H…

    Submitted 1 June, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

  11. An In-Depth Analysis of the Slingshot Interconnect

    Authors: Daniele De Sensi, Salvatore Di Girolamo, Kim H. McMahon, Duncan Roweth, Torsten Hoefler

    Abstract: The interconnect is one of the most critical components in large scale computing systems, and its impact on the performance of applications is going to increase with the system size. In this paper, we will describe Slingshot, an interconnection network for large scale computing systems. Slingshot is based on high-radix switches, which allow building exascale and hyperscale datacenter networks wit…

    Submitted 20 August, 2020; originally announced August 2020.

    Comments: To be published in Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '20) (2020)

    ACM Class: C.2.1; C.2.2; C.2.4; C.5.1

    Journal ref: Published in Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '20) (2020)

  12. arXiv:2007.03776

    cs.NI cs.DC cs.PF

    High-Performance Routing with Multipathing and Path Diversity in Ethernet and HPC Networks

    Authors: Maciej Besta, Jens Domke, Marcel Schneider, Marek Konieczny, Salvatore Di Girolamo, Timo Schneider, Ankit Singla, Torsten Hoefler

    Abstract: The recent line of research into topology design focuses on lowering network diameter. Many low-diameter topologies such as Slim Fly or Jellyfish that substantially reduce cost, power consumption, and latency have been proposed. A key challenge in realizing the benefits of these topologies is routing. On one hand, these networks provide shorter path lengths than established topologies such as Clos…

    Submitted 29 October, 2020; v1 submitted 7 July, 2020; originally announced July 2020.

    Journal ref: IEEE Transactions on Parallel and Distributed Systems (TPDS), 2021

  13. Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

    Authors: Shigang Li, Tal Ben-Nun, Giorgi Nadiradze, Salvatore Di Girolamo, Nikoli Dryden, Dan Alistarh, Torsten Hoefler

    Abstract: Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating…

    Submitted 20 February, 2021; v1 submitted 30 April, 2020; originally announced May 2020.

    Comments: Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 2021

    ACM Class: C.1.4; D.1.3; I.2

    Journal ref: in IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 7, pp. 1725-1739, 1 July 2021

  14. arXiv:1909.07865

    cs.DC cs.NI cs.PF

    Mitigating Network Noise on Dragonfly Networks through Application-Aware Routing

    Authors: Daniele De Sensi, Salvatore Di Girolamo, Torsten Hoefler

    Abstract: System noise can negatively impact the performance of HPC systems, and the interconnection network is one of the main factors contributing to this problem. To mitigate this effect, adaptive routing sends packets on non-minimal paths if they are less congested. However, while this may mitigate interference caused by congestion, it also generates more traffic since packets traverse additional hops,…

    Submitted 17 September, 2019; originally announced September 2019.

    Comments: Accepted at The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '19)

    ACM Class: C.2.2; C.2.1; C.2.4; C.5.1

    Journal ref: Published in Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC '19) (2019)

  15. Network-Accelerated Non-Contiguous Memory Transfers

    Authors: Salvatore Di Girolamo, Konstantin Taranov, Andreas Kurth, Michael Schaffner, Timo Schneider, Jakub Beránek, Maciej Besta, Luca Benini, Duncan Roweth, Torsten Hoefler

    Abstract: Applications often communicate data that is non-contiguous in the send- or the receive-buffer, e.g., when exchanging a column of a matrix stored in row-major order. While non-contiguous transfers are well supported in HPC (e.g., MPI derived datatypes), they can still be up to 5x slower than contiguous transfers of the same size. As we enter the era of network acceleration, we need to investigate w…

    Submitted 22 August, 2019; originally announced August 2019.

    Comments: In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), Nov. 2019

  16. Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

    Authors: Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler

    Abstract: Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, w…

    Submitted 25 February, 2020; v1 submitted 12 August, 2019; originally announced August 2019.

    Comments: Published in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61. 2020

  17. arXiv:1906.10885

    cs.NI

    FatPaths: Routing in Supercomputers and Data Centers when Shortest Paths Fall Short

    Authors: Maciej Besta, Marcel Schneider, Karolina Cynk, Marek Konieczny, Erik Henriksson, Salvatore Di Girolamo, Ankit Singla, Torsten Hoefler

    Abstract: We introduce FatPaths: a simple, generic, and robust routing architecture that enables state-of-the-art low-diameter topologies such as Slim Fly to achieve unprecedented performance. FatPaths targets Ethernet stacks in both HPC supercomputers as well as cloud data centers and clusters. FatPaths exposes and exploits the rich ("fat") diversity of both minimal and non-minimal paths for high-performan…

    Submitted 11 November, 2020; v1 submitted 26 June, 2019; originally announced June 2019.

    Journal ref: Proceedings of the ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC20), November 2020

  18. arXiv:1902.03154

    cs.DC

    SimFS: A Simulation Data Virtualizing File System Interface

    Authors: Salvatore Di Girolamo, Pirmin Schmid, Thomas Schulthess, Torsten Hoefler

    Abstract: Nowadays simulations can produce petabytes of data to be stored in parallel filesystems or large-scale databases. This data is accessed over the course of decades often by thousands of analysts and scientists. However, storing these volumes of data for long periods of time is not cost effective and, in some cases, practically impossible. We propose to transparently virtualize the simulation data,…

    Submitted 24 January, 2019; originally announced February 2019.

  19. arXiv:1709.05483

    cs.DC

    sPIN: High-performance streaming Processing in the Network

    Authors: Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, Ron Brightwell

    Abstract: Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today's network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload s…

    Submitted 19 October, 2017; v1 submitted 16 September, 2017; originally announced September 2017.

    Comments: 20 pages