
A High-Performance Design, Implementation, Deployment,
and Evaluation of The Slim Fly Network

Nils Blach1, Maciej Besta1, Daniele De Sensi1,2, Jens Domke3,
  Hussein Harake5, Shigang Li1,4, Patrick Iff1, Marek Konieczny6, Kartik Lakhotia7,   
Ales Kubicek1, Marcel Ferrari1, Fabrizio Petrini7, Torsten Hoefler1
   1 ETH Zürich  2 Sapienza University of Rome  3 RIKEN Center for Computational Science (R-CCS)
4 BUPT, Beijing  5 Swiss National Supercomputing Centre (CSCS)  6 AGH-UST  7 Intel Labs
   { nils.blach, maciej.besta, htor } @ inf.ethz.ch
Abstract

Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, or Dragonfly. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on deployment, management, and operational aspects of our test cluster with 200 servers and carefully analyze performance. We demonstrate techniques for simple cabling and cabling validation as well as a novel high-performance routing architecture for InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF’s strong performance for many modern workloads such as deep neural network training, graph analytics, or linear algebra kernels. SF outperforms non-blocking Fat Trees in scalability while offering comparable or better performance and lower cost for large network sizes. Our work can facilitate deploying SF, while the associated open-source routing architecture (https://github.com/spcl/opensm) is fully portable and applicable to accelerate any low-diameter interconnect.

Figure 1: First real-world deployment of the Slim Fly topology. The left-most rack displays labels detailing the arrangement of various components such as InfiniBand (IB) switches, compute nodes and Ethernet switches. Two types of IB links are present: black copper links for intra-rack connections and orange optical fiber links for inter-rack connections. The orange lines above the racks represent bundles of ten optical fiber links each. Additionally, blue, white and green (arbitrary color scheme) Ethernet cables are visible within the racks, which establish the cluster management network together with the Ethernet switches.

1 INTRODUCTION

Low-diameter network topologies such as Slim Fly (SF) [besta2014slim] have gained significant traction during the last decade (the network diameter is the maximum distance between any two switches). Initial designs in that line of work, Dragonfly (DF) [dally08] and Flattened Butterfly [dally07], both with diameter three, focused on improving latency and physical layout. After that, SF lowered the diameter to two, based on the observation that a low diameter not only improves performance by reducing end-to-end latencies, but also reduces cost and power consumption. This is because, when the diameter is lower, packets on average traverse fewer switches, switch buffers, and links. Thus, fewer links and buffers are needed to construct the network (for a fixed bandwidth), and less dynamic power is needed for moving the packets through the network.

SF’s construction costs, consumed power, and latency are lower than those of Clos and Fat Tree (FT) by, respectively, $\approx$25-30%, $\approx$25-30%, and $\approx$50% [besta2014slim]. However, SF has still not seen a real physical deployment, and it is uncertain how to deploy SF in practice. To spearhead the practical development of low-diameter networks and show the state-of-the-practice, we design, implement, deploy, and evaluate the first SF installation that includes switches and endpoints, as shown in Fig. 1. We discuss the encountered challenges, and we show that the construction process is straightforward and comparable to established designs such as Clos.

Moreover, to maximize performance benefits from using SF, we design and implement a novel high-performance multipath routing scheme for general low-diameter networks, and we install and use it with the deployed SF cluster. Our routing shows superior performance over the state-of-the-art, and it is independent of the underlying topology details and of the interconnect architecture. Thus, it could be portably used on different topologies (e.g., Xpander [valadarsky2015]) and on different architectures (e.g., Ethernet or InfiniBand [IBAspec]).

Figure 2: The structure of a small example Fat Tree (FT), Dragonfly (DF), and Slim Fly (SF), and the corresponding installations. Each topology comes with a modular design, where switches form groups (SF, DF) or pods (FT). Such groups can become racks in a physical installation.

The equipment available to us is based on the InfiniBand (IB) architecture [IBAspec]. IB enables a high-speed switched fabric with hardware (HW) support for remote direct memory access (RDMA) [gerstenberger2013enabling, di2019network]. IB is widely used in high-performance systems; for example, four of the ten most powerful systems in the Top500 list (Jun. 2023 issue) [dongarra1997top500], manufactured by IBM, Nvidia, and Atos, use the IB interconnect. We use our routing protocol with the IB networking stack; our whole implementation is publicly available to foster future research into multipath routing. Importantly, we provide the first multipathing for IB that can use arbitrary paths (including non-minimal and disjoint ones) and that is independent of the structure details of the underlying network [besta2020highr, domke_hyperx_2019].

In our evaluation, we consider a broad range of communication-intense applications that represent traditional dense computations (like physics simulations), sparse graph processing [besta2017push, besta2019demystifying, sisa, gms, besta2020high, besta2019practice], deep neural network (DNN) training [ben2019modular, besta2022parallel, besta2023high], and a number of microbenchmarks testing particular popular communication patterns. Our results showcase that SF delivers high performance while achieving optimal or near-optimal scalability, which directly translates to low construction costs. To further reinforce these outcomes, we also conduct a comprehensive comparison between SF and a non-blocking FT that we deploy using the same hardware. Here, SF offers performance comparable to or better than that of FT in the majority of the applications used. Simultaneously, its superior scalability ensures up to 50% cost improvements over FT, particularly for large installation sizes [besta2014slim].

2 NETWORK MODEL & TOPOLOGIES

We start with fundamental concepts and notation. We model a network as an undirected graph $G=(V,E)$; $V$ is a set of switches ($|V|=N_r$) and $E$ is a set of full-duplex inter-switch cables (we do not model endpoints explicitly). We abstract away HW details and denote switches and routers with a common term “switch”; however, we use the term “routing” when referring to determining a path, because IB switches in our physical implementation have routing capabilities. A network has $N$ endpoints, with $p$ endpoints attached to each switch (concentration). We also use the term node to refer to either a switch or any of its endpoints, when the discussion is generic. Total port count in a switch (radix) is $k=k'+p$, where $k'$ is the number of channels from a switch to other switches (network radix). The diameter is $D$. All the symbols are listed in Tab. 1.

Table 1: The most important symbols used in this work.
Symbol | Description
$V, E$ | Sets of vertices/edges (switches/links, $V=\{0,\dots,N_r-1\}$).
$N$ | The number of endpoints in the network.
$N_r$ | The number of switches in the network ($N_r=|V|$).
$p$ | The number of endpoints attached to a switch.
$k'$ | The number of channels from a switch to other switches.
$k$ | Switch radix ($k=k'+p$).
$D, d$ | Network diameter and the average path length.

We overview SF’s structure in Fig. 2, and compare it to a 3-level Fat Tree with diameter four, as such Fat Trees are widely used in medium and large installations [al2008scalable, niranjan2009portland], and to a diameter-3 Dragonfly, which has also been deployed in practice [aries, slingshot-desensi]. First, SF has $>$50% fewer switches and $>$55% fewer cables than a full-bandwidth non-blocking FT of a comparable size. Second, SF’s switches form groups that are not necessarily fully connected; FT’s edge and aggregation switches form pods, and DF’s groups are fully connected. Third, both SF and DF are direct topologies (each switch is attached to some number of servers), while in a FT, only edge switches attach to servers.

3 FIRST AT-SCALE SF INSTALLATION

Figure 3: Internal organization of a rack. The image displays a side-by-side comparison of a theoretical diagram and an actual photograph of a single rack in the cluster. The rack consists of two distinct subgroups, each housing 5 IB switches and 20 compute nodes (endpoints), for a total of 10 switches and 40 endpoints per rack. Each IB switch is connected to 4 endpoints and 7 other IB switches.

We start by discussing the deployment of the first SF cluster, illustrating the simplicity of its construction and arguing why deploying other SFs would also be straightforward. The cluster is hosted by the Swiss National Supercomputing Centre (CSCS).

3.1 Deployed Hardware Equipment

We use 50 36-port, 56 Gb/s IB SX6036 switches and 200 compute endpoints. Each endpoint hosts two 20-core Intel Xeon CPUs and 32 GiB RAM, split equally in a Non-Uniform Memory Access (NUMA) configuration, and a single Mellanox ConnectX-3 MT4099 HCA, which implements the IB Architecture Specification Volume 1, Release 1.2. Copper and optical cables are used for intra- and inter-rack switch connections, respectively.

3.2 Topology Structure and Construction

We use a SF based on the graphs by McKay, Miller, and Širáň [mckay1998note]. We outline its structure; the details are in Appendix A and in the original SF paper [besta2014slim]. The complete SF installation is shown in Fig. 1 with a highlighted view of the group structure in Fig. 3. One first chooses a prime power $q$; $q$ is an input parameter that determines the whole topology structure. For example, the number of vertices (switches) is $N_r=2q^2$ and the network radix is $k'=\frac{3q-\delta}{2}$, where $\delta\in\{-1,0,1\}$ is such that $q=4w+\delta$ for some integer $w$ [besta2014slim]. In our case, $N_r=50$, thus $q=5$, $\delta=1$, and $k'=7$ (every switch connects to 7 other switches). Interestingly, this construction forms the famous Hoffman-Singleton graph [hoffman1960moore, hafner2003hoffman], which is optimal with respect to the Moore Bound [MooreBound]. Finally, one uses $p=\left\lceil\frac{k'}{2}\right\rceil$ endpoints connected to each switch to ensure full global bandwidth [besta2014slim]. In our case, $p=4$. Note that, while the switch port count in the considered SF is $k'+p=11$ (and 11-port switches would be the appropriate selection when building the SF from scratch), we use 36-port switches because this has been the only HW equipment available to us.
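The parameter relations above can be checked with a short sketch. It assumes the convention from the original SF paper that $q=4w+\delta$ with $\delta\in\{-1,0,1\}$; the function and key names are illustrative and not part of the authors' tooling.

```python
import math


def is_prime_power(q: int) -> bool:
    """Return True if q = p^n for a prime p and n >= 1."""
    if q < 2:
        return False
    for p in range(2, q + 1):
        if q % p == 0:  # p is the smallest prime factor of q
            while q % p == 0:
                q //= p
            return q == 1
    return False


def slim_fly_parameters(q: int) -> dict:
    """Derive the SF sizes quoted in the text from the input parameter q."""
    if not is_prime_power(q):
        raise ValueError("q must be a prime power")
    residue = q % 4
    if residue == 2:
        raise ValueError("q is not of the form 4w + delta with delta in {-1, 0, 1}")
    delta = {0: 0, 1: 1, 3: -1}[residue]
    network_radix = (3 * q - delta) // 2          # k' = (3q - delta) / 2
    switches = 2 * q * q                          # N_r = 2 q^2
    concentration = math.ceil(network_radix / 2)  # p = ceil(k' / 2)
    return {
        "q": q,
        "delta": delta,
        "switches": switches,
        "network_radix": network_radix,
        "concentration": concentration,
        "endpoints": switches * concentration,
        "radix": network_radix + concentration,
    }


params = slim_fly_parameters(5)
print(params)
```

For $q=5$, this yields 50 switches, $k'=7$, $p=4$, 200 endpoints, and a port count of 11, which agrees with the deployed configuration described above.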

The whole installation consists of five identical racks. Every two racks are connected with the same number of $2q=10$ cables. There are $2q=10$ switches in each rack. Each rack consists of two subgroups, subgroup 0 and subgroup 1. All subgroups 0 and all subgroups 1 are identical, but a subgroup 0 and a subgroup 1 are usually different. We place switches from subgroup 0, together with their attached endpoints, at the top of each rack; subgroup 1 goes to the bottom of the rack. The details of how any two switches are connected are determined by the underlying algebraic structure of the SF topology. We offer full details in Appendix A, with § A.3 explaining the three simple equations that determine switch connectivity; here, we stress that the deployment is straightforward.
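The rack-level properties stated above can be reproduced with a small sketch of the $q=5$ construction. It follows the MMS connectivity rules as given in the original SF paper [besta2014slim] (with primitive element 2 of $\mathbb{Z}_5$, generator sets $X=\{1,4\}$ and $X'=\{2,3\}$); it is not taken from Appendix A. The mapping of the second label component to a rack is an assumption made for illustration, and all names are hypothetical.

```python
from collections import deque
from itertools import combinations

Q = 5
X = {1, 4}        # even powers of the primitive element 2 (mod 5)
X_PRIME = {2, 3}  # odd powers of the primitive element 2 (mod 5)


def build_slim_fly_q5():
    """Adjacency of the q=5 SF; switches are labeled (subgroup, rack, index)."""
    adj = {(s, a, b): set() for s in (0, 1) for a in range(Q) for b in range(Q)}

    def connect(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for a in range(Q):
        for b1, b2 in combinations(range(Q), 2):
            if (b1 - b2) % Q in X:        # links inside a subgroup 0
                connect((0, a, b1), (0, a, b2))
            if (b1 - b2) % Q in X_PRIME:  # links inside a subgroup 1
                connect((1, a, b1), (1, a, b2))
    for x in range(Q):
        for y in range(Q):
            for m in range(Q):            # links between subgroup types
                connect((0, x, y), (1, m, (y - m * x) % Q))
    return adj


def diameter(adj):
    """Largest BFS distance over all switch pairs."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst


def cables_between_racks(adj, r1, r2):
    """Number of links with one end in rack r1 and the other in rack r2 (r1 != r2)."""
    return sum(1 for u in adj if u[1] == r1 for v in adj[u] if v[1] == r2)


adj = build_slim_fly_q5()
print(len(adj), diameter(adj), cables_between_racks(adj, 0, 1))
```

Under these assumptions, the sketch gives 50 switches of degree 7, diameter 2, and 10 cables between every pair of racks, with 3 intra-rack and 4 inter-rack links per switch.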

3.3 Deployment Efficiency and Ease

To facilitate deployment, we develop scripts that outline both intra- and inter-rack connections. The output of these scripts can be used to create diagrams for every rack pair to ensure a smooth wiring process. Thanks to the algebraic structure of the SF topology, such descriptions for any SF can be automatically generated, providing concrete port-to-port link descriptions and rack placements for each switch. We illustrate an example diagram of connections between racks 0 and 1, and between 0 and 2, that was created based on these generated descriptions, in Fig. 4.

We use our scripts as the basis of an efficient 3-step wiring process. First, we wire intra-subgroup connections; they are identical across all racks for each of the two subgroups. The second step consists of connecting each switch from subgroup 0 to its neighboring switches in subgroup 1 within the same rack. As the subgroups are of equal size, an incorrectly connected pair will result in easily recognizable errors, which break that symmetry. Lastly, the inter-rack connections are established. Here, the fact that each switch in a rack uses the same port to connect to the switches in another rack enables straightforward connection of rack pairs.

The simplicity of the wiring process can mainly be attributed to the scalable three-step approach, which is equally applicable to larger SF topologies, enabling the efficient deployment of SF clusters. Overall, stripping the previous system and executing the 3-step wiring process were completed within 3 days by a team of two.

Figure 4: Illustration of the example diagrams created from the output of our scripts, facilitating the cabling process. The diagrams show all the inter-rack connections and the corresponding ports in switches. Each switch is labeled using a triple $(S,R,I)$, where $S\in\{0,1\}$ indicates the subgroup type, $R\in\{0,\dots,4\}$ indicates the rack, and $I\in\{0,\dots,4\}$ is the consecutive switch ID within a rack/subgroup. We only show ports 8–11; these ports are used to connect racks. Ports 1–4 (for endpoints) and 5–7 (for intra-rack switch-switch links) are omitted for clarity. The equations presented in § A.3 determine which switches are connected based on the assigned labels.

3.4 Correctness Verification

We provide a set of scripts that ensure the correctness of the cabling. These scripts take the auto-generated port-to-port link descriptions and rack placements for each switch and compare them with the output of ibnetdiscover, an IB command that performs fabric discovery. This allows us not only to identify incorrectly wired cables and provide concrete instructions on how to rectify mistakes, but also to detect missing or broken links. These scripts could even be used on a live cluster, while going through the wiring process, to immediately identify and flag errors.
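The core of such a check is a comparison of two sets of port-to-port links. The following is a minimal sketch of that idea only; it assumes that both the generated description and the discovered fabric have already been parsed into pairs of (switch label, port) tuples. The parsing of ibnetdiscover output is not shown, and the label format and function names are hypothetical.

```python
def normalize(links):
    """Make links direction-independent: {A, B} is the same cable as {B, A}."""
    return {frozenset((tuple(a), tuple(b))) for a, b in links}


def check_cabling(expected, discovered):
    """Return (missing, unexpected) cables as sorted lists of endpoint pairs."""
    exp, got = normalize(expected), normalize(discovered)
    missing = sorted(tuple(sorted(link)) for link in exp - got)
    unexpected = sorted(tuple(sorted(link)) for link in got - exp)
    return missing, unexpected


# Hypothetical example: the cable on port 9 of switch "0-0-0" went to the wrong switch.
expected = [
    (("0-0-0", 8), ("1-1-0", 8)),
    (("0-0-0", 9), ("1-2-0", 8)),
]
discovered = [
    (("1-1-0", 8), ("0-0-0", 8)),
    (("0-0-0", 9), ("1-3-0", 8)),
]
missing, unexpected = check_cabling(expected, discovered)
for a, b in missing:
    print(f"missing or broken link: {a} <-> {b}")
for a, b in unexpected:
    print(f"incorrectly wired cable: {a} <-> {b}")
```

A port that appears in both a missing and an unexpected link points at a cable plugged into the wrong peer, while a link that is only missing suggests an absent or broken cable.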

4 HIGH-PERFORMANCE MULTIPATHING

We now propose a novel high-performance multipath routing protocol for low-diameter networks, which we use on the described SF deployment. For this, we extend the recently proposed FatPaths multipath routing protocol [besta2020fatpaths] so that it offers vastly superior throughput while still ensuring very low latency.

Figure 5: Layered routing in FatPaths and in this work. Traffic is divided and sent using different layers. Our scheme relaxes the requirement in FatPaths for all layers to be trees, as in our scheme deadlock resolution is decoupled from layer creation. This ensures more flexibility in developing layers, leading to more throughput. Specifically, while in FatPaths, paths in different layers often overlap (cf. Layer 1 and 2), our routing alleviates this issue and reduces overlap/congestion and increases performance.

4.1 Original FatPaths Routing in Slim Fly

In terms of path diversity, FT has multiple same-length minimal paths between any two edge switches. Thus, one often uses ECMP [hopps2000analysis] for multipath routing in FT. In SF (and to some degree in DF [besta2020fatpaths]), there is usually only one minimal path, but multiple “almost” minimal paths between any switch pair. This makes it challenging to achieve high path diversity in SF using ECMP. To alleviate this and to enable non-minimal high-speed multipathing in SF, the FatPaths architecture has recently been proposed [besta2020fatpaths]. FatPaths harnesses the concept of layered routing [mudigonda2010spain, stephens2012past] for low-diameter networks. In layered routing, one first creates layers: subsets of switch-switch links. Within one layer, one uses shortest-path routing. However, as a layer does not contain all the links, paths within this layer are usually non-minimal (in the global sense). If two nodes want to communicate using multiple paths, the sending node simply sends its data using paths residing in different layers (multipathing can be applied both at the switch and at the endpoint level; thus, we use the term “node” to refer to switches or endpoints when the discussion is generic). Note that multipathing is orthogonal to transport-level issues, and one can use different layers to transfer different flows between two nodes, but also different packets or flowlets within one flow [besta2020fatpaths]. In FatPaths, selecting links (when constructing layers) is done with simple random uniform sampling; a more elaborate scheme minimizing load imbalance is also provided. Layered routing is summarized in Fig. 5.
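The layered-routing idea described above can be illustrated with a minimal sketch: layers are subsets of links, and routing within a layer uses shortest paths restricted to that subset. The sketch assumes that the first layer keeps all links and that the remaining layers are sampled uniformly at random with a hypothetical `keep_fraction` parameter; it ignores the additional constraints of FatPaths (such as acyclic layers), and all names are illustrative.

```python
import random
from collections import deque


def sample_layers(edges, num_layers, keep_fraction, seed=0):
    """First layer keeps every link; the others keep each link with a fixed probability."""
    rng = random.Random(seed)
    layers = [list(edges)]
    for _ in range(num_layers - 1):
        layers.append([e for e in edges if rng.random() < keep_fraction])
    return layers


def shortest_path(layer_edges, src, dst):
    """BFS shortest path using only the links of one layer; None if disconnected."""
    adj = {}
    for u, v in layer_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, []):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


# A 5-switch ring: removing link (0, 1) from a layer forces a non-minimal route.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
layers = sample_layers(ring, num_layers=3, keep_fraction=0.8)
minimal = shortest_path(layers[0], 0, 1)
detour = shortest_path([e for e in ring if e != (0, 1)], 0, 1)
print(minimal, detour)
```

The example shows why paths within a sparse layer are non-minimal in the global sense: the layer without link (0, 1) routes from switch 0 to switch 1 around the ring.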

4.2 Proposed Multipath Routing: Summary

The central issue in layered routing is how to divide links into layers. We aim to minimize the number of layers (which minimizes the usage of HW resources in switches) while simultaneously maximizing the number of disjoint and almost-minimal paths between any switch pair (for more path diversity). Moreover, a detailed analysis from FatPaths indicates that – to maintain high performance in layered routing in virtually all low-diameter networks and traffic patterns – at least three disjoint paths per switch pair are needed [besta2020fatpaths]. Thus, the main goal of the layer construction algorithm is to find a minimum set of layers that together provide each switch pair with at least three disjoint paths while ensuring minimum overlap between specific layers. Ideally, these three paths include the minimal one (that always exists) and two “almost” minimal ones (in the following, an “almost” minimal path means a path that is longer by one hop than the minimal path between two given switches).

An overview of our proposed layer routing is shown in Fig. 5 (right). The key difference between our scheme and FatPaths is that we do not remove links from layers in order to ensure deadlock-freedom or to introduce non-minimal paths. Instead, we decouple deadlock resolution from layer creation, and explicitly construct paths satisfying the appropriate constraints on their count, non-minimal length, and well-balancedness. This facilitates creating layers that result in much higher throughput.

4.3 Generating Routing Layers

Our layer construction scheme is detailed in Algorithm 1. The input is the topology of inter-switch connections $G=(V,E)$, and the desired number of layers $|L|$. The output is a set of layers $L$, where each layer contains a collection of paths connecting different pairs of nodes. These paths together define a separate forwarding tree for each node.

The layer generation starts with assigning all links to layer 1 (indexed as layer 0 in Algorithm 1). In layer 1, we only use minimal paths, as we want to ensure that the single minimal path existing between all node pairs is included in at least one layer for each pair. Moreover, a matrix $W$ and a priority queue $p$ are initialized. These structures are used to find advantageous non-minimal paths for each node pair. Intuitively, a priority $p(u,v)$ of a node pair $u,v$ is determined by the number of non-minimal paths already assigned to $u,v$ (and maintained in other layers). The higher $p(u,v)$ is, the lower the priority of $u,v$ is. Hence, when looking for new non-minimal paths, node pairs with fewer paths assigned are prioritized. This facilitates balancing the number of advantageous paths across all pairs of nodes, to eliminate potential hotspots in the network.

Second, each entry $W(r,s)$ in matrix $W$ describes the weight of a link between switches $r,s$. This weight equals the number of paths (from any layer) that already use this link. The higher $W(r,s)$ is, the more paths use the corresponding link. Hence, when selecting new paths, we use $W$ to balance numbers of paths across single links, minimizing risk of congestion. We also use $W$ to balance the paths in the first layer to ensure minimal overlap of minimal paths.

Then, for every layer $2\dots|L|$, and for each node pair in each layer, we find a single almost-minimal path that minimizes overlap with respect to paths already added to any other layer. For this, when finding paths in a layer $l$, we first copy the current priorities of node pairs into a list that preserves the current state of priorities (copy_pairs). Here, node pairs with the same priority are in a random order, but come before any node pair with lower priority. Note that each node pair appears twice in the list, once for each direction. This enables using different paths when routing in different directions, further increasing the flexibility of path selection.

After that, we iterate over each node pair, in an attempt to construct a path for each such pair in each layer. Note that, in principle, it is possible that one cannot find a path for each node pair in each layer (we elaborate on dealing with such rare cases in § B.1; we resolve them with a simple fallback to a minimal path – our evaluation shows that this does not negatively impact throughput).

In each such iteration, we first use the find_path routine to try to find an almost-minimal path for a given node pair $pair$, based on already inserted paths for that layer (specified in $l$) and weights assigned to each link (specified in $W$). If we are able to find a valid path, we accordingly update priorities $p$ (update_priorities) and link weights $W$ (update_weights). Finally, we insert the path into layer $l$ (add_path_to_layer).

Input: Network topology $G=(V,E)$, number of layers $|L|$
Result: A set $L$ of routing layers
// $W\in\mathbb{R}^{N_r\times N_r}$ contains weights of links; $p$ is a priority queue, with entries being pairs of nodes
$W$ = init_link_weight_matrix()   // Set all matrix entries to 0
$p$ = init_p_queue($G$)   // Each node pair gets the same priority
$L$ = $\{E\}$   // Layer 0 contains all the links ($E$)
for $l=1$ to $|L|-1$ do
    init_layer($l$)   // Initialize the next layer as empty
    $node\_pairs$ = copy_pairs($p$)
    while $node\_pairs\neq\emptyset$ do
        $pair$ = $node\_pairs$.dequeue()
        $path$ = find_path($G$, $W$, $pair$, $l$)
        if valid($path$) then
            update_priorities($path$, $p$)
            update_weights($path$, $W$)
            add_path_to_layer($path$, $G$, $l$)
        end if
    end while
    $L=L\cup\{l\}$   // Add a new layer to finalized layers
end for
Algorithm 1: Construct routing layers; details are in § 4.3
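To make the control flow of Algorithm 1 concrete, the following is a small runnable sketch. The internals of find_path are not specified in this section, so the version below is an assumption: it exhaustively searches for the cheapest simple path (by summed link weights) that is exactly one hop longer than minimal and that agrees with the destination-based next hops already fixed in the current layer. Pairs without such a path are simply skipped; the fallback of § B.1 is not reproduced. Layer 0 is filled with weight-balanced minimal paths. The Petersen graph serves only as a tiny stand-in topology.

```python
import random
from collections import deque


def bfs_dist(adj, src):
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def find_path(adj, dist, W, pair, next_hop, extra):
    """Cheapest simple path of length dist + extra that respects fixed next hops."""
    src, dst = pair
    target_len = dist[src][dst] + extra
    best, best_cost = None, None
    stack = [(src, [src], 0)]
    while stack:
        u, path, cost = stack.pop()
        hops = len(path) - 1
        if u == dst:
            if hops == target_len and (best is None or cost < best_cost):
                best, best_cost = path, cost
            continue
        forced = next_hop.get((u, dst))  # entry already set by an earlier path
        for v in adj[u]:
            if v in path or (forced is not None and v != forced):
                continue
            if hops + 1 + dist[v][dst] > target_len:  # dst no longer reachable in time
                continue
            stack.append((v, path + [v], cost + W[(u, v)]))
    return best


def construct_layers(adj, num_layers, seed=0):
    rng = random.Random(seed)
    nodes = sorted(adj)
    dist = {u: bfs_dist(adj, u) for u in nodes}
    W = {(u, v): 0 for u in nodes for v in adj[u]}           # link weights
    pairs = [(u, v) for u in nodes for v in nodes if u != v]  # both directions
    prio = {pair: 0 for pair in pairs}                        # non-minimal paths so far
    layers = []
    for l in range(num_layers):
        extra = 0 if l == 0 else 1
        next_hop, paths = {}, {}
        # copy_pairs: fewest assigned paths first, random order among ties
        order = sorted(pairs, key=lambda pr: (prio[pr], rng.random()))
        for pair in order:
            path = find_path(adj, dist, W, pair, next_hop, extra)
            if path is None:
                continue
            prio[pair] += extra                               # update_priorities
            for a, b in zip(path, path[1:]):
                W[(a, b)] += 1                                # update_weights
                next_hop[(a, pair[1])] = b                    # add_path_to_layer
            paths[pair] = path
        layers.append({"next_hop": next_hop, "paths": paths})
    return layers


# Petersen graph (10 switches, degree 3, diameter 2) as a small test topology.
petersen = {i: set() for i in range(10)}
for i in range(5):
    for u, v in ((i, (i + 1) % 5), (i, i + 5), (i + 5, (i + 2) % 5 + 5)):
        petersen[u].add(v)
        petersen[v].add(u)

layers = construct_layers(petersen, num_layers=3)
print([len(layer["paths"]) for layer in layers])
```

In this stand-in topology, adjacent switches have no path of length two, so the sketch leaves those pairs without an entry in the non-minimal layers; this mirrors the cases the text mentions in which no path can be found for a pair in a layer.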

5 IMPLEMENTATION OF MULTIPATHING

The IB architecture [IBAspec] enables a high-speed switched fabric with HW support for RDMA [gerstenberger2013enabling, infiniband2014rocev2] and atomic operations [schweizer2015evaluating]. IB provides lossless destination-based packet forwarding that relies on link-level, credit-based flow control [Dally:2003:PPI:995703]. We now discuss the used IB features.

An IB network usually forms a single subnet consisting of physical IB switches and Host Channel Adapters (HCAs) that correspond to Ethernet NICs. All communication up to and including the transport layer is implemented within these two components.

Routing configuration is managed by a centralized subnet manager (SM). The SM configures connected IB devices, appropriately computes the forwarding tables to implement the used destination-based routing algorithm, and monitors the network for failures. Within an IB subnet, each HCA and each switch receive a unique local identifier (LID), assigned by the SM.

Each physical IB port has several independent virtual lanes (VLs). Each VL has its own receive and transmit buffers and flow control resources. There can be up to 15 VLs per physical port (depending on the equipment) and 1 VL for management traffic. Multiple VLs per port are used for deadlock freedom and to eliminate head-of-line blocking [Dally:2003:PPI:995703] (we discuss deadlocks in more detail in § 5.2).

Each switch provides a forwarding table called the Linear Forwarding Table (LFT) that – for a given packet – determines the outgoing port using the destination address (DLID) from the packet header. Then, for a given outgoing port, to determine the outgoing VL for a given packet, the switch uses a four-bit Service Level (SL) field from the packet header, in combination with the incoming and outgoing packet ports, to index into the SL-to-VL table. This enables packets to change virtual lanes at each hop and it allows for seamless utilization of switches with potentially different numbers of virtual lanes.

5.1 Routing

OpenSM, our choice of IB compliant SM, provides complete subnet information, including a list containing all nodes (switches, HCAs, routers) and ports, as well as the connections between them. We use this information to create and populate forwarding tables so that they implement the prescribed layered routing.

Multipathing. In ECMP, each router stores multiple possible next-hops that each lie on a minimal path towards the destination. This approach of storing multiple next-hops for a given destination is not possible in IB. However, it can be emulated by assigning multiple LIDs to each HCA, a feature that we use to enable multipathing and to implement our layered routing in an IB setting. An HCA can receive a contiguous range of LID addresses. This range is determined by the so-called LID Mask Control (LMC) value. Specifically, for an LMC equal to $x$, each HCA port hosts a consecutive range of $2^x$ LIDs. Then, one routes towards each such LID using a different path. We use the information provided by OpenSM to appropriately populate forwarding tables so that they implement the layered routing described in § 4.
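The relation between the LMC and the number of usable layers can be sketched as follows. The sketch assumes that the LMC is a 3-bit value (0 to 7) and that the base LID of a port is aligned to a multiple of $2^x$, since the LMC masks the low-order LID bits; the function names are illustrative and do not correspond to OpenSM APIs.

```python
MAX_LMC = 7  # assumption: the LMC is a 3-bit field, so at most 2**7 LIDs per port


def min_lmc_for_layers(num_layers: int) -> int:
    """Smallest LMC x such that 2**x LIDs suffice for one LID per layer."""
    if num_layers < 1:
        raise ValueError("at least one layer is required")
    lmc = (num_layers - 1).bit_length()  # integer ceil(log2(num_layers))
    if lmc > MAX_LMC:
        raise ValueError("too many layers for a single LID range")
    return lmc


def lids_of_port(base_lid: int, lmc: int) -> list:
    """The consecutive range of 2**lmc LIDs hosted by one HCA port."""
    if not 0 <= lmc <= MAX_LMC:
        raise ValueError("LMC out of range")
    if base_lid % (1 << lmc) != 0:
        # assumption: base LIDs are aligned because the LMC masks low-order bits
        raise ValueError("base LID is not aligned to 2**lmc")
    return list(range(base_lid, base_lid + (1 << lmc)))


lmc = min_lmc_for_layers(4)
print(lmc, lids_of_port(8, lmc))
```

With four layers, for example, an LMC of 2 is sufficient, and a port with base LID 8 hosts LIDs 8 through 11, one per layer.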

Implementation of Layers. We assign multiple addresses to each node; one address falls into one layer (each layer gets one address from each node). Hence, a layer is physically formed by the assigned addresses and the associated forwarding entries that route traffic to these addresses. The forwarding entries are set according to the specification of layers in the initialization phase. Our scheme for constructing layers provides a data structure $port$, which specifies the output port to be used for a packet traveling to a node $d$, from a switch $s$, within a layer $l$; this output port is denoted with $port[l][s][d]$.

Routing Within Layers. The number of layers equals the number of addresses assigned to each node. Thus, we can treat the layer ID as the offset to the base (i.e., to the first) LID of each node. Hence, for instance, routing in the first layer (ID 0) uses the base LID of each node, whereas routing in the second layer uses the base LID plus offset 1.

Populating Forwarding Tables. To populate forwarding entries, we add a value $port[l][s][d]$ into the LFT of switch $s$, as the outgoing port number for packets being routed towards node $d$. As the destination address, we use the base LID of the node, increased by the offset $l$, to ensure routing within layer $l$. As the last step, we run a deadlock-resolution scheme that fills all SL-to-VL tables, eliminating the risk of deadlocks (cf. § 5.2).
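The mapping from the $port$ structure to per-switch LFTs can be sketched in a few lines. The sketch models an LFT as a plain dictionary from destination LID to output port and uses hypothetical switch and node names; it does not use OpenSM data structures.

```python
def populate_lfts(port, base_lid):
    """Build one LFT per switch: LFT[s][base_lid[d] + l] = port[l][s][d].

    port     -- list indexed by layer l of {switch s: {destination d: output port}}
    base_lid -- {destination d: base LID of d}
    """
    lfts = {}
    for layer_id, per_switch in enumerate(port):
        for switch, per_destination in per_switch.items():
            table = lfts.setdefault(switch, {})
            for destination, out_port in per_destination.items():
                # the layer ID is the offset to the destination's base LID
                table[base_lid[destination] + layer_id] = out_port
    return lfts


# Hypothetical two-layer example with a single switch "s0" and one destination "h1".
port = [
    {"s0": {"h1": 5}},  # layer 0: minimal path leaves through port 5
    {"s0": {"h1": 8}},  # layer 1: non-minimal path leaves through port 8
]
base_lid = {"h1": 16}
lfts = populate_lfts(port, base_lid)
print(lfts)
```

A sender selects a layer simply by addressing the packet to the base LID plus the layer ID; the switches then forward it with ordinary destination-based lookups.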

5.2 Deadlock-Freedom

One downside of IB’s credit-based flow control ensuring losslessness is the possibility of deadlocks. Specifically, an IB network may enter a state in which packets in different buffers wait indefinitely for each other to free the buffers, resulting in a deadlock. To overcome this, most routing schemes use different VLs to send packets [nue_routing, domke-deadlock-2011, schneider2016ensuring, Shim_2009, skeie2004lash, skeie2002layered]. By splitting a single port buffer into multiple independent logical VLs, one can break dependencies between waiting packets.
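One standard way to reason about such cyclic waiting is a channel dependency graph: a path that uses link $c_1$ directly followed by link $c_2$ makes the buffer of $c_1$ depend on that of $c_2$, and a cycle of such dependencies indicates a possible deadlock. The sketch below illustrates this general technique; it is not the scheme implemented in this work, and the names are illustrative.

```python
def channel_dependencies(paths):
    """Map each directed link to the set of links that directly follow it on some path."""
    deps = {}
    for path in paths:
        channels = list(zip(path, path[1:]))
        for channel in channels:
            deps.setdefault(channel, set())
        for c1, c2 in zip(channels, channels[1:]):
            deps[c1].add(c2)
    return deps


def has_cycle(deps):
    """Iterative depth-first search for a cycle in the dependency graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}
    for start in deps:
        if color[start] != WHITE:
            continue
        color[start] = GREY
        stack = [(start, iter(deps[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GREY:  # back edge: cyclic dependency
                    return True
                if color[nxt] == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(deps[nxt])))
                    break
            else:
                color[node] = BLACK
                stack.pop()
    return False


# Four switches in a ring, every route going two hops clockwise.
ring_paths = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
print(has_cycle(channel_dependencies(ring_paths)))      # all routes on one VL
print(has_cycle(channel_dependencies(ring_paths[:2])),  # routes split over two VLs
      has_cycle(channel_dependencies(ring_paths[2:])))
```

In the example, the four routes form a dependency cycle when they share one VL, while placing two of them on a second VL leaves each VL with an acyclic dependency graph.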

In FatPaths, each layer is acyclic, to ensure no deadlocks within each layer. However, this does not imply global deadlock-freedom on IB because of its lossless design based on channels. Specifically, one has to ensure that dependencies between packets using routes from different layers cannot form a deadlock either. Thus, we change the FatPaths approach by decoupling deadlock-avoidance from layer creation. Instead, we apply deadlock-removal after the layers are created. This also enables much more throughput because acyclic layers vastly restrict the choice of paths to be taken.

In our IB implementation, we propose and enable the use of two different deadlock-avoidance schemes. Firstly, if a sufficient number of VLs is available, we use the scheme introduced with the Deadlock-Free Single Source Shortest-Path (DFSSSP) [domke-deadlock-2011] algorithm, which is already integrated in IB. Intuitively, given a ready routing (i.e., the populated forwarding tables), DFSSSP first finds all dependencies that could lead to a deadlock, and then it iteratively accommodates these dependencies in a deadlock-free way, by assigning selected routes to use yet unoccupied VLs. If not enough VLs are available, the algorithm fails. If not all VLs are exhausted, DFSSSP additionally balances the number of paths using each VL, for more throughput.
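The idea of resolving deadlocks by moving routes onto unoccupied VLs can be illustrated with a channel dependency graph (CDG). The sketch below is a simplified greedy variant written for illustration only; it is not the DFSSSP algorithm as integrated in IB, it performs no VL balancing, and all function names are hypothetical. A path is given as a list of switches, and each pair of consecutive directed links on a path forms one dependency.

```python
def has_cycle(edges):
    """Return True if the directed graph {u: set(v)} contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(u):
        color[u] = GRAY
        for v in edges.get(u, ()):
            state = color.get(v, WHITE)
            if state == GRAY or (state == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False

    return any(color.get(u, WHITE) == WHITE and visit(u) for u in list(edges))


def channel_dependencies(path):
    """Dependencies between consecutive directed links (channels) of a path."""
    channels = list(zip(path, path[1:]))
    return list(zip(channels, channels[1:]))


def assign_vls(paths, num_vls):
    """Place each path on the first VL whose CDG stays acyclic.

    Raises RuntimeError if the available VLs do not suffice.
    """
    cdgs = [dict() for _ in range(num_vls)]
    assignment = []
    for path in paths:
        deps = channel_dependencies(path)
        for vl, cdg in enumerate(cdgs):
            trial = {c: set(succ) for c, succ in cdg.items()}
            for a, b in deps:
                trial.setdefault(a, set()).add(b)
            if not has_cycle(trial):
                cdgs[vl] = trial
                assignment.append(vl)
                break
        else:
            raise RuntimeError("not enough VLs to break all cycles")
    return assignment


# Four 2-hop paths around a ring of four switches form a cyclic dependency.
ring_paths = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
print(assign_vls(ring_paths, num_vls=2))  # -> [0, 0, 0, 1]
```

With a single VL, the same input raises an error, which mirrors the failure case when not enough VLs are available.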

By increasing the number of layers used, the total number of unique paths between node pairs increases, resulting in a higher number of virtual lanes (VLs) required to resolve deadlocks using the DFSSSP scheme. To maximize the number of supported layers, we propose a novel deadlock avoidance scheme based on Duato’s approach [Duato:2002:INE:572400] that is agnostic to the number of layers and tailored for IB deployments that rely exclusively on paths of length $\leq 3$, such as those based on SF with our multipath routing method. The proposed algorithm ensures that the first, second, and third inter-switch hop of any path connecting two nodes use disjoint subsets of VLs. To achieve this, at least three VLs need to be available, and switches, for a given packet, must be able to identify their respective positions on the path using only the packet’s SL, incoming and outgoing port.

To illustrate the algorithm’s functionality, we consider each case individually. The first case, which involves paths of length 1 ($sw_1 - sw_2$), can be solved trivially since $sw_1$ can determine that it is the first hop along the path by checking whether the incoming packet port is connected to an endpoint. This information can then be encoded easily in the SL-to-VL table.

The strategy to address the second case, paths of length 2 ($sw_1 - sw_2 - sw_3$), is the same as the one for case three; therefore, we only present it once. In the third and final case, paths of length 3 ($sw_1 - sw_2 - sw_3 - sw_4$), we treat $sw_1$ as in case one but use a different approach to differentiate between $sw_2$ and $sw_3$. We establish a proper coloring of switches, using at most as many colors as there are available SLs. This color assignment is then mapped to SLs, ensuring each switch has a unique color and SL among its neighbours. By setting the SL of a packet routed along a path of length 2 or 3 to the SL assigned to the second switch ($sw_2$) along that path, it is guaranteed that the packet’s assigned SL matches the SL of the second hop but not the SL of the third. Subsequently, if a switch is neither the first nor last hop on a path – a condition trivially determined through the incoming and outgoing packet ports – then the switch’s position along the path can be ascertained by whether the incoming packet’s assigned SL matches the SL assigned to the switch. Specifically, if the SLs match, then the given switch must be the second hop; if they don’t, then it must be the third. Thus, we can differentiate the second hop from a potential third hop and select the appropriate subset of VLs at each hop accordingly.

If fewer than 3 VLs are available or no proper coloring using the available SLs can be established, the algorithm fails. Similar to the DFSSSP scheme, the disjoint VL subsets can be chosen to balance the number of paths crossing each VL.
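The hop-position test described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the deployed code: the greedy coloring, the helper names (`greedy_coloring`, `hop_position`, `hops_along`), and the one-to-one mapping of colors to SL numbers are hypothetical, and each VL subset is reduced to a single VL.

```python
def greedy_coloring(adj):
    """Proper coloring: adjacent switches never share a color."""
    color = {}
    for s in sorted(adj):
        used = {color[n] for n in adj[s] if n in color}
        color[s] = next(c for c in range(len(adj) + 1) if c not in used)
    return color


def hop_position(switch, sl, from_endpoint, to_endpoint, sl_of):
    """Classify the inter-switch hop that `switch` is about to forward on.

    Returns 1, 2, or 3 for the first, second, or third inter-switch hop,
    or None if the packet leaves towards an endpoint (no further hop).
    """
    if to_endpoint:
        return None
    if from_endpoint:
        return 1
    # Matching SL: this switch is sw_2; otherwise it must be sw_3.
    return 2 if sl == sl_of[switch] else 3


def hops_along(path, sl_of):
    """Trace the classification made by every switch on a path."""
    # The packet carries the SL of the second switch on its path.
    sl = sl_of[path[1]] if len(path) > 1 else 0
    last = len(path) - 1
    return [hop_position(s, sl, i == 0, i == last, sl_of)
            for i, s in enumerate(path)]


# Toy topology: a chain of four switches A - B - C - D.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
sl_of = greedy_coloring(adj)           # colors used directly as SLs
VL_OF_HOP = {1: 0, 2: 1, 3: 2}         # one VL per hop position

print(hops_along(["A", "B", "C", "D"], sl_of))  # -> [1, 2, 3, None]
```

Because the coloring is proper, $sw_3$ never shares the SL of its neighbour $sw_2$, so the comparison in `hop_position` cannot confuse the second and third hop.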

5.3 Load Balancing

For load balancing, we rely on the respective protocol higher up in the stack to choose a layer out of the set of possible ones available for a given destination. In our case, this is the Open MPI [gabriel2004open] implementation of the Message Passing Interface (MPI) standard [clarke1994mpi]. Open MPI serves as a communication library and directly interfaces with the IB networking API (Verbs). To optimize traffic flow, we utilize Open MPI’s default load balancing technique, which distributes traffic evenly across the available paths using a round-robin selection process. More advanced, adaptive schemes can seamlessly be used by changing the selection policy.
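A round-robin layer choice of this kind can be sketched as follows. The class below is a hypothetical stand-in that only illustrates the selection policy; it does not reproduce Open MPI internals, and the name `RoundRobinLayerSelector` is an assumption made for this example.

```python
class RoundRobinLayerSelector:
    """Cycle through the layers available for each destination."""

    def __init__(self, num_layers):
        self.num_layers = num_layers
        self._next_layer = {}  # per-destination position in the cycle

    def destination_lid(self, base_lid):
        """Return the LID to use for the next message to this destination."""
        layer = self._next_layer.get(base_lid, 0)
        self._next_layer[base_lid] = (layer + 1) % self.num_layers
        return base_lid + layer  # layer ID is the offset to the base LID


selector = RoundRobinLayerSelector(num_layers=4)
lids = [selector.destination_lid(8) for _ in range(5)]
print(lids)  # -> [8, 9, 10, 11, 8]
```

An adaptive scheme would replace only the body of `destination_lid`, for example by preferring the layer with the fewest outstanding messages.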

For fault tolerance, we rely on IB’s subnet manager. We stress that our routing can be seamlessly used with other transport schemes besides the ones used in the deployed cluster.

5.4 Path Diversity vs. Network Size

Increasing the number of different paths between each node pair requires more layers and thus also more addresses assigned to each node (i.e., a larger LMC value). However, using more addresses within one node decreases the maximum number of nodes that can be used in the network overall (because the address field size is fixed to 16 bits). We analyze this tradeoff in Tab. 2. We assume the largest SF network based on {36, 48, 64}-port switches that guarantees full global bandwidth. The results illustrate that one can use 4 layers without having to make any compromises on the network’s size for 36-port switches, but anything beyond 4 layers would reduce the maximum network size. At this point, the constraining factor is no longer the switch radix, but the address space. In § 6 and § 7, we show that – fortunately – our routing scheme’s performance is already quite substantial with just 4 layers and does not need more than 8 layers for high performance.

Table 2: Maximum number of switches and servers supported by a single-subnet, full global bandwidth, SF-based IB network, with “#A” = $2^{LMC}$ many addresses per node.

| #A | $N_r$ (36-port) | $N$ (36-port) | $k'$ (36-port) | $p$ (36-port) | $N_r$ (48-port) | $N$ (48-port) | $k'$ (48-port) | $p$ (48-port) | $N_r$ (64-port) | $N$ (64-port) | $k'$ (64-port) | $p$ (64-port) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 512 | 6144 | 24 | 12 | 882 | 14112 | 31 | 16 | 1568 | 32928 | 42 | 21 |
| 2 | 512 | 6144 | 24 | 12 | 882 | 14112 | 31 | 16 | 1250 | 23750 | 37 | 19 |
| 4 | 512 | 6144 | 24 | 12 | 800 | 12000 | 30 | 15 | 800 | 12000 | 30 | 15 |
| 8 | 450 | 5400 | 23 | 12 | 450 | 5400 | 23 | 12 | 450 | 5400 | 23 | 12 |
| 16 | 288 | 2592 | 18 | 9 | 288 | 2592 | 18 | 9 | 288 | 2592 | 18 | 9 |
| 32 | 162 | 1134 | 13 | 7 | 162 | 1134 | 13 | 7 | 162 | 1134 | 13 | 7 |
| 64 | 98 | 588 | 11 | 6 | 98 | 588 | 11 | 6 | 98 | 588 | 11 | 6 |
| 128 | 72 | 360 | 9 | 5 | 72 | 360 | 9 | 5 | 72 | 360 | 9 | 5 |
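The interplay between switch radix and address space in Tab. 2 can be approximated with a short back-of-the-envelope model. The sketch rests on assumptions that are not stated in the text: the standard Slim Fly relations $N_r = 2q^2$, $k' = (3q - \delta)/2$, and $p = \lceil k'/2 \rceil$; a parameter $q$ that is not restricted to prime powers (with $\delta = 0$ for every even $q$); a unicast LID range of 0x0001 to 0xBFFF; and one LID per switch plus #A LIDs per server. Under these assumptions the model reproduces the table rows, but it is not necessarily the procedure used to generate them.

```python
MAX_UNICAST_LID = 0xBFFF  # IB unicast LIDs span 0x0001..0xBFFF (49151 values)


def slim_fly(q):
    """Return (N_r, N, k', p) for Slim Fly parameter q."""
    # Assumption: delta follows q mod 4; even q is treated as delta = 0.
    delta = {0: 0, 1: 1, 2: 0, 3: -1}[q % 4]
    k_net = (3 * q - delta) // 2      # network radix k'
    p = (k_net + 1) // 2              # concentration, ceil(k'/2)
    n_r = 2 * q * q                   # number of switches
    return n_r, n_r * p, k_net, p


def max_network(switch_ports, addrs_per_node):
    """Largest configuration that fits both the switch radix and the LIDs."""
    best = None
    for q in range(2, 64):
        n_r, n, k_net, p = slim_fly(q)
        fits_radix = k_net + p <= switch_ports
        # Assumption: one LID per switch, addrs_per_node LIDs per server.
        fits_lids = n * addrs_per_node + n_r <= MAX_UNICAST_LID
        if fits_radix and fits_lids and (best is None or n > best[1]):
            best = (n_r, n, k_net, p)
    return best


for addrs in (1, 2, 4, 8, 16, 32, 64, 128):
    print(addrs, [max_network(ports, addrs) for ports in (36, 48, 64)])
```

For example, `max_network(64, 2)` yields `(1250, 23750, 37, 19)`: with two addresses per node, the 64-port configuration is already limited by the LID space rather than by the switch radix.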
Figure 6: Histograms of average path lengths and maximum path lengths across all layers for each switch pair.
Figure 7: Histograms (bin size = 20) of counts of paths crossing each individual link.
Figure 8: Histograms of counts of disjoint paths for different switch pairs.
Figure 9: Maximum achievable throughput for the adversarial traffic pattern under three different injection loads (fraction of communicating endpoint pairs).

6 THEORETICAL ANALYSIS

We conduct a theoretical analysis of the developed routing protocols using the deployed SF network as a case study. We focus on how well our routing uses the diversity of non-minimal paths, which is necessary for high performance [besta2020fatpaths].

Baselines and Parameters

We analyze our layered routing that minimizes path overlap (§ 4) and compare it to a simple random layer construction (RUES, Random Uniform Edge Selection) and to the state-of-the-art FatPaths scheme [besta2020fatpaths].

We vary different parameters, including the fraction $p$ of preserved links in a layer, which refers to the proportion of links from the network that are included in each layer for the RUES scheme (specifically, we consider $p = 40\%$, $p = 60\%$, and $p = 80\%$), and the number of layers used. We focus on the deployed SF with 50 switches, but the results generalize to larger sizes. Overall, we show that the proposed layered routing is superior to the state-of-the-art in crucial metrics: lengths, distribution, and diversity of used paths, and the achieved throughput.

6.1 Path Lengths

The first important metric for evaluating routing is the length of paths constructed using the proposed routing schemes. Specifically, when routing in SF, one wants to use the single available minimal path (with 1 or 2 hops, depending on picked switch pairs) and the “almost” minimal ones – with 3 hops – as indicated in the FatPaths study [besta2020fatpaths]. To analyze whether the considered routing ensures this, we compute the average and maximum lengths of the set of paths connecting each individual switch pair, as produced by the respective routing schemes. Fig. 6 shows the analysis results.

Our novel layered scheme outperforms all others, because it ensures that the highest fraction of switch pairs uses the “almost” minimal paths of length at most 3. The downside of RUES is that the more randomness is employed, the larger the maximum path length becomes. For a sampling factor $p = 80\%$, there is no switch pair with a path of length more than 4, whereas for $p = 40\%$ some switch pairs have paths of length greater than 8. This indicates large differences in path lengths in different layers for some switch pairs, even if the average path length is between 3 and 4. This can negatively impact load balancing efforts as it becomes more difficult to predict path latency. In FatPaths, in turn, large fractions of switch pairs use paths of length 2, which means that the corresponding links are likely to become congested.

Doubling the number of layers does not change the overall trends and it has mostly no effect on the average path length distributions. Only the maximum path lengths display a small shift to the right. This is because using more layers increases the probability of finding a longer path.

6.2 Path Distribution

We now count the total number of paths that cross each individual link, see Fig. 7. Our layered routing ensures a balanced scenario, i.e., close to equal utilization of each link. This corresponds to a “single bar”, i.e., the “tighter” the distribution the better balanced the paths are.

Similarly to the analysis on path length, less randomness leads to better results, which is expected because as layers become less dense, the links that are present will be more utilized. Hence, any link that by chance is included in more than an average number of layers will have a higher number of crossing paths and vice versa. FatPaths performs similarly to RUES for a sampling factor of $p = 80\%$. The distributions for 8 layers are slightly shifted to the right compared to 4 layers, as they have twice as many paths.
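Counting the paths that cross each link amounts to a simple tally over all routed paths. The sketch below assumes that a path is a list of switches and that a link is counted regardless of direction; both choices, as well as the function names and the toy input, are assumptions made for illustration.

```python
from collections import Counter


def link_loads(paths):
    """Count how many paths cross each (undirected) link."""
    loads = Counter()
    for path in paths:
        for u, v in zip(path, path[1:]):
            loads[frozenset((u, v))] += 1
    return loads


def histogram(loads, bin_size=20):
    """Group links by load, using bins of width `bin_size`."""
    return Counter(load // bin_size * bin_size for load in loads.values())


# Toy input: three paths over the switches 0, 1, and 2.
paths = [[0, 1, 2], [0, 1], [2, 1, 0]]
loads = link_loads(paths)
print(loads[frozenset((0, 1))], loads[frozenset((1, 2))])  # -> 3 2
```

A well-balanced routing concentrates all links in one or two neighbouring bins of `histogram(loads)`.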

6.3 Path Diversity

Two paths are disjoint if they do not share common links. In layered routing, we aim to maximize the number of such paths used by node pairs. Fig. 8 displays counts of disjoint paths between switch pairs. The FatPaths layer construction based on minimizing path overlap underperforms because of its acyclic layers. Moreover, unlike in previous analyses, more randomness (and thus sparser layers) leads to better results for RUES. For a sampling factor of $p = 40\%$ and 8 layers, $\approx 97.5\%$ of switch pairs have at least the 3 desired disjoint paths. This is the best performing algorithm out of the ones considered. However, this comes at the expense of disadvantageous path lengths and path distribution.

Our scheme does not need to make a similar trade-off because with 8 layers already around 88.5% of switch pairs have at least 3 disjoint paths, which we have verified to grow to almost 100% when scaling to the next higher configuration that uses 16 layers. At the same time, the lengths and path distributions over links are highly beneficial.
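One way to count disjoint paths for a switch pair is to search for the largest subset of its per-layer paths that are pairwise link-disjoint. The brute-force sketch below illustrates this interpretation; it is an assumption about how such counts can be obtained, not necessarily the method behind Fig. 8, and the function names are hypothetical.

```python
from itertools import combinations


def links(path):
    """Set of undirected links used by a path given as a list of switches."""
    return {frozenset(hop) for hop in zip(path, path[1:])}


def max_disjoint(paths):
    """Largest number of pairwise link-disjoint paths among `paths`.

    Exhaustive search, which is adequate for a handful of layers.
    """
    link_sets = [links(p) for p in paths]
    for size in range(len(link_sets), 0, -1):
        for subset in combinations(link_sets, size):
            if all(a.isdisjoint(b) for a, b in combinations(subset, 2)):
                return size
    return 0


# Toy input: four layer paths from switch 0 to switch 3.
layer_paths = [[0, 3], [0, 1, 3], [0, 1, 2, 3], [0, 2, 3]]
print(max_disjoint(layer_paths))  # -> 3
```

Here the path `[0, 1, 2, 3]` shares a link with two of the others, so only three of the four layers contribute disjoint paths.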

6.4 Maximum Achievable Throughput

We also analyze the maximum achievable throughput (MAT). MAT is defined as the maximum fraction of traffic demands from all endpoint pairs that can be accommodated simultaneously, while adhering to network and routing constraints. For example, a throughput of 1.5 denotes that the network can sustain 1.5 times the traffic demand of each communicating node pair simultaneously.

Here, we consider an adversarial traffic pattern, which maximizes stress on the interconnect by incorporating several large elephant flows between endpoints that are separated by more than one inter-switch hop, and combining these large flows with many small flows [jyothi2016measuring]. We use TopoBench [jyothi2016measuring], a throughput evaluation tool which relies on linear programming to compute MAT. The results are displayed in Fig. 9.
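The meaning of MAT can be illustrated with a deliberately simplified calculation. Unlike TopoBench, which solves a linear program, the sketch below fixes an equal split of every flow over its candidate paths and then scales all demands by a common factor until the most loaded link is saturated. The function name, the unit link capacity, and the toy flows are assumptions made for this example.

```python
from collections import defaultdict


def max_concurrent_throughput(flows, capacity=1.0):
    """Largest factor theta such that theta * demand fits on every link.

    flows: list of (demand, [paths]); each flow splits its demand equally
    over its paths, and each path is a list of switches. Links are directed.
    """
    load = defaultdict(float)
    for demand, paths in flows:
        share = demand / len(paths)
        for path in paths:
            for link in zip(path, path[1:]):
                load[link] += share
    return min(capacity / l for l in load.values())


# One flow spread over two link-disjoint paths: every link carries 0.5.
spread = [(1.0, [[0, 1], [0, 2, 1]])]
# Two flows forced onto the same single link: that link carries 2.0.
shared = [(1.0, [[0, 1]]), (1.0, [[0, 1]])]

print(max_concurrent_throughput(spread))  # -> 2.0
print(max_concurrent_throughput(shared))  # -> 0.5
```

The two toy cases show why path diversity matters for this metric: spreading a flow over disjoint paths doubles the sustainable demand, whereas path overlap halves it.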

Our algorithm outperforms FatPaths for different traffic intensities and layer counts. This is most important for a small number of layers, which is key for routing on IB hardware as using many layers reduces the supported network sizes (cf. Tab. 2). Our layered routing experiences diminishing returns beyond 16 layers. This is expected, as almost 100% of endpoint pairs have at least 3 disjoint paths for 16 layers (one needs at least that many disjoint paths to ensure high performance with non-minimal routing). Before diminishing returns set in, FatPaths requires $8\times$ as many layers to reach equivalent performance, making our design much more practical.

6.5 Insights & Takeaways - Theoretical Results

Our novel IB layered routing achieves superior performance in all considered path quality measures and especially in MAT. Around 60% of switch pairs have at least 3 disjoint non-minimal paths when using only 4 layers, which grows to 88.5% with 8 layers. Furthermore, we achieve the most balanced distribution of paths over the links in the network. FatPaths performs similarly in terms of average and maximum path lengths, but underperforms in the available number of disjoint paths per switch pair. For RUES, a sampling factor of $p = 60\%$ achieved the most balanced results across all metrics, but RUES performs much worse in comparison to FatPaths and our work overall.

Table 3: Workload Configurations.

| Workload | Configuration | # Nodes (N) | Scaling | Metric |
|---|---|---|---|---|
| Custom Alltoall | Message Sizes: 1 B $\to$ 4 MiB | 2, 4, 8, 16, 32, 64, 128, 200 | Weak | Bandwidth [MiB/s] |
| IMB Bcast/Allreduce [IMB-benchmarks] | Message Sizes: 1 B $\to$ 32 MiB | 2, 4, 8, 16, 32, 64, 128, 200 | Weak | Bandwidth [MiB/s] |
| eBB [hoefler2008switches] | Message Size: 128 MiB | 2, 4, 8, 16, 32, 64, 128, 200 | Strong | Bandwidth [MiB/s] |
| CoMD [comd] | $100^3$ Atoms per Process | 25, 50, 100, 200 | Weak | Time [s] |
| FFVC [ffvc] | $128^3$ Cuboid per Process for $\leq 64$ processes, else $64^3$ | 25, 50, 100, 200 | Weak | Time [s] |
| mVMC [mvmc] | Unmodified job_middle weak-scaling test | 25, 50, 100, 200 | Weak | Time [s] |
| MILC [milc-modeling, milc] | benchmark_n8 Input | 25, 50, 100, 200 | Weak | Time [s] |
| NTChem [ntchem] | taxol Model | 25, 50, 100, 200 | Strong | Time [s] |
| BFS16 [graph500, 7840705] | # Vertices: $2^{23}$, $2^{24}$, $2^{25}$, $2^{26}$; Avg. Degree: 16 | 25, 50, 100, 200 | Weak | Giga-Traversed Edges per Second [GTEPS] |
| BFS128 [graph500, 7840705] | # Vertices: $2^{23}$, $2^{24}$, $2^{25}$, $2^{26}$; Avg. Degree: 128 | 25, 50, 100, 200 | Weak | Giga-Traversed Edges per Second [GTEPS] |
| BFS1024 [graph500, 7840705] | # Vertices: $2^{23}$, $2^{24}$, $2^{25}$, $2^{26}$; Avg. Degree: 1024 | 25, 50, 100, 200 | Weak | Giga-Traversed Edges per Second [GTEPS] |
| HPL [hpl] | Matrix $A$ $\approx$ 1 GiB, 1 GiB, 1 GiB and 0.25 GiB per Process | 25, 50, 100, 200 | Weak | Giga-Floating point OP/s [GFLOPS] |
| ResNet152 [he2015deep, hoefler2022hamming] | Pure Data Parallelism | 40, 80, 120, 160, 200 | Weak | Iteration Time [s] |
| Cosmoflow [hoefler2022hamming, mathuriya2018cosmoflow] | Model Shards: 4; Data Shards: $\frac{\text{\# Nodes}}{4}$ | 40, 80, 120, 160, 200 | Weak | Iteration Time [s] |
| GPT-3 [brown2020language, hoefler2022hamming] | Pipeline Stages (layers): 10; Model Shards: 4; Data Shards: $\frac{\text{\# Nodes}}{40}$ | 40, 80, 120, 160, 200 | Weak | Iteration Time [s] |
(a) MPI Bcast - SF L vs. FT
(b) MPI Allreduce - SF L vs. FT
(c) Custom Alltoall - SF L vs. FT
(d) eBB - SF L vs. FT
Figure 10: Relative performance difference of SF (linear placement strategy) over FT for various Microbenchmarks; eBB performance of SF L in comparison to maximum bandwidth and FT performance (higher is better), including routing improvement of this work over DFSSSP (heatmap).
(a) MPI Bcast - SF R vs. FT
(b) MPI Allreduce - SF R vs. FT
(c) Custom Alltoall - SF R vs. FT
(d) eBB - SF R vs. FT
Figure 11: Relative performance difference of SF (random placement strategy) over FT for various Microbenchmarks; eBB performance of SF R in comparison to maximum bandwidth and FT performance (higher is better), including routing improvement of this work over DFSSSP (heatmap).
Figure 12: Runtime of scientific workloads (lower is better) - SF L vs. FT
Figure 13: Performance of HPC benchmarks (higher is better) - SF L vs. FT
Figure 14: Iteration time of DNN proxy workloads (lower is better) SF L vs. FT and routing improvement of this work over DFSSSP (heatmap) for SF L.

7 EVALUATION

We now illustrate the feasibility of our SF installation by evaluating a broad set of applications from numerous domains against a comparable FT installation.

7.1 2-Level Non-Blocking Fat-Tree

FT topologies have historically been the usual choice for large-scale computing systems, largely due to their predictable behavior and full-bandwidth capabilities, when configured in a non-blocking manner. However, their high cost often leads to oversubscribed deployments at the tree’s lowest level, reducing construction costs at the expense of bisection bandwidth.

To ensure a fair performance comparison with our SF installation, we construct a 2-level non-blocking FT, reusing the same hardware. The FT and SF both share the same network diameter and full-bandwidth capabilities. Our FT configuration employs 6 core and 12 leaf switches, compatible with our 36-port switches. Each leaf switch connects to each core switch through 3 links, and the remaining ports link to evenly distributed endpoints. This configuration supports up to 216 endpoints, making the FT marginally under-subscribed and thus strengthening the fairness of our comparison.
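The port budget of this FT configuration can be checked with a few lines of arithmetic. The variable names are hypothetical; the numbers are the ones given in the text (36-port switches, 6 core and 12 leaf switches, 3 links per leaf-core pair, 200 servers).

```python
RADIX = 36                # ports per switch
CORE, LEAF = 6, 12        # number of core and leaf switches
LINKS_PER_PAIR = 3        # parallel links between each leaf and each core
SERVERS = 200             # servers in the test cluster

uplinks_per_leaf = CORE * LINKS_PER_PAIR        # 6 * 3 = 18
downlinks_per_leaf = RADIX - uplinks_per_leaf   # 36 - 18 = 18
core_ports_used = LEAF * LINKS_PER_PAIR         # 12 * 3 = 36, all core ports
max_endpoints = LEAF * downlinks_per_leaf       # 12 * 18 = 216

# Non-blocking: every leaf has at least as many uplinks as downlinks.
non_blocking = uplinks_per_leaf >= downlinks_per_leaf

# Even distribution of the servers: 8 leaves host 17, 4 leaves host 16.
base, extra = divmod(SERVERS, LEAF)

print(max_endpoints, non_blocking, base, extra)  # -> 216 True 16 8
```

With at most 17 of the 18 downlinks occupied per leaf, each leaf switch has more uplink than downlink capacity in use, which is what makes the FT marginally under-subscribed.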

7.2 Workloads & Configurations

We utilize a significant subset of the benchmarks included in the TSUBAME2 HyperX (t2hx) benchmark suite [domke_hyperx_2019] and enhance them with a custom implementation of MPI_Alltoall (details on the performance improvements of the custom alltoall collective over the default can be found in the appendix, Sec. C.1), as well as three DNN proxies introduced by Hoefler et al. [hoefler2022hamming]. The configuration of each benchmark is provided in Tab. 3. Our analysis includes three classes of benchmarks:

Microbenchmarks We evaluate the system’s bandwidth using Intel MPI Benchmarks’ (IMB) measurements of the Allreduce and Bcast collectives [IMB-benchmarks], and a custom alltoall. We also assess the effective bisection bandwidth (eBB) of the system using Netgauge’s eBB benchmark [hoefler2008switches].

Scientific Application & HPC Benchmarks We evaluate a wide range of benchmarks, covering various scientific applications, all of which are listed in Tab. 3 and taken directly from the t2hx benchmark suite. We also analyze the performance of the High Performance Linpack (HPL) [hpl] benchmark and of the breadth-first search (BFS) [besta2017slimsell] in the Graph 500 Benchmark [graph500]. Additionally, we extend the BFS performance analysis by changing the average degree of the vertices (edgefactor), while scaling the number of vertices linearly with the number of participating compute nodes. Specifically, we consider edgefactors 16, 128, and 1024.

DNN Proxies The DNN proxies evaluated on SF include ResNet152 [he2015deep] (pure data parallelism), CosmoFlow [mathuriya2018cosmoflow] (data and operator parallelism) and GPT-3 [brown2020language] (data, operator, and pipeline parallelism), as outlined in Tab. 3. For GPT-3, each pipeline stage processes one DNN-layer.

7.3 Execution Environment

To ensure consistency and reproducibility, all benchmarks were compiled using GCC v4.8.5 and executed using OpenMPI v1.10.7. We use one MPI rank per node and assign one OpenMP thread per physical core on Socket 1 of the dual-socket system (pinning on Socket 2 introduces non-negligible slowdowns due to inter-socket communication).

We investigate two MPI rank placement strategies: linear and random. The linear strategy places rank $j$ on node $j$, a commonly used approach that enhances latency and traffic locality, especially for FTs [michelogiannakis2017aphid, slurm132]. This strategy also models a system with minimal fragmentation. In contrast, the random strategy represents systems with significant fragmentation. It randomizes rank placement to potentially reduce network bottlenecks on SF, albeit at the cost of increased latency. For FT, the linear placement significantly outperformed its random counterpart in all microbenchmarks and exhibited comparable performance in the remaining tests. Consequently, we report SF performance relative to the FT’s linear placement only.
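The two placement strategies differ mainly in how many switches a job touches. The sketch below assumes that the 200 nodes are numbered consecutively with 4 nodes per switch (node $j$ attached to switch $\lfloor j/4 \rfloor$) and that the random strategy draws nodes uniformly from the whole cluster; the function names, the seed, and this numbering are assumptions made for illustration.

```python
import random

TOTAL_NODES = 200
NODES_PER_SWITCH = 4  # 200 nodes on 50 switches


def linear_placement(num_ranks):
    """Rank j runs on node j."""
    return list(range(num_ranks))


def random_placement(num_ranks, seed=0):
    """Ranks run on distinct nodes drawn uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(TOTAL_NODES), num_ranks)


def switches_used(placement):
    """Number of distinct switches that host at least one rank."""
    return len({node // NODES_PER_SWITCH for node in placement})


for ranks in (8, 16, 32):
    print(ranks,
          switches_used(linear_placement(ranks)),
          switches_used(random_placement(ranks)))
```

Under linear placement, 8, 16, and 32 ranks occupy exactly 2, 4, and 8 switches, whereas random placement spreads the same ranks over at least as many switches and typically many more.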

Each benchmark configuration is executed five times; microbenchmarks are executed for at least 100 iterations. We assess all SF benchmarks using our new multipath routing algorithm based on both minimal and almost minimal paths, as well as the de facto standard multipath routing algorithm in IB (DFSSSP), which leverages minimal paths only [domke-hoefler-dfsssp]. We instantiate each routing algorithm once with 1, 2, 4, and 8 layers, respectively, but only report the results of the best-performing variant for each benchmark configuration. For all FT benchmarks we choose the commonly used ftree routing [Jacobs2010DModKRP]. Mean and standard deviation of the results are reported, with the latter indicated using red error bars for all bar plots. Relative performance differences of SF over FT are annotated above each bar. Any significant performance gains or losses of our novel routing algorithm in comparison to DFSSSP for any benchmark are either explicitly stated in the text or visualized using heatmaps.

In the main text, we present comprehensive results for SF using the linear placement strategy, and include only microbenchmark results for the random placement strategy due to space considerations. Detailed results of the random strategy for other benchmarks, which largely mirror those obtained with the linear strategy, are in Appendix C.

7.4 Microbenchmarks

Fig. 10(a)–10(c) illustrate the relative performance differences of SF with linear placement over FT and Fig. 11(a)–11(c) of SF with random placement over FT for MPI collectives bcast, allreduce, and custom alltoall.

Generally, SF’s performance using the linear placement strategy closely matches that of the FT, with FT only displaying minor advantages in bcast and allreduce for 8 and 16 node configurations at smaller, latency-sensitive message sizes. This marginal edge of FT in specific configurations is due to its architecture, wherein leaf switches connect to at least 16 nodes, facilitating localized communication with zero inter-switch hops, thus minimizing latency. While SF, under linear placement, enjoys the benefits of zero inter-switch hops mostly for configurations of up to 4 nodes – owing to its design of connecting 4 nodes per switch – random placement generally does not benefit from this localized communication advantage. As a result, SF experiences marginally lower performance in comparison to FT for these latency-sensitive scenarios with the random placement strategy.

In contrast, for the communication-intensive alltoall collective, SF’s performance closely mirrors, or even slightly surpasses, that of the FT for small message sizes when employing the linear placement strategy (cf. Fig. 10(c)). However, in 8, 16, and 32 node configurations, particularly with bandwidth-critical message sizes, SF lags due to congestion caused by all inter-switch communication occurring between 2, 4, or 8 switches, respectively. This leads to traffic bottlenecks on the often single shortest path between these switches. While our new routing scheme, as discussed in § 6, theoretically mitigates this congestion, the absence of adaptive load balancing limits practical improvements to at most 7% over DFSSSP.

Switching to the random placement strategy markedly improves SF’s performance for the alltoall collective, as shown in Fig. 11(c). This strategy not only overcomes the noted bottlenecks but also enables SF to significantly outperform FT. This improvement results from the random placement strategy’s enhanced traffic distribution across the network, showcasing the trade-off between increased latency for smaller message sizes and superior traffic balancing within the SF topology. These findings imply that the integration of adaptive load balancing with our routing scheme could effectively address the congestion issues identified with linear placement, underscoring the potential of our routing scheme to optimize network performance for demanding communication patterns.

Lastly, in Fig. 10(d) and Fig. 11(d), we present the eBB across various node counts for the linear and random placement strategy, respectively. At maximum node count we achieve approximately half of the injection bandwidth, equating to 75% of the theoretical bisection bandwidth optimum [besta2014slim], with both strategies. Though the FT matches SF’s full-system eBB, it outperforms SF with linear placement for the 8, 16, and 32 nodes configurations. This discrepancy mirrors the observations for the alltoall collective and is similarly overcome with the random placement strategy (cf. Fig. 11(d)).

In the right section of both Fig. 10(d) and Fig. 11(d), heatmaps display the performance gains of our new routing scheme over DFSSSP for the eBB benchmark. Notably, for the linear placement strategy, improvements of up to 28% are observed for the earlier described node configurations, which are especially prone to congestion. Under the random placement strategy, the level of improvement is less significant, with only up to 7%, suggesting that this strategy’s primary advantage lies in its ability to distribute traffic more evenly, even in the absence of adaptive load balancing.

7.5 Scientific Workloads & HPC Benchmarks

In Fig. 12, we present the runtime and relative performance of the solver/kernel for each of the scientific workloads on SF, using the linear placement strategy. The scaling behavior of each workload, based on their configurations detailed in Tab. 3, is evident. Notably, the drop in runtime for FFVC when scaling from 50 to 100 nodes is due to the decrease in the workload’s problem size when running on $> 64$ nodes. Utilizing almost minimal paths in combination with minimal paths does not generate any significant speedup for these workloads over pure minimal routing (DFSSSP), and generally results in only small runtime variances of $< 1\%$. This is due to the communication time only constituting a small fraction of the overall runtime for these scientific workloads [domke_hyperx_2019, 10.1007/978-3-319-58667-0_12].

Fig. 13 shows the performance of the HPC benchmarks, which display similar weak-scaling behavior as the scientific workloads. HPL exhibits almost linear scaling performance when increasing the number of nodes from 25 to 50 or 100 nodes, indicating that the overhead introduced by the increased amount of communication is negligible. Consistent with these results, introducing almost minimal paths to the routing impacts performance by less than 1% for the HPL benchmark. The only exception is the 200-node setting, where the decrease in the problem size (per node) is likely the main cause for the deviation from the linear scaling observed.

In the case of the Graph 500 - BFS benchmark, we experienced high variance with the default implementation. To mitigate this, we fixed the seed for the graph generation and used the same source vertex for each BFS run. The BFS scaling results show more fluctuations in comparison to the HPL results, particularly for the sparser variant. This is accompanied by greater variability in speedup through almost minimal paths, which ranged from -5% to +1%. It is not clear whether this can be attributed purely to network communication or to other factors such as caching effects and system noise.

Overall, our experiments show SF competes effectively with FT in terms of performance, while being very effective for scaling both scientific workloads and HPC benchmarks, even when limited to minimal paths.

7.6 Deep Learning Workloads

The left part of Fig. 14 shows the runtime and relative performance of the DNN proxies when linearly increasing the number of nodes from 40 to 200. ResNet152 with pure data parallelism only requires allreduce for gradient aggregation. CosmoFlow with a hybrid of data and operator parallelism requires allgather, reduce-scatter, allreduce, and point-to-point communications. GPT-3 with a hybrid of data, operator, and pipeline parallelism requires allreduce and point-to-point communications. As we increase the data shards proportionally to the number of nodes, the scalability is mainly determined by allreduce across the data dimension.

We find that CosmoFlow’s runtime on SF is comparable to that on FT. In contrast, GPT-3 notably performs better on SF for configurations with 160 and 200 nodes, while ResNet152 begins to lag as the node count increases. Although both GPT-3 and ResNet152 predominantly rely on allreduces at higher node counts, their diverging performance trends can be attributed to differences in message sizes; GPT-3 handles significantly larger messages than ResNet152. Expectedly, the performance trend of GPT-3 matches the trend of MPI Allreduce for the high node count configurations (cf. Fig. 10(b)).

The right part of Fig. 14 shows that our work generally outperforms DFSSSP for GPT-3, with up to 24% improvements.

7.7 Insights & Takeaways - Empirical Results

When analyzing communication-intensive workloads on configurations with 8, 16, or 32 nodes, we identified some congestion challenges. These challenges stemmed from the non-adaptive nature of the path selection. However, by employing a random placement strategy, these issues were effectively counteracted. Our findings subsequently indicate that SF consistently achieves performance on par with, or even surpassing, the well-established FT topology, particularly under conditions of full-system utilization. Additionally, SF displays effective scaling capabilities across a diverse range of workloads. In comparison to the established DFSSSP, our novel routing approach exhibited promising performance, registering improvements of over 20%.

7.8 Scalability & Cost Analysis

Table 4: Maximal scalability and costs of SF deployments compared to non-blocking FT2, FT2 oversubscribed by 3 (FT2-B), FT3, and 2-D HyperX (HX2) under given port constraints. For the fixed-size cluster, we use 64-port switches for FT2 and FT2-B, 40-port switches for HX2, and 36-port switches for SF and FT3.
                36-port switches                   | 40-port switches                   | 64-port switches                   | 2048-node clusters
                   FT2  FT2-B    FT3    HX2     SF |    FT2  FT2-B    FT3    HX2     SF |    FT2  FT2-B    FT3    HX2     SF |    FT2  FT2-B    FT3    HX2     SF
Endpoints          648    972  11664   2028   6144 |    800   1200  16000   2744   7514 |   2048   3072  65536  10648  32928 |   2048   2048   2048   2197   2178
Switches            54     45   1620    169    512 |     60     50   2000    196    578 |     96     80   5120    484   1568 |     96     59    303    169    242
Links              648    324  23328   2028   6144 |    800    400  32000   2548   7225 |   2048   1024 131072  10164  32928 |   2048    344   4320   2028   2057
Costs [M$]         1.5    1.1     45    4.5   13.8 |    2.4    1.7   84.2    7.8   22.4 |      9    7.2    491   45.5    146 |    7.5    2.7    8.3    6.4    5.8
Cost/Endp [k$]     2.2    1.2    3.8    2.2    2.2 |      3    1.5    5.2    2.8    2.9 |    4.4    2.3    7.5    4.3    4.4 |    3.6    1.3      4    3.1    2.8

FT topologies are the preferred choice for large-scale HPC deployments due to their adaptability, adjustable bisection bandwidth, established routing, and isolation advantages. These properties often benefit application performance consistency [stunkel2020high, varrette2022aggregating, bhatele2019analyzing]. However, their low-diameter configurations do not scale as well as contemporary topologies [kathareios2015cost].

We compare the scalability and deployment cost of 2-level FTs (FT2), 3-level FTs (FT3), 2-D HyperX (HX2) [domke_hyperx_2019, Ahn:2009:HTR:1654059.1654101], and SF. Our evaluation, summarized in Tab. 4, includes both the non-blocking FT2 variant and its 3:1 oversubscribed version (FT2-B). The pricing details are in Appendix D.

Scalability

We show that SF networks offer a distinct advantage in scalability by evaluating the maximum network size for hardware setups with 36-, 40-, and 64-port switches. SF can accommodate approximately 10, 6, and 3 times more endpoints than FT2, FT2-B, and HX2, respectively, while maintaining a lower or comparable cost-to-endpoint ratio and the same network diameter of 2. FT3 can accommodate more endpoints than SF; however, this comes at a significantly larger (around 1.75x) cost-to-endpoint ratio and an increased network diameter, which affects the performance of latency-critical applications. This makes SF a compelling choice for large-scale diameter-2 deployments.
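The endpoint ratios quoted above can be recomputed directly from Tab. 4. The following is a minimal sketch; the dictionary layout and the function name `sf_endpoint_ratio` are illustrative choices and not part of our tooling, while all numbers are copied from the table.

```python
# Endpoint counts and cost per endpoint [k$] copied from Tab. 4
# (maximum network size for each switch radix).
ENDPOINTS = {
    36: {"FT2": 648, "FT2-B": 972, "FT3": 11664, "HX2": 2028, "SF": 6144},
    40: {"FT2": 800, "FT2-B": 1200, "FT3": 16000, "HX2": 2744, "SF": 7514},
    64: {"FT2": 2048, "FT2-B": 3072, "FT3": 65536, "HX2": 10648, "SF": 32928},
}
COST_PER_ENDPOINT = {
    36: {"FT3": 3.8, "SF": 2.2},
    40: {"FT3": 5.2, "SF": 2.9},
    64: {"FT3": 7.5, "SF": 4.4},
}


def sf_endpoint_ratio(ports, other):
    """How many times more endpoints SF supports than `other` at this radix."""
    row = ENDPOINTS[ports]
    return row["SF"] / row[other]


def ft3_cost_ratio(ports):
    """Cost-per-endpoint of FT3 relative to SF at this radix."""
    row = COST_PER_ENDPOINT[ports]
    return row["FT3"] / row["SF"]


if __name__ == "__main__":
    for ports in ENDPOINTS:
        ratios = {t: round(sf_endpoint_ratio(ports, t), 1) for t in ("FT2", "FT2-B", "HX2")}
        print(ports, ratios, "FT3/SF cost per endpoint:", round(ft3_cost_ratio(ports), 2))
```

For the 36- and 40-port columns, the ratios come out close to 10, 6, and 3; the 64-port column yields larger ratios over FT2 and FT2-B.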

Cost

When the number of endpoints is predetermined, SF’s requirement for switches with fewer ports can reduce overall deployment costs, while keeping benchmark performance comparable to FT2, as shown in § 7. Tab. 4 further shows an example in which the cluster requirement is fixed to 2048 endpoints. Realising such a cluster using SF instead of FT2, HX2, or FT3 results in absolute cost savings of $1.7M, $0.6M, and $2.5M, respectively. While using FT2-B might be cheaper in this scenario, it does not provide the full-bandwidth property of SF, FT2, HX2, and FT3.
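The savings for the fixed-size cluster follow from the "Costs [M$]" row of the last column group in Tab. 4. A small sketch of this arithmetic, with the illustrative helper name `saving_vs`:

```python
# Deployment costs in million USD for the ~2048-endpoint cluster (Tab. 4).
COST_MUSD = {"FT2": 7.5, "FT2-B": 2.7, "FT3": 8.3, "HX2": 6.4, "SF": 5.8}


def saving_vs(topology):
    """Absolute saving (M$) of SF over `topology`; negative if SF costs more."""
    return round(COST_MUSD[topology] - COST_MUSD["SF"], 1)


if __name__ == "__main__":
    for topology in ("FT2", "HX2", "FT3", "FT2-B"):
        print(topology, saving_vs(topology))
```

The negative value for FT2-B reflects that the oversubscribed variant is cheaper, at the price of giving up full bandwidth.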

8 RELATED WORK

Our work touches on different areas. We now outline related works, excluding the ones covered in past sections.

Network Topologies

Several recent networks build upon SF. This includes Megafly [flajslik2018megafly], Bundlefly [bundlefly_2020], Galaxyfly [lei2016galaxyfly], and Xpander [valadarsky2015]. Yet, they do not provide diameter-2 and thus none of them are competitive with SF in latency, cost, or power consumption, as observed by recent results [besta2020fatpaths]. Although PolarFly has shown promising results in recent studies, its advantages over SF can be attributed to the diligent design of routing protocols that leverage its structure to guarantee optimal routing decisions [lakhotia2022polarfly, lakhotia2023innetwork]. Some recent designs based on similar principles target on-chip networks only [besta2018slim, iff2022sparse].

Physical Interconnect Installations

The majority of works on interconnects use simulations for evaluation [besta2014slim, dally08, dally07, valadarsky2015, flajslik2018megafly, bundlefly_2020, lei2016galaxyfly, singla2012jellyfish, ahn2009hyperx, DBLP:conf/isca/KoibuchiMAHC12, besta2021towards]. However, some topologies have been evaluated with real installations. This includes – for example – HyperX [domke_hyperx_2019] and Dragonfly [aries]. Here, we offer the first real evaluation of Slim Fly.

Congestion Control & Load Balancing

In general, we do not focus on transport protocols (flow, congestion). Here, we rely on mechanisms from the FatPaths [besta2020fatpaths] architecture. In layered routing, traffic is balanced across layers. We use simple randomized and round-robin schemes, which results in high performance. Other schemes could also be incorporated, including load balancing based on flows [hopps2000analysis, curtis2011mahout, rasley2014planck, sen2013localflow, tso2013longer, benson2011microte, zhou2014wcmp, al2010hedera, kabbani2014flowbender], flowcells [he2015presto], flowlets [katta2016clove, alizadeh2014conga, vanini2017letflow, katta2016hula, kandula2007dynamic], and single packets [zats2012detail, handley2017re, dixit2013impact, cao2013per, perry2015fastpass, raiciu2011improving].

9 CONCLUSION

Slim Fly (SF) is the first network topology that lowered cost and improved performance by reducing the network diameter to two, promising significant improvement over established interconnects. However, it has not yet been tested in practice. We address this by deploying the first at-scale SF installation and establishing and implementing open-source routines for cabling and physical layout, to guide future deployments and effectively verify cabling. This can foster the adoption of SFs in broad industry and facilitate practical deployments of other low-diameter topologies, including the most recent ones, such as PolarFly or Bundlefly.

We further introduce a novel high-performance routing scheme that improves upon state of the art, achieving up to 24% speedup for the evaluated deep neural network (DNN) workloads over the standard IB multipath routing algorithm (DFSSSP) through non-minimal paths.

We use the first practical, real-world deployment of SF to demonstrate the topology’s ability to scalably process a wide selection of modern workloads such as distributed DNN training, graph analytics, or linear algebra kernels. It consistently matches or surpasses the performance of a comparable non-blocking Fat Tree (FT) deployment for a wide selection of workloads, for example, achieving a 66% speedup for distributed deep neural network training. Importantly, SF simultaneously delivers superior scalability. For example, it enables connecting between 3× and 10× the number of servers compared to other diameter-2 topologies like 2-level FT and 2-D HyperX, while maintaining both a comparable cost-to-endpoint ratio and full bandwidth. For larger installation sizes, SF’s scalability translates to significant cost advantages, for example, 50% over full bandwidth non-blocking 3-level Fat Tree configurations [besta2014slim]. Overall, this effort will spearhead future research into more powerful network topologies.

Acknowledgments

We thank Colin McMurtrie, Mark Klein, Angelo Mangili, and the whole CSCS team for granting access to the Ault and Daint machines, and for their excellent technical support with the Slim Fly cluster infrastructure. We thank Timo Schneider for help with infrastructure at SPCL. This project received funding from the European Research Council (Project PSAP, No. 101002047), and the European High-Performance Computing Joint Undertaking (JU) under grant agreement No. 955513 (MAELSTROM). This project received funding from the European Union’s HE research and innovation programme under the grant agreement No. 101070141 (Project GLACIATION). This project was supported by JSPS KAKENHI Grant Number JP19H04119.


Appendix A Details of Slim Fly Construction

A.1 Selecting Topology Size, Parametrizing Input

Overall, one first chooses a prime power $q$ that satisfies the equation $q = 4w + \delta$ for some $\delta \in \{-1, 0, 1\}$ and $w \in \mathbb{N}$. $q$ is an input parameter that determines the whole topology structure. For example, the number of vertices (switches) is $N_r = 2q^2$ and the network radix $k' = \frac{3q - \delta}{2}$. In our case, $N_r = 50$, thus $q = 5$, which satisfies the equation $q = 4w + \delta$ for $w = 1$, $\delta = 1$, and $k' = 7$. Hence, every switch is connected to $7$ other switches. Interestingly, this construction forms the famous Hoffman-Singleton graph [hoffman1960moore, hafner2003hoffman], which is optimal with respect to the Moore Bound. Finally, as a regular and direct network, it is recommended to attach $p = \left\lceil \frac{k'}{2} \right\rceil$ endpoints to each switch to ensure full global bandwidth [besta2014slim]. In our case, $p = 4$.

A.2 Finding Needed Algebraic Structures

Once $q$ is selected, one uses it to construct several algebraic structures. Specifically, one finds a base ring $\mathbb{Z}_q$ (for us, $\mathbb{Z}_5 = \{0, 1, \dots, 4\}$), its primitive element $\xi$ that generates all elements of $\mathbb{Z}_q$ (for us, $\xi = 2$), and two generator sets $X = \{\xi^0, \xi^2, \dots, \xi^{q-3}\}$ and $X' = \{\xi^1, \xi^3, \dots, \xi^{q-2}\}$ (for our installation, $X = \{1, 4\}$ and $X' = \{2, 3\}$). While not complex, details on these structures are not necessary to understand our Slim Fly deployment. The interested readers may check them in the original publication [besta2014slim].
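For illustration, the structures for our installation can be reproduced with a few lines of code. This is a minimal sketch that assumes $q$ is prime, so that integer arithmetic modulo $q$ suffices (general prime powers require finite-field arithmetic), and that uses the generator-set formulas exactly as stated above; the function names are hypothetical.

```python
def primitive_element(q):
    """Smallest xi whose powers cover all non-zero residues modulo a prime q."""
    for xi in range(2, q):
        if len({pow(xi, e, q) for e in range(1, q)}) == q - 1:
            return xi
    raise ValueError(f"no primitive element found for q={q}")


def generator_sets(q, xi):
    """Even powers xi^0..xi^(q-3) form X, odd powers xi^1..xi^(q-2) form X'."""
    X = {pow(xi, e, q) for e in range(0, q - 2, 2)}
    X_prime = {pow(xi, e, q) for e in range(1, q - 1, 2)}
    return X, X_prime


if __name__ == "__main__":
    xi = primitive_element(5)
    print(xi, generator_sets(5, xi))
```

For $q = 5$, this yields $\xi = 2$, $X = \{1, 4\}$, and $X' = \{2, 3\}$, matching the values of our installation.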

A.3 Labeling and Connecting Switches

Each switch receives a 3-tuple label from a set $\{0,1\} \times \mathbb{Z}_q \times \mathbb{Z}_q$. Thus, SF switches come in two flavors determined by the first elements of their labels: $(0, \cdot, \cdot)$ and $(1, \cdot, \cdot)$. These labels determine how the switches are connected. Specifically, switches with labels $(0, \cdot, \cdot)$ are connected using the following equation [besta2014slim]:

$$\text{switch } (0,x,y) \text{ is connected to } (0,x,y') \iff y - y' \in X. \tag{1}$$

Symmetrically, switches with labels $(1, \cdot, \cdot)$ use the following equation:

$$\text{switch } (1,m,c) \text{ is connected to } (1,m,c') \iff c - c' \in X'. \tag{2}$$

Lastly, two switches with labels $(0, \cdot, \cdot)$ and $(1, \cdot, \cdot)$, respectively, are connected according to the following equation:

$$\text{switch } (0,x,y) \text{ is connected to } (1,m,c) \iff y = m \cdot x + c \tag{3}$$
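The three connection rules translate almost literally into code. The sketch below builds the switch graph for a prime $q$ (arithmetic modulo $q$; prime powers would need finite-field arithmetic) and checks switch count, network radix, and diameter with a plain breadth-first search. The function names are illustrative and not taken from our open-source routines.

```python
from collections import deque
from itertools import product


def build_slim_fly(q, X, X_prime):
    """Adjacency sets of the SF switch graph for a prime q, following Eqs. (1)-(3)."""
    adj = {label: set() for label in product((0, 1), range(q), range(q))}

    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for a, b, b2 in product(range(q), repeat=3):
        if (b - b2) % q in X:            # Eq. (1): inside a group of subgraph 0
            link((0, a, b), (0, a, b2))
        if (b - b2) % q in X_prime:      # Eq. (2): inside a group of subgraph 1
            link((1, a, b), (1, a, b2))
    for x, m, c in product(range(q), repeat=3):
        link((0, x, (m * x + c) % q), (1, m, c))   # Eq. (3): between subgraphs
    return adj


def diameter(adj):
    """Largest shortest-path distance, via one BFS per switch (connected graph assumed)."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    graph = build_slim_fly(5, {1, 4}, {2, 3})
    print(len(graph), {len(n) for n in graph.values()}, diameter(graph))
```

For $q = 5$, the result has 50 switches, each connected to 7 others, and diameter 2, in line with the parameters of Section A.1.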

A.4 Topology Structure & Physical Layout

The graph underlying the SF topology consists of two same-size subgraphs. One subgraph contains routers $(0,x,y)$, the other consists of routers $(1,m,c)$. Each subgraph contains $q$ identical groups of routers. Groups in different subgraphs usually differ from one another. There are no connections between groups within the same subgraph, i.e., no two routers $(0,x,y)$ from different groups are linked; the same holds for routers $(1,m,c)$. However, each group from one subgraph has connections to every group in the other subgraph; thus the groups form a fully connected bipartite graph.

This property facilitates physical layout and we use it in our installation. Specifically, as recommended in the original work [besta2014slim], we combine groups from different subgraphs pairwise; these combined groups form racks. In general, this leads to $q$ racks, each with $2q$ routers. In our installation, we have 5 racks, each with 10 routers and 40 compute nodes.
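To see what this layout means for cabling, one can count switch-to-switch links inside each rack and between each pair of racks. The sketch below assumes a prime $q$ and, as an illustrative pairing that is not prescribed by the text, places groups $(0, r, \cdot)$ and $(1, r, \cdot)$ into rack $r$; the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def slim_fly_links(q, X, X_prime):
    """Set of undirected switch-to-switch links (prime q), per Eqs. (1)-(3)."""
    links = set()
    for a, b, b2 in product(range(q), repeat=3):
        if (b - b2) % q in X:
            links.add(frozenset({(0, a, b), (0, a, b2)}))
        if (b - b2) % q in X_prime:
            links.add(frozenset({(1, a, b), (1, a, b2)}))
    for x, m, c in product(range(q), repeat=3):
        links.add(frozenset({(0, x, (m * x + c) % q), (1, m, c)}))
    return links


def cables_per_rack_pair(links):
    """Count links per (rack, rack) pair; assumed rack index = second label entry."""
    counts = Counter()
    for link in links:
        u, v = tuple(link)
        counts[tuple(sorted((u[1], v[1])))] += 1
    return counts


if __name__ == "__main__":
    counts = cables_per_rack_pair(slim_fly_links(5, {1, 4}, {2, 3}))
    for racks, n in sorted(counts.items()):
        print(racks, n)
```

For $q = 5$, every rack holds 15 intra-rack links and every pair of racks is joined by 10 inter-rack links, which corresponds to the bundles of ten optical fiber links described in Fig. 1.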

A.5 Constructing Slim Fly with N nodes

As the space of valid SF topologies is quite sparse, we show the simple steps needed to find a SF network with the number of nodes as close to N as possible:

  1. Obtain the cube root R of the desired node count N
  2. Find prime powers close to R
  3. Obtain the corresponding full-bandwidth network configurations (see previous sections)
  4. Verify network sizes and select the network that is closest to N in terms of number of supported nodes
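The steps above can be sketched as follows. The code uses only the formulas from Section A.1 ($N_r = 2q^2$, $k' = (3q - \delta)/2$, $p = \lceil k'/2 \rceil$); the search window `search_width` around the cube root and all function names are assumptions made for this example.

```python
def is_prime_power(n):
    """True if n = p^k for a prime p and k >= 1."""
    if n < 2:
        return False
    for p in range(2, n + 1):
        if n % p == 0:               # p is the smallest prime factor of n
            while n % p == 0:
                n //= p
            return n == 1
    return False


def slim_fly_config(q):
    """Full-bandwidth SF parameters for q, or None if q is not a valid choice."""
    delta = {0: 0, 1: 1, 3: -1}.get(q % 4)    # q = 4w + delta
    if delta is None or not is_prime_power(q):
        return None
    radix = (3 * q - delta) // 2              # network radix k'
    p = (radix + 1) // 2                      # ceil(k' / 2) endpoints per switch
    switches = 2 * q * q
    return {"q": q, "network_radix": radix, "endpoints_per_switch": p,
            "switches": switches, "endpoints": switches * p}


def closest_slim_fly(n_nodes, search_width=4):
    """Steps 1-4: scan prime powers near the cube root and pick the closest size."""
    r = round(n_nodes ** (1 / 3))
    candidates = []
    for q in range(max(2, r - search_width), r + search_width + 1):
        cfg = slim_fly_config(q)
        if cfg is not None:
            candidates.append(cfg)
    return min(candidates, key=lambda c: abs(c["endpoints"] - n_nodes))


if __name__ == "__main__":
    print(closest_slim_fly(200))
    print(closest_slim_fly(2048))
```

For N = 200, this selects $q = 5$ (50 switches, 200 nodes), and for N = 2048 it selects a network with 242 switches and 2178 nodes, matching the SF entry of the fixed-size cluster in Tab. 4.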

Appendix B Routing Details

B.1 Details of Layer Generation

We provide more details on crucial parts of layer generation.

B.1.1 Finding Almost-Minimal Paths

We look for almost-minimal paths that are exactly 3 hops long (one hop longer than SF’s diameter of two), while balancing the number of paths crossing each link (to avoid highly congested links). We do not target longer paths, in order to conserve network resources (i.e., a flow taking fewer hops occupies fewer buffers).

For this, we design a heuristic based on a modified breadth first search graph traversal starting from the source node $src$, constraining the path length to 3. In theory one could also define a range of valid lengths. The heuristic obtains the set $P$ of all valid paths starting in $src$ and ending in the destination node $dst$; $P = \{(u_1, \dots, u_l) \mid l = 3 \wedge u_1 = src \wedge u_l = dst\}$. Here, a path is considered valid if it satisfies the given length constraint $3$ and if its insertion into the layer does not affect any previously inserted paths. Then, we choose a path $p \in P$ that minimizes link weights, i.e., $\forall p' \in P:\ \omega(p) \leq \omega(p')$ where $\omega(p)$ is the sum of weights of links included in $p$.
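The following simplified sketch illustrates the selection step. It enumerates 3-hop paths with a depth-limited neighbor expansion instead of the modified breadth first search, and it omits the check against previously inserted paths; the data layout (adjacency sets, a dictionary of directed link weights) and the function names are assumptions for this example.

```python
def three_hop_paths(adj, src, dst):
    """All simple paths src -> a -> b -> dst with exactly 3 hops."""
    paths = []
    for a in adj[src]:
        if a == dst:
            continue
        for b in adj[a]:
            if b in (src, dst):
                continue
            if dst in adj[b]:
                paths.append((src, a, b, dst))
    return paths


def path_weight(path, weights):
    """omega(p): sum of the weights of the directed links on the path."""
    return sum(weights.get(link, 0) for link in zip(path, path[1:]))


def pick_min_weight_path(adj, src, dst, weights):
    """Least-loaded 3-hop path, or None if the candidate set is empty."""
    candidates = three_hop_paths(adj, src, dst)
    if not candidates:
        return None
    return min(candidates, key=lambda p: path_weight(p, weights))


if __name__ == "__main__":
    # Toy graph with two 3-hop routes from switch 0 to switch 3.
    adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 5}, 4: {0, 5}, 5: {3, 4}}
    print(pick_min_weight_path(adj, 0, 3, {(0, 1): 5}))
```

In the toy example, link (0, 1) already carries weight 5, so the route via switches 4 and 5 is chosen.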

B.1.2 Node Pair Priority Queue

The order in which the paths are inserted is very important, because it may impact whether we are able to find new paths. If one first found a given number of paths for a single node pair, and only then proceeded to the next node pair, some node pairs might not receive any, or much fewer, paths than other pairs. To alleviate this, we balance the total number of added almost-minimal paths across all node pairs. For this, each node pair is assigned a priority value, equal to its total number of almost-minimal paths across all layers; the lower the value, the more important it is to find a path for this node pair. Therefore, the number of required priority levels is upper-bounded by $|L|-1$, because each node pair can have at most one almost-minimal path in each of the $|L|-1$ layers, and is initially in the highest priority level (value of $0$). The lowest priority level is value $|L|-1$, which only contains node pairs that have had an almost-minimal path inserted in every layer.

Whenever a path is added to a layer, all of the node pairs that have a non-minimal path inserted have their priority decreased, i.e., their priority value is incremented and they move to the next priority level. For instance, in Fig. 16, assuming that the minimum length for an almost-minimal path is two, adding the illustrated path to a layer results in both node pairs $(v_1, v_4)$ and $(v_2, v_4)$ having an almost-minimal path added for them (assuming we allow paths of length $2$ and $3$ as non-minimal paths and $dst$ is one of the receiving nodes). Therefore, both of their priorities would be decreased by $1$. This also assumes that the paths were not already in this layer, which could have been the case for $(v_2, v_4)$.

The $node\_pairs$ list generated from the priority queue $p$ in Algorithm 1 contains the entries of the priority queue in the order of priority value, randomized within each level. Hence, the layer generation algorithm first tries to add an almost-minimal path for all node pairs of priority value $0$ in a random order, and then moves to the node pairs of the next value. In other words, it first processes all node pairs with no inserted paths, then those with one inserted path, and so forth, facilitating a balanced path distribution across node pairs.
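A minimal sketch of how such an ordered list could be produced from the priority values is given below. The dictionary-based representation of the priority queue and the function name are assumptions for illustration and do not mirror the data structures of Algorithm 1.

```python
import random
from collections import defaultdict


def ordered_node_pairs(priority, rng):
    """Node pairs sorted by priority value (0 first), shuffled within each level.

    `priority` maps a node pair to its number of almost-minimal paths so far.
    """
    levels = defaultdict(list)
    for pair, value in priority.items():
        levels[value].append(pair)
    ordered = []
    for value in sorted(levels):
        bucket = sorted(levels[value])   # fixed base order, so a seed reproduces runs
        rng.shuffle(bucket)              # random order within one priority level
        ordered.extend(bucket)
    return ordered


if __name__ == "__main__":
    priority = {("v1", "v4"): 1, ("v2", "v4"): 1, ("v3", "v4"): 0, ("v1", "v3"): 0}
    print(ordered_node_pairs(priority, random.Random(0)))
```

Pairs without any almost-minimal path are always listed before pairs that already have one, while the order inside a level changes with the seed.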

B.1.3 Path Weighting

A weight update is performed after the insertion of a new path into any layer. The weight of each link in any existing path is increased by the total number of new “routes” that now occupy the link. An example is shown in Fig. 15. The weight of link $(v_1, v_2)$ is increased by $9$ because it has $9$ new routes using it, as there are $3$ sending nodes ($a_1$-$a_3$) and $3$ receiving nodes ($b_1$-$b_3$). The weight of link $(v_3, v_4)$ is increased by $27$ as there are $9$ sending nodes ($a_1$-$a_9$) and $3$ receiving nodes ($b_1$-$b_3$).

Figure 15: Illustration of the weight update methodology employed by the algorithm. After the insertion of the path from $v_1$ to $v_4$, the weights of the links $(v_1, v_2)$, $(v_2, v_3)$ and $(v_3, v_4)$ are increased by $9$, $18$, and $27$, respectively.
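The example of Fig. 15 can be reproduced with the short sketch below. It assumes that every switch on the path has the same number of attached endpoints and that none of the affected routes was present in the layer before; the function name and the dictionary of directed link weights are illustrative.

```python
def update_link_weights(weights, path, endpoints_per_switch):
    """Add the number of new routes to every directed link of an inserted path.

    The i-th link of the path carries traffic from the endpoints of the first
    i switches to the endpoints attached to the final switch.
    """
    receivers = endpoints_per_switch
    for hop, link in enumerate(zip(path, path[1:]), start=1):
        senders = hop * endpoints_per_switch
        weights[link] = weights.get(link, 0) + senders * receivers
    return weights


if __name__ == "__main__":
    print(update_link_weights({}, ("v1", "v2", "v3", "v4"), endpoints_per_switch=3))
```

With 3 endpoints per switch, the three links of the path receive increments of 9, 18, and 27.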

B.1.4 Potential Invalidity of Paths

For a given source $src$ and destination $dst$, it may happen that $P = \emptyset$, in which case no almost-minimal path is added to a given layer for that node pair. There are two scenarios in which this may happen; we illustrate them in Fig. 16 and in Fig. 17. The first one occurs when a path for the node pair is already included in another path previously inserted into the layer. For instance, after the path in Fig. 16 is inserted into layer $l$, all sub-paths ($(v_2, v_4)$, $(v_3, v_4)$) become included as well, forcing $v_2$ and $v_3$ to route along minimal paths towards destination $v_4$ in layer $l$.

The second scenario occurs when no path of required length can be found because routing via any of the source node’s neighbors would result in a path too short or too long. In our second example, the almost-minimal paths are constrained to have length exactly $3$. At first, the two almost-minimal paths $q = (v_1, v_2, v_3, u_3)$ and $q' = (w_1, w_2, w_3, u_3)$ are inserted, which fixes the paths for all node pairs in the set $\{(v_i, u_3), (w_i, u_3) \mid i \in \{1, 2, 3\}\}$. Now any path for the node pair $(u_1, u_3)$ that respects the already inserted paths will have length $l \in \{1, 2, 4\}$ because it would have to come from the following set of paths: $\{(u_1, q), (u_1, q'), (u_1, u_3), (u_1, u_2, u_3), (u_1, u_2, v_2, v_3, u_3), (u_1, u_2, w_2, w_3, u_3)\}$. If this scenario occurs, we route minimally, i.e. path $(u_1, u_3)$.

Figure 16: Illustration of an almost-minimal path from $v_1$ to $v_4$, which enforces minimal routing from $src$ nodes like $a_7$, located on the sub-paths, to $dst$ nodes, i.e. $b_1$, in this layer.
Figure 17: Illustration of a scenario in which no almost-minimal, valid path of length exactly $3$ can be found for node pair $(u_1, u_3)$ in the given layer due to the prior insertion of two valid paths.

B.1.5 Specification of Forwarding Tables

In layered routing, each forwarding entry $(l, s, d) \in \text{layers} \times \text{switches} \times \text{switches}$ corresponds to the port that switch $s$ uses when routing in layer $l$ and transmitting a packet addressed to a switch $d$.
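As an illustration, such a table can be modeled as a dictionary keyed by $(l, s, d)$. In the sketch below, the output port is abstracted as the next-hop switch, which is a simplification for this example; the function name is hypothetical. Inserting a path fixes the entries of all switches along it, which is why sub-paths become fixed as described in Section B.1.4.

```python
def insert_path(table, layer, path):
    """Try to add `path` to `layer`; return False if it would alter existing entries.

    `table` maps (layer, switch, destination switch) to the next-hop switch,
    which stands in for the output port in this sketch.
    """
    dst = path[-1]
    new_entries = {(layer, s, dst): nxt for s, nxt in zip(path, path[1:])}
    for key, nxt in new_entries.items():
        if table.get(key, nxt) != nxt:   # conflicts with a previously inserted path
            return False
    table.update(new_entries)
    return True


if __name__ == "__main__":
    table = {}
    print(insert_path(table, 0, ("v1", "v2", "v3", "v4")))   # entries for v1, v2, v3
    print(insert_path(table, 0, ("v2", "v5", "v6", "v4")))   # v2 -> v4 is already fixed
    print(insert_path(table, 1, ("v2", "v5", "v6", "v4")))   # a different layer is free
```

The second insertion is rejected because the entry for switch v2 towards v4 in layer 0 was already fixed by the first path, whereas the same path can still be placed in another layer.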

Appendix C Additional Results

C.1 Changes for Custom Alltoall

We decided not to use Open MPI’s default implementation of alltoall, as the algorithms it relies on result in sub-optimal performance for the deployed SF. Empirically, we determined that the best-performing alltoall for our system was a simple algorithm that posts all non-blocking send and receive requests simultaneously and then waits for completion. Other collectives did not show a similar impact, and we thus used the default implementations. These issues are not expected with newer hardware.

Figure 18: Runtime of scientific workloads (lower is better) - SF R vs. FT
(a) SF R vs. FT
(b) SF L vs. FT
Figure 19: Runtime of additional scientific workloads (lower is better)
Figure 20: Performance of HPC benchmarks (higher is better) - SF R vs. FT
Figure 21: Iteration time of DNN proxy workloads (lower is better) SF R vs. FT and routing improvement of this work over DFSSSP (heatmap) in SF R

C.2 Scientific Workloads & HPC Benchmarks

We show in Fig. 18 the runtime and relative performance of the solver/kernel for each of the scientific workloads on SF using the random placement strategy. We observe similar trends as for the linear placement strategy for all scientific workloads, and SF’s performance aligns closely with FT’s. No significant speedup or slowdown through the use of non-minimal paths could be observed.

In Fig. 19, we present the relative performance of two additional scientific workloads, AMG [HENSON2002155] and MiniFE [minife], on SF, using both placement strategies. For this assessment, AMG was configured with a $128^{3}$ cube per process, while MiniFE was set with grid input dimensions of $n_{{x|y|z}_{b}} = 90$. In accordance with these configurations, clear weak-scaling behavior is evident under the random placement strategy. With the linear placement strategy, on the other hand, the observed trends are less distinct, and there are instances where SF outperformed FT by unexpected margins. We believe that this disparity cannot be attributed merely to variations in communication stemming from the placement strategy, as the applications under consideration are not generally communication-bound and the FT is fully non-blocking. However, the precise cause remains unclear.

Fig. 20 shows the performance of the HPC benchmarks on SF using the random placement strategy, results that largely mirror those obtained using the linear placement strategy.

C.3 Deep Learning Workloads

The left part of Fig. 21 shows the runtime and relative performance of the DNN proxies with the random placement strategy. The results are again very similar to those obtained using the linear placement strategy, including GPT-3 matching the performance trends of the MPI Allreduce pattern with the random placement strategy and comparable node configurations (cf. Fig. 11(b)).

Similar to previous results, the right part of Fig. 21 shows that our work generally matches or outperforms DFSSSP, achieving up to a 1.18x speedup.

Appendix D Pricing details

We based our pricing on data from colfaxdirect.com and SHI.com. For the equipment selection, we used the InfiniBand Topology Configurator (the Mellanox InfiniBand Topology Generator tool). For different switch sizes, we selected different models from current Nvidia offerings. For a 36-port switch, we chose the Mellanox SB7800 EDR 100Gb/s. For a 40-port switch, we chose the Mellanox Quantum QM8700 HDR 200Gb/s. Finally, for a 64-port switch, we chose the Nvidia QM9700 NDR 400G model. For AoC cables, we selected active fiber links, and for DAC cables, we chose passive copper cables for endpoint connections. Again, we base our estimates on the aforementioned InfiniBand Topology Configurator online service. However, determining the cost of networking hardware can be challenging because prices vary greatly with the quantity ordered, and large orders may be eligible for substantial discounts.