Search | arXiv e-print repository

Efficient RDMA Communication Protocols

Authors: Konstantin Taranov, Fabian Fischer, Torsten Hoefler

Abstract: Developers of networked systems often work with low-level RDMA libraries to tailor network modules to take full advantage of offload capabilities offered by RDMA-capable network controllers. Because of the huge design space of networked data access protocols and variability in capabilities of RDMA infrastructure, developers tend to reinvent and reimplement common data exchange protocols, wasting m… ▽ More Developers of networked systems often work with low-level RDMA libraries to tailor network modules to take full advantage of offload capabilities offered by RDMA-capable network controllers. Because of the huge design space of networked data access protocols and variability in capabilities of RDMA infrastructure, developers tend to reinvent and reimplement common data exchange protocols, wasting months of development yet missing various performance and system capabilities. In this work, we summarise and categorize RDMA data exchange protocols and elaborate on what features they can offer to networked systems and what implications they have on their memory and network management. △ Less

Submitted 20 December, 2022; v1 submitted 18 December, 2022; originally announced December 2022.

arXiv:2210.15315 [pdf, other]

Noise in the Clouds: Influence of Network Performance Variability on Application Scalability

Authors: Daniele De Sensi, Tiziano De Matteis, Konstantin Taranov, Salvatore Di Girolamo, Tobias Rahn, Torsten Hoefler

Abstract: Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introd… ▽ More Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introduce network noise that can eventually limit the application scaling. This work analyzes network performance, scalability, and cost of running HPC workloads on cloud systems. First, we consider latency, bandwidth, and collective communication patterns in detailed small-scale measurements, and then we simulate network performance at a larger scale. We validate our approach on four popular cloud providers and three on-premise HPC systems, showing that network (and also OS) noise can significantly impact performance and cost both at small and large scale. △ Less

Submitted 1 November, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: To appear in SIGMETRICS 2023

ACM Class: C.2; C.4

arXiv:2206.10007 [pdf, other]

Building Blocks for Network-Accelerated Distributed File Systems

Authors: Salvatore Di Girolamo, Daniele De Sensi, Konstantin Taranov, Milos Malesevic, Maciej Besta, Timo Schneider, Severin Kistler, Torsten Hoefler

Abstract: High-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, applications are doomed to become I/O bound and waste compute cycles. To accelerate the data path to remote storage nodes, remote direct memory access (RDMA) has been embraced by storage systems to let data flow from the network to storage targets, reduci… ▽ More High-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, applications are doomed to become I/O bound and waste compute cycles. To accelerate the data path to remote storage nodes, remote direct memory access (RDMA) has been embraced by storage systems to let data flow from the network to storage targets, reducing overall latency and CPU utilization. Yet, this approach still involves CPUs on the data path to enforce storage policies such as authentication, replication, and erasure coding. We show how storage policies can be offloaded to fully programmable SmartNICs, without involving host CPUs. By using PsPIN, an open-hardware SmartNIC, we show latency improvements for writes (up to 2x), data replication (up to 2x), and erasure coding (up to 2x), when compared to respective CPU- and RDMA-based alternatives. △ Less

Submitted 20 June, 2022; originally announced June 2022.

arXiv:2203.14859 [pdf, other]

doi 10.1145/3625549.3658661

FaaSKeeper: Learning from Building Serverless Services with ZooKeeper as an Example

Authors: Marcin Copik, Alexandru Calotoiu, Pengyu Zhou, Konstantin Taranov, Torsten Hoefler

Abstract: FaaS (Function-as-a-Service) revolutionized cloud computing by replacing persistent virtual machines with dynamically allocated resources. This shift trades locality and statefulness for a pay-as-you-go model more suited to variable and infrequent workloads. However, the main challenge is to adapt services to the serverless paradigm while meeting functional, performance, and consistency requiremen… ▽ More FaaS (Function-as-a-Service) revolutionized cloud computing by replacing persistent virtual machines with dynamically allocated resources. This shift trades locality and statefulness for a pay-as-you-go model more suited to variable and infrequent workloads. However, the main challenge is to adapt services to the serverless paradigm while meeting functional, performance, and consistency requirements. In this work, we push the boundaries of FaaS computing by designing a serverless variant of ZooKeeper, a centralized coordination service with a safe and wait-free consensus mechanism. We define synchronization primitives to extend the capabilities of scalable cloud storage and outline a set of requirements for efficient computing with serverless. In FaaSKeeper, the first coordination service built on serverless functions and cloud-native services, we explore the limitations of serverless offerings and propose improvements essential for complex and latency-sensitive applications. We share serverless design lessons based on our experiences of implementing a ZooKeeper model deployable to clouds today. FaaSKeeper maintains the same consistency guarantees and interface as ZooKeeper, with a serverless price model that lowers costs up to 110-719x on infrequent workloads. △ Less

Submitted 1 May, 2024; v1 submitted 28 March, 2022; originally announced March 2022.

Comments: Extended version of the paper accepted for publication at the 33rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC)

arXiv:2202.08080 [pdf, other]

NeVerMore: Exploiting RDMA Mistakes in NVMe-oF Storage Applications

Authors: Konstantin Taranov, Benjamin Rothenberger, Daniele De Sensi, Adrian Perrig, Torsten Hoefler

Abstract: This paper presents a security analysis of the InfiniBand architecture, a prevalent RDMA standard, and NVMe-over-Fabrics (NVMe-oF), a prominent protocol for industrial disaggregated storage that exploits RDMA protocols to achieve low-latency and high-bandwidth access to remote solid-state devices. Our work, NeVerMore, discovers new vulnerabilities in RDMA protocols that unveils several attack vect… ▽ More This paper presents a security analysis of the InfiniBand architecture, a prevalent RDMA standard, and NVMe-over-Fabrics (NVMe-oF), a prominent protocol for industrial disaggregated storage that exploits RDMA protocols to achieve low-latency and high-bandwidth access to remote solid-state devices. Our work, NeVerMore, discovers new vulnerabilities in RDMA protocols that unveils several attack vectors on RDMA-enabled applications and the NVMe-oF protocol, showing that the current security mechanisms of the NVMe-oF protocol do not address the security vulnerabilities posed by the use of RDMA. In particular, we show how an unprivileged user can inject packets into any RDMA connection created on a local network controller, bypassing security mechanisms of the operating system and its kernel, and how the injection can be used to acquire unauthorized block access to NVMe-oF devices. Overall, we implement four attacks on RDMA protocols and seven attacks on the NVMe-oF protocol and verify them on the two most popular implementations of NVMe-oF: SPDK and the Linux kernel. To mitigate the discovered attacks we propose multiple mechanisms that can be implemented by RDMA and NVMe-oF providers. △ Less

Submitted 16 February, 2022; originally announced February 2022.

arXiv:2106.13859 [pdf, other]

rFaaS: Enabling High Performance Serverless with RDMA and Leases

Authors: Marcin Copik, Konstantin Taranov, Alexandru Calotoiu, Torsten Hoefler

Abstract: High performance is needed in many computing systems, from batch-managed supercomputers to general-purpose cloud platforms. However, scientific clusters lack elastic parallelism, while clouds cannot offer competitive costs for high-performance applications. In this work, we investigate how modern cloud programming paradigms can bring the elasticity needed to allocate idle resources, decreasing com… ▽ More High performance is needed in many computing systems, from batch-managed supercomputers to general-purpose cloud platforms. However, scientific clusters lack elastic parallelism, while clouds cannot offer competitive costs for high-performance applications. In this work, we investigate how modern cloud programming paradigms can bring the elasticity needed to allocate idle resources, decreasing computation costs and improving overall data center efficiency. Function-as-a-Service (FaaS) brings the pay-as-you-go execution of stateless functions, but its performance characteristics cannot match coarse-grained cloud and cluster allocations. To make serverless computing viable for high-performance and latency-sensitive applications, we present rFaaS, an RDMA-accelerated FaaS platform. We identify critical limitations of serverless - centralized scheduling and inefficient network transport - and improve the FaaS architecture with allocation leases and microsecond invocations. We show that our remote functions add only negligible overhead on top of the fastest available networks, and we decrease the execution latency by orders of magnitude compared to contemporary FaaS systems. Furthermore, we demonstrate the performance of rFaaS by evaluating real-world FaaS benchmarks and parallel applications. Overall, our results show that new allocation policies and remote memory access help FaaS applications achieve high performance and bring serverless computing to HPC. △ Less

Submitted 15 May, 2023; v1 submitted 25 June, 2021; originally announced June 2021.

Comments: Accepted for publication in the 2023 International Parallel and Distributed Processing Symposium (IPDPS)

arXiv:2106.07102 [pdf, other]

Farview: Disaggregated Memory with Operator Off-loading for Database Engines

Authors: Dario Korolija, Dimitrios Koutsoukos, Kimberly Keeton, Konstantin Taranov, Dejan Milojičić, Gustavo Alonso

Abstract: Cloud deployments disaggregate storage from compute, providing more flexibility to both the storage and compute layers. In this paper, we explore disaggregation by taking it one step further and applying it to memory (DRAM). Disaggregated memory uses network attached DRAM as a way to decouple memory from CPU. In the context of databases, such a design offers significant advantages in terms of maki… ▽ More Cloud deployments disaggregate storage from compute, providing more flexibility to both the storage and compute layers. In this paper, we explore disaggregation by taking it one step further and applying it to memory (DRAM). Disaggregated memory uses network attached DRAM as a way to decouple memory from CPU. In the context of databases, such a design offers significant advantages in terms of making a larger memory capacity available as a central pool to a collection of smaller processing nodes. To explore these possibilities, we have implemented Farview, a disaggregated memory solution for databases, operating as a remote buffer cache with operator offloading capabilities. Farview is implemented as an FPGA-based smart NIC making DRAM available as a disaggregated, network attached memory module capable of performing data processing at line rate over data streams to/from disaggregated memory. Farview supports query offloading using operators such as selection, projection, aggregation, regular expression matching and encryption. In this paper we focus on analytical queries and demonstrate the viability of the idea through an extensive experimental evaluation of Farview under different workloads. Farview is competitive with a local buffer cache solution for all the workloads and outperforms it in a number of cases, proving that a smart disaggregated memory can be a viable alternative for databases deployed in cloud environments. △ Less

Submitted 13 June, 2021; originally announced June 2021.

Comments: 12 pages

arXiv:1908.08590 [pdf, other]

doi 10.1145/3295500.3356189

Network-Accelerated Non-Contiguous Memory Transfers

Authors: Salvatore Di Girolamo, Konstantin Taranov, Andreas Kurth, Michael Schaffner, Timo Schneider, Jakub Beránek, Maciej Besta, Luca Benini, Duncan Roweth, Torsten Hoefler

Abstract: Applications often communicate data that is non-contiguous in the send- or the receive-buffer, e.g., when exchanging a column of a matrix stored in row-major order. While non-contiguous transfers are well supported in HPC (e.g., MPI derived datatypes), they can still be up to 5x slower than contiguous transfers of the same size. As we enter the era of network acceleration, we need to investigate w… ▽ More Applications often communicate data that is non-contiguous in the send- or the receive-buffer, e.g., when exchanging a column of a matrix stored in row-major order. While non-contiguous transfers are well supported in HPC (e.g., MPI derived datatypes), they can still be up to 5x slower than contiguous transfers of the same size. As we enter the era of network acceleration, we need to investigate which tasks to offload to the NIC: In this work we argue that non-contiguous memory transfers can be transparently networkaccelerated, truly achieving zero-copy communications. We implement and extend sPIN, a packet streaming processor, within a Portals 4 NIC SST model, and evaluate strategies for NIC-offloaded processing of MPI datatypes, ranging from datatype-specific handlers to general solutions for any MPI datatype. We demonstrate up to 10x speedup in the unpack throughput of real applications, demonstrating that non-contiguous memory transfers are a first-class candidate for network acceleration. △ Less

Submitted 22 August, 2019; originally announced August 2019.

Comments: In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), Nov. 2019

arXiv:1709.05483 [pdf, other]

sPIN: High-performance streaming Processing in the Network

Authors: Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, Ron Brightwell

Abstract: Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today's network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload s… ▽ More Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today's network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload simple packet processing functions to the network card. To demonstrate the potential of the model, we design a cycle-accurate simulation environment by combining the network simulator LogGOPSim and the CPU simulator gem5. We implement offloaded message matching, datatype processing, and collective communications and demonstrate transparent full-application speedups. Furthermore, we show how sPIN can be used to accelerate redundant in-memory filesystems and several other use cases. Our work investigates a portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL. We show how such network acceleration enables an eco-system that can significantly speed up applications and system services. △ Less

Submitted 19 October, 2017; v1 submitted 16 September, 2017; originally announced September 2017.

Comments: 20 pages

Showing 1–9 of 9 results for author: Taranov, K