-
LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning
Authors:
Shixiong Qi,
K. K. Ramakrishnan,
Myungjin Lee
Abstract:
Federated Learning (FL) typically involves a large-scale, distributed system with individual user devices/servers training models locally and then aggregating their model updates on a trusted central server. Existing systems for FL often use an always-on server for model aggregation, which can be inefficient in terms of resource utilization. They may also be inelastic in their resource management. This is particularly exacerbated when aggregating model updates at scale in a highly dynamic environment with varying numbers of heterogeneous user devices/servers.
We present LIFL, a lightweight and elastic serverless cloud platform with fine-grained resource management for efficient FL aggregation at scale. LIFL is enhanced by a streamlined, event-driven serverless design that eliminates the individual heavy-weight message broker and replaces inefficient container-based sidecars with lightweight eBPF-based proxies. We leverage shared memory processing to achieve high-performance communication for hierarchical aggregation, which is commonly adopted to speed up FL aggregation at scale. We further introduce locality-aware placement in LIFL to maximize the benefits of shared memory processing. LIFL precisely scales and carefully reuses the resources for hierarchical aggregation to achieve the highest degree of parallelism while minimizing the aggregation time and resource consumption. Our experimental results show that LIFL achieves significant improvement in resource efficiency and aggregation speed for supporting FL at scale, compared to existing serverful and serverless FL systems.
Submitted 5 May, 2024;
originally announced May 2024.
-
D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs
Authors:
Aditya Dhakal,
Sameer G. Kulkarni,
K. K. Ramakrishnan
Abstract:
Hardware accelerators such as GPUs are required for real-time, low-latency inference with Deep Neural Networks (DNN). However, due to the inherent limits to the parallelism they can exploit, DNNs often under-utilize the capacity of today's high-end accelerators. Although spatial multiplexing of the GPU leads to higher GPU utilization and higher inference throughput, there remain a number of challenges. Finding the GPU percentage for right-sizing the GPU for each DNN through profiling, determining an optimal batching of requests to balance throughput improvement while meeting application-specific deadlines and service level objectives (SLOs), and maximizing throughput by appropriately scheduling DNNs are still significant challenges. This paper introduces a dynamic and fair spatio-temporal scheduler (D-STACK) that enables multiple DNNs to run in the GPU concurrently. To help allocate the appropriate GPU percentage (we call it the "Knee"), we develop and validate a model that estimates the parallelism each DNN can utilize. We also develop a lightweight optimization formulation to find an efficient batch size for each DNN operating with D-STACK. We bring together our optimizations and our spatio-temporal scheduler to provide a holistic inference framework. We demonstrate its ability to provide high throughput while meeting application SLOs. We compare D-STACK with an ideal scheduler that can allocate the right GPU percentage for every DNN kernel. D-STACK gets higher than 90 percent throughput and GPU utilization compared to the ideal scheduler. We also compare D-STACK with other GPU multiplexing and scheduling methods (e.g., NVIDIA Triton, Clipper, Nexus), using popular DNN models. Our controlled experiments with multiplexing several popular DNN models achieve up to 1.6X improvement in GPU utilization and up to 4X improvement in inference throughput.
Submitted 31 March, 2023;
originally announced April 2023.
-
MiddleNet: A Unified, High-Performance NFV and Middlebox Framework with eBPF and DPDK
Authors:
Shixiong Qi,
Ziteng Zeng,
Leslie Monis,
K. K. Ramakrishnan
Abstract:
Traditional network resident functions (e.g., firewalls, network address translation) and middleboxes (caches, load balancers) have moved from purpose-built appliances to software-based components. However, L2/L3 network functions (NFs) are being implemented on Network Function Virtualization (NFV) platforms that extensively exploit kernel-bypass technology. They often use DPDK for zero-copy delivery and high performance. On the other hand, L4/L7 middleboxes, which have a greater emphasis on functionality, take advantage of a full-fledged kernel-based system.
L2/L3 NFs and L4/L7 middleboxes continue to be handled by distinct platforms on different nodes. This paper proposes MiddleNet that develops a unified network resident function framework that supports L2/L3 NFs and L4/L7 middleboxes. MiddleNet supports function chains that are essential in both NFV and middlebox environments. MiddleNet uses the Data Plane Development Kit (DPDK) library for zero-copy packet delivery without interrupt-based processing, to enable the "bump-in-the-wire" L2/L3 processing performance required of NFV. To support L4/L7 middlebox functionality, MiddleNet utilizes a consolidated, kernel-based protocol stack for processing, avoiding a dedicated protocol stack for each function. MiddleNet fully exploits the event-driven capabilities of the extended Berkeley Packet Filter (eBPF) and seamlessly integrates it with shared memory for high-performance communication in L4/L7 middlebox function chains. The overheads for MiddleNet in L4/L7 are strictly load-proportional, without needing the dedicated CPU cores of DPDK-based approaches. MiddleNet supports flow-dependent packet processing by leveraging Single Root I/O Virtualization (SR-IOV) to dynamically select the packet processing needed (Layers 2 - 7). Our experimental results show that MiddleNet achieves high performance in such a unified environment.
Submitted 30 March, 2023; v1 submitted 8 March, 2023;
originally announced March 2023.
-
Analyzing Open-Source Serverless Platforms: Characteristics and Performance
Authors:
Junfeng Li,
Sameer G. Kulkarni,
K. K. Ramakrishnan,
Dan Li
Abstract:
Serverless computing is increasingly popular because of its lower cost and easier deployment. Several cloud service providers (CSPs) offer serverless computing on their public clouds, but this may bring the risk of vendor lock-in. To avoid this limitation, many open-source serverless platforms have emerged that allow developers to freely deploy and manage functions on self-hosted clouds. However, building effective functions requires much expertise and thorough comprehension of platform frameworks and features that affect performance. It is a challenge for a service developer to differentiate and select the appropriate serverless platform for different demands and scenarios. Thus, we elaborate the frameworks and event processing models of four popular open-source serverless platforms and identify their salient idiosyncrasies. We analyze the root causes of performance differences between different service exporting and auto-scaling modes on those platforms. Further, we provide several insights for future work, such as auto-scaling and metric collection.
Submitted 4 June, 2021;
originally announced June 2021.
-
CoShare: An Efficient Approach for Redundancy Allocation in NFV
Authors:
Yordanos Tibebu Woldeyohannes,
Besmir Tola,
Yuming Jiang,
K. K. Ramakrishnan
Abstract:
An appealing feature of Network Function Virtualization (NFV) is that in an NFV-based network, a network function (NF) instance may be placed at any node. On the one hand this offers great flexibility in allocation of redundant instances, but on the other hand it makes the allocation a unique and difficult challenge. One particular concern is that there is inherent correlation among nodes due to the structure of the network, thus requiring special care in this allocation. To this aim, our novel approach, called CoShare, is proposed. Firstly, its design takes into consideration the effect of network structural dependency, which might result in the unavailability of nodes of a network after failure of a node. Secondly, to efficiently make use of resources, CoShare proposes the idea of shared reservation, where multiple flows may be allowed to share the same reserved backup capacity at an NF instance. Furthermore, CoShare factors in the heterogeneity in nodes, NF instances and availability requirements of flows in the design. The results from a number of experiments conducted using realistic network topologies show that the integration of structural dependency allows meeting availability requirements for more flows compared to a baseline approach. Specifically, CoShare is able to meet diverse availability requirements in a resource-efficient manner, requiring, e.g., up to 85% in some studied cases, less resource overbuild than the baseline approach that uses the idea of dedicated reservation commonly adopted for redundancy allocation in NFV.
Submitted 22 November, 2021; v1 submitted 31 August, 2020;
originally announced August 2020.
-
Spatial Sharing of GPU for Autotuning DNN models
Authors:
Aditya Dhakal,
Junguk Cho,
Sameer G. Kulkarni,
K. K. Ramakrishnan,
Puneet Sharma
Abstract:
GPUs are used for training, inference, and tuning of machine learning models. However, Deep Neural Networks (DNNs) vary widely in their ability to exploit the full power of high-performance GPUs. Spatial sharing of GPU enables multiplexing several DNNs on the GPU and can improve GPU utilization, thus improving throughput and lowering latency. DNN models given just the right amount of GPU resources can still provide low inference latency, just as much as dedicating all of the GPU for their inference task. An approach to improve DNN inference is tuning of the DNN model. Autotuning frameworks find the optimal low-level implementation for a certain target device based on the trained machine learning model, thus reducing the DNN's inference latency and increasing inference throughput. We observe an interdependency between the tuned model and its inference latency. A DNN model tuned with specific GPU resources provides the best inference latency when inferred with close to the same amount of GPU resources. In contrast, a model tuned with the maximum amount of the GPU's resources has poorer inference latency once the GPU resources are limited for inference. On the other hand, a model tuned with an appropriate amount of GPU resources still achieves good inference latency across a wide range of GPU resource availability. We explore the causes that impact the tuning of a model at different amounts of GPU resources. We present many techniques to maximize resource utilization and improve tuning performance. We enable controlled spatial sharing of GPU to multiplex several tuning applications on the GPU. We scale the tuning server instances and shard the tuning model across multiple client instances for concurrent tuning of different operators of a model, achieving better GPU multiplexing. With our improvements, we decrease DNN autotuning time by up to 75 percent and increase throughput by a factor of 5.
Submitted 8 August, 2020;
originally announced August 2020.
-
Understanding Open Source Serverless Platforms: Design Considerations and Performance
Authors:
Junfeng Li,
Sameer G. Kulkarni,
K. K. Ramakrishnan,
Dan Li
Abstract:
Serverless computing is increasingly popular because of the promise of lower cost and the convenience it provides to users who do not need to focus on server management. This has resulted in the availability of a number of proprietary and open-source serverless solutions. We seek to understand how the performance of serverless computing depends on a number of design issues using several popular open-source serverless platforms. We identify the idiosyncrasies affecting performance (throughput and latency) for different open-source serverless platforms. Further, we observe that just having either resource-based (CPU and memory) or workload-based (request per second (RPS) or concurrent requests) auto-scaling is inadequate to address the needs of the serverless platforms.
Submitted 12 December, 2019; v1 submitted 18 November, 2019;
originally announced November 2019.
-
SDNFV: Flexible and Dynamic Software Defined Control of an Application- and Flow-Aware Data Plane
Authors:
Wei Zhang,
Guyue Liu,
Timothy Wood,
K. K. Ramakrishnan,
Jinho Hwang
Abstract:
Software Defined Networking (SDN) promises greater flexibility for directing packet flows, and Network Function Virtualization promises to enable dynamic management of software-based network functions. However, the current divide between an intelligent control plane and an overly simple, stateless data plane results in the inability to exploit the flexibility of a software based network. In this paper we propose SDNFV, a framework that expands the capabilities of network processing-and-forwarding elements to flexibly manage packet flows, while retaining both a high performance data plane and an easily managed control plane.
SDNFV proposes a hierarchical control framework where decisions are made across the SDN controller, a host-level manager, and individual VMs to best exploit state available at each level. This increases the network's flexibility compared to existing SDNs where controllers often make decisions solely based on the first packet header of a flow. SDNFV intelligently places network services across hosts and connects them in sequential and parallel chains, giving both the SDN controller and individual network functions the ability to enhance and update flow rules to adapt to changing conditions. Our prototype demonstrates how to efficiently and flexibly reroute flows based on data plane state such as packet payloads and traffic characteristics.
Submitted 8 June, 2016;
originally announced June 2016.
-
SAID: A Control Protocol for Scalable and Adaptive Information Dissemination in ICN
Authors:
Jiachen Chen,
Mayutan Arumaithurai,
Xiaoming Fu,
K. K. Ramakrishnan
Abstract:
Information dissemination applications (video, news, social media, etc.) with a large number of receivers need to be efficient but also have limited loss tolerance. The new Information-Centric Networks (ICN) paradigm offers an alternative approach for reliably delivering data by naming content and exploiting data available at any intermediate point (e.g., caches). However, receivers are often heterogeneous, with widely varying receive rates. When using existing ICN congestion control mechanisms with in-sequence delivery, a particularly thorny problem of receivers going out-of-sync results in inefficiency and unfairness with heterogeneous receivers. We argue that separating reliability from congestion control leads to more scalable, efficient and fair data dissemination, and propose SAID, a Control Protocol for Scalable and Adaptive Information Dissemination in ICN. To maximize the amount of data transmitted at the first attempt, receivers request any next packet (ANP) of a flow instead of the next-in-sequence packet, independent of the provider's transmit rate. This allows providers to transmit at an application-efficient rate, without being limited by the slower receivers. SAID ensures reliable delivery to all receivers eventually, by cooperative repair, while preserving privacy without unduly trusting other receivers.
Submitted 28 October, 2015;
originally announced October 2015.
-
Evaluating Opportunistic Delivery of Large Content with TCP over WiFi in I2V Communication
Authors:
Shreyasee Mukherjee,
Kai Su,
Narayan B. Mandayam,
K. K. Ramakrishnan,
Dipankar Raychaudhuri,
Ivan Seskar
Abstract:
With the increasing interest in connected vehicles, it is useful to evaluate the capability of delivering large content over a WiFi infrastructure to vehicles. The throughput achieved over WiFi channels can be highly variable and also rapidly degrades as the distance from the access point increases. While this behavior is well understood at the data link layer, the interactions across the various protocol layers (data link and up through the transport layer) and the effect of mobility may reduce the amount of content transferred to the vehicle, as it travels along the roadway.
This paper examines the throughput achieved at the TCP layer over a carefully designed outdoor WiFi environment and the interactions across the layers that impact the performance achieved, as a function of the receiver mobility. The experimental studies conducted reveal that impairments over the WiFi link (frame loss, ARQ and increased delay) and the residual loss seen by TCP cause a cascade of duplicate ACKs to be generated. This triggers large congestion window reductions at the sender, leading to a drastic degradation of throughput to the vehicular client. To ensure outdoor WiFi infrastructures have the potential to sustain reasonable downlink throughput for drive-by vehicles, we speculate that there is a need to adapt how WiFi and TCP (as well as mobility protocols) function for such vehicular applications.
Submitted 9 October, 2014;
originally announced October 2014.
-
Opportunities in a Federated Cloud Marketplace
Authors:
Hamed Haddadi,
Georgios Smaragdakis,
K. K. Ramakrishnan
Abstract:
Recent measurement studies show that there are massively distributed hosting and computing infrastructures deployed in the Internet. Such infrastructures include large data centers and organizations' computing clusters. When idle, these resources can readily serve local users. Such users can be smartphone or tablet users wishing to access services such as remote desktop or CPU/bandwidth intensive activities, particularly when they are likely to have high-latency access, or no access at all, to centralized cloud providers. Today, however, there is no global marketplace where sellers and buyers of available resources can trade. The recently introduced marketplaces of Amazon and other cloud infrastructures are limited by the network footprint of their own infrastructures and availability of such services in the target country and region. In this article we discuss the potential for a federated cloud marketplace where sellers and buyers of a number of resources, including storage, computing, and network bandwidth, can freely trade. This ecosystem can be regulated through brokers who act as service level monitors and auctioneers. We conclude by discussing the challenges and opportunities in this space.
Submitted 8 May, 2014;
originally announced May 2014.
-
Internames: a name-to-name principle for the future Internet
Authors:
Nicola Blefari Melazzi,
Andrea Detti,
Mayutan Arumaithurai,
K. K. Ramakrishnan
Abstract:
We propose Internames, an architectural framework in which names are used to identify all entities involved in communication: contents, users, devices, logical as well as physical points involved in the communication, and services. By not having a static binding between the name of a communication entity and its current location, we allow entities to be mobile, enable them to be reached by any of a number of basic communication primitives, enable communication to span networks with different technologies and allow for disconnected operation. Furthermore, with the ability to communicate between names, the communication path can be dynamically bound to any of a number of end-points, and the end-points themselves could change as needed. A key benefit of our architecture is its ability to accommodate gradual migration from the current IP infrastructure to a future that may be a ubiquitous Information Centric Network. Basic building blocks of Internames are: i) a name-based Application Programming Interface; ii) a separation of identifiers (names) and locators; iii) a powerful Name Resolution Service (NRS) that dynamically maps names to locators, as a function of time/location/context/service; iv) a built-in capacity of evolution, allowing a transparent migration from current networks and the ability to include as particular cases current specific architectures. To achieve this vision, shared by many other researchers, we exploit and expand on Information Centric Networking principles, extending ICN functionality beyond content retrieval, easing send-to-name and push services, and allowing to use names also to route data in the return path. A key role in this architecture is played by the NRS, which allows for the co-existence of multiple network "realms", including current IP and non-IP networks, glued together by a name-to-name overarching communication primitive.
Submitted 31 December, 2013;
originally announced January 2014.
-
Design and Characterization of a Full-duplex Multi-antenna System for WiFi networks
Authors:
Melissa Duarte,
Ashutosh Sabharwal,
Vaneet Aggarwal,
Rittwik Jana,
K. K. Ramakrishnan,
Christopher Rice,
N. K. Shankaranarayanan
Abstract:
In this paper, we present an experimental and simulation based study to evaluate the use of full-duplex as a mode in practical IEEE 802.11 networks. To enable the study, we designed a 20 MHz multi-antenna OFDM full-duplex physical layer and a full-duplex capable MAC protocol which is backward compatible with current 802.11. Our extensive over-the-air experiments, simulations and analysis demonstrate the following two results. First, the use of multiple antennas at the physical layer leads to a higher ergodic throughput than its hardware-equivalent multi-antenna half-duplex counterparts, for SNRs above the median SNR encountered in practical WiFi deployments. Second, the proposed MAC translates the physical layer rate gain into near doubling of throughput for multi-node single-AP networks. The two combined results allow us to conclude that there are potentially significant benefits gained from including a full-duplex mode in future WiFi standards.
Submitted 7 October, 2012; v1 submitted 4 October, 2012;
originally announced October 2012.
-
Intra- and Inter-Session Network Coding in Wireless Networks
Authors:
Hulya Seferoglu,
Athina Markopoulou,
K. K. Ramakrishnan
Abstract:
In this paper, we are interested in improving the performance of constructive network coding schemes in lossy wireless environments. We propose I2NC - a cross-layer approach that combines inter-session and intra-session network coding and has two strengths. First, the error-correcting capabilities of intra-session network coding make our scheme resilient to loss. Second, redundancy allows intermediate nodes to operate without knowledge of the decoding buffers of their neighbors. Based only on the knowledge of the loss rates on the direct and overhearing links, intermediate nodes can make decisions for both intra-session (i.e., how much redundancy to add in each flow) and inter-session (i.e., what percentage of flows to code together) coding. Our approach is grounded on a network utility maximization (NUM) formulation of the problem. We propose two practical schemes, I2NC-state and I2NC-stateless, which mimic the structure of the NUM optimal solution. We also address the interaction of our approach with the transport layer. We demonstrate the benefits of our schemes through simulations.
Submitted 23 February, 2012; v1 submitted 31 August, 2010;
originally announced August 2010.