-
Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization
Authors:
Hamidreza Almasi,
Harsh Mishra,
Balajee Vamanan,
Sathya N. Ravi
Abstract:
Modern ML applications increasingly rely on complex deep learning models and large datasets. There has been an exponential growth in the amount of computation needed to train the largest models. Therefore, to scale computation and data, these models are inevitably trained in a distributed manner in clusters of nodes, and their updates are aggregated before being applied to the model. However, a distributed setup is prone to Byzantine failures of individual nodes, components, and software. With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems. We define the quality of workers as reconstruction ratios $\in (0,1]$, and formulate aggregation as a Maximum Likelihood Estimation procedure using Beta densities. We show that the regularized form of the log-likelihood with respect to the subspace can be approximately solved using an iterative least squares solver, and provide convergence guarantees using recent Convex Optimization landscape results. Our empirical findings demonstrate that our approach significantly enhances the robustness of state-of-the-art Byzantine resilient aggregators. We evaluate our method in a distributed setup with a parameter server, and show simultaneous improvements in communication efficiency and accuracy across various tasks. The code is publicly available at https://github.com/hamidralmasi/FlagAggregator
Submitted 24 September, 2023; v1 submitted 12 February, 2023;
originally announced February 2023.
-
Modeling Performance and Energy trade-offs in Online Data-Intensive Applications
Authors:
Ajay Badita,
Rooji Jinan,
Balajee Vamanan,
Parimal Parag
Abstract:
We consider energy minimization for data-intensive applications run on a large number of servers, for given performance guarantees. We consider a system where each incoming application is sent to a set of servers, and is considered to be completed if a subset of them finish serving it. We consider a simple case when each server core has two speed levels, where the higher speed can be achieved by higher power for each core independently. The core selects one of the two speeds probabilistically for each incoming application request. We model arrival of application requests by a Poisson process, and random service time at the server with independent exponential random variables. Our model and analysis generalize to today's state-of-the-art in CPU energy management where each core can independently select a speed level from a set of supported speeds and corresponding voltages. The performance metrics under consideration are the mean number of applications in the system and the average energy expenditure. We first provide a tight approximation to study this previously intractable problem and derive closed-form approximate expressions for the performance metrics when service times are exponentially distributed. Next, we study the trade-off between the approximate mean number of applications and energy expenditure in terms of the switching probability.
Submitted 18 August, 2021;
originally announced August 2021.
-
Pulser: Fast Congestion Response using Explicit Incast Notifications for Datacenter Networks
Authors:
Hamidreza Almasi,
Hamed Rezaei,
Muhammad Usama Chaudhry,
Balajee Vamanan
Abstract:
Datacenter applications frequently cause incast congestion, which degrades both flow completion times of short flows and throughput of long flows. Without isolating incast, existing congestion control schemes (e.g., DCTCP) rely on the existing ECN signal to react to general congestion, and they lose performance due to their slow, cautious, and inaccurate reaction to incast. We propose to isolate incast using Explicit Incast Notifications (EIN) that are generated by switches, similar to ECN. Our incast detection is fast and accurate. Further, we present our congestion control scheme, called Pulser, which drastically backs off during incast based on EIN, but restores the sending rate once incast ends. Our real experiments and ns-3 simulations show that Pulser outperforms prior schemes, DCTCP and ICTCP, in both flow completion times and throughput.
Submitted 2 October, 2018; v1 submitted 25 September, 2018;
originally announced September 2018.
-
Slytherin: Dynamic, Network-assisted Prioritization of Tail Packets in Datacenter Networks
Authors:
Hamed Rezaei,
Mojtaba Malekpourshahraki,
Balajee Vamanan
Abstract:
Datacenter applications demand both low latency and high throughput; while interactive applications (e.g., Web Search) demand low tail latency for their short messages due to their partition-aggregate software architecture, many data-intensive applications (e.g., Map-Reduce) require high throughput for long flows as they move vast amounts of data across the network. Recent proposals improve latency of short flows and throughput of long flows by addressing the shortcomings of existing packet scheduling and congestion control algorithms, respectively. We make the key observation that long tails in the Flow Completion Times (FCT) of short flows result from packets that suffer congestion at more than one switch along their paths in the network. Our proposal, Slytherin, specifically targets packets that suffered from congestion at multiple points and prioritizes them in the network. Slytherin leverages the ECN mechanism, which is widely used in existing datacenters, to identify such tail packets and dynamically prioritizes them using existing priority queues. As compared to existing state-of-the-art packet scheduling proposals, Slytherin achieves 18.6% lower 99th percentile flow completion times for short flows without any loss of throughput. Further, Slytherin drastically reduces 99th percentile queue length in switches by a factor of about 2x on average.
Submitted 5 July, 2018;
originally announced July 2018.
-
Dart: Divide and Specialize for Fast Response to Congestion in RDMA-based Datacenter Networks
Authors:
Jiachen Xue,
Muhammad Usama Chaudhry,
Balajee Vamanan,
T. N. Vijaykumar,
Mithuna Thottethodi
Abstract:
Though Remote Direct Memory Access (RDMA) promises to reduce datacenter network latencies significantly compared to TCP (e.g., 10x), end-to-end congestion control in the presence of incasts is a challenge. Targeting the full generality of the congestion problem, previous schemes rely on slow, iterative convergence to the appropriate sending rates (e.g., TIMELY takes 50 RTTs). Several papers have shown that even in oversubscribed datacenter networks most congestion occurs at the receiver. Accordingly, we propose a divide-and-specialize approach, called Dart, which isolates the common case of receiver congestion and further subdivides the remaining in-network congestion into the simpler spatially-localized and the harder spatially-dispersed cases. For receiver congestion, we propose direct apportioning of sending rates (DASR) in which a receiver for n senders directs each sender to cut its rate by a factor of n, converging in only one RTT. For the spatially-localized case, Dart provides fast (under one RTT) response by adding novel switch hardware for in-order flow deflection (IOFD) because RDMA disallows packet reordering on which previous load balancing schemes rely. For the uncommon spatially-dispersed case, Dart falls back to DCQCN. Small-scale testbed measurements and at-scale simulations, respectively, show that Dart achieves 60% (2.5x) and 79% (4.8x) lower 99th-percentile latency, and similar and 58% higher throughput than InfiniBand, and TIMELY and DCQCN.
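As an illustration only, the receiver-side apportioning described above might be sketched as follows. The abstract states just that a receiver with n senders directs each sender to cut its rate by a factor of n; this sketch additionally assumes every sender starts at the receiver's line rate and that the split is even. The function name `apportion_rates` and its parameters are hypothetical and do not come from the paper.

```python
def apportion_rates(line_rate_gbps: float, senders: list[str]) -> dict[str, float]:
    """Return a per-sender rate so that the aggregate matches the receiver's line rate.

    Assumption (not from the abstract): each sender was transmitting at the
    full line rate, so cutting by a factor of n means line_rate / n each.
    """
    if not senders:
        return {}
    share = line_rate_gbps / len(senders)
    # The receiver would send this rate to each sender in a single message,
    # which is why no iterative convergence is needed in this simplified model.
    return {sender: share for sender in senders}


rates = apportion_rates(40.0, ["s1", "s2", "s3", "s4"])
```

Because the receiver sees all of its senders, it can compute every share in one step, which fits the one-RTT convergence the abstract reports.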
Submitted 30 December, 2019; v1 submitted 28 May, 2018;
originally announced May 2018.
-
Hydra: Leveraging Functional Slicing for Efficient Distributed SDN Controllers
Authors:
Yiyang Chang,
Ashkan Rezaei,
Balajee Vamanan,
Jahangir Hasan,
Sanjay Rao,
T. N. Vijaykumar
Abstract:
The conventional approach to scaling Software Defined Networking (SDN) controllers today is to partition switches based on network topology, with each partition being controlled by a single physical controller, running all SDN applications. However, topological partitioning is limited by the fact that (i) performance of latency-sensitive (e.g., monitoring) SDN applications associated with a given partition may be impacted by co-located compute-intensive (e.g., route computation) applications; (ii) simultaneously achieving low convergence time and response times might be challenging; and (iii) communication between instances of an application across partitions may increase latencies. To tackle these issues, in this paper, we explore functional slicing, a complementary approach to scaling, where multiple SDN applications belonging to the same topological partition may be placed in physically distinct servers. We present Hydra, a framework for distributed SDN controllers based on functional slicing. Hydra chooses partitions based on convergence time as the primary metric, but places application instances across partitions in a manner that keeps response times low while considering communication between applications of a partition, and instances of an application across partitions. Evaluations using the Floodlight controller show the importance and effectiveness of Hydra in simultaneously keeping convergence times on failures small, while sustaining higher throughput per partition and ensuring responsiveness to latency-sensitive applications.
Submitted 22 September, 2016;
originally announced September 2016.
-
MigrantStore: Leveraging Virtual Memory in DRAM-PCM Memory Architecture
Authors:
Hamza Bin Sohail,
Balajee Vamanan,
T. N. Vijaykumar
Abstract:
With the imminent slowing down of DRAM scaling, Phase Change Memory (PCM) is emerging as a lead alternative for main memory technology. While PCM achieves low energy due to various technology-specific advantages, PCM is significantly slower than DRAM (especially for writes) and can endure far fewer writes before wearing out. Previous work has proposed to use a large, DRAM-based hardware cache to absorb writes and provide faster access. However, due to ineffectual caching where blocks are evicted before a sufficient number of accesses, hardware caches incur significant overheads in energy and bandwidth, two key but scarce resources in modern multicores. Because using hardware for detecting and removing such ineffectual caching would incur additional hardware cost and complexity, we leverage the OS virtual memory support for this purpose. We propose a DRAM-PCM hybrid memory architecture where the OS migrates pages on demand from the PCM to DRAM. We call the DRAM part of our memory MigrantStore, which includes two ideas. First, to reduce the energy, bandwidth, and wear overhead of ineffectual migrations, we propose migration hysteresis. Second, to reduce the software overhead of good replacement policies, we propose a recently-accessed-page-id (RAPid) buffer, a hardware buffer to track the addresses of recently-accessed MigrantStore pages.
Submitted 16 April, 2015;
originally announced April 2015.
-
TimeTrader: Exploiting Latency Tail to Save Datacenter Energy for On-line Data-Intensive Applications
Authors:
Balajee Vamanan,
Hamza Bin Sohail,
Jahangir Hasan,
T. N. Vijaykumar
Abstract:
Datacenters running on-line, data-intensive applications (OLDIs) consume significant amounts of energy. However, reducing their energy is challenging due to their tight response time requirements. A key aspect of OLDIs is that each user query goes to all or many of the nodes in the cluster, so that the overall time budget is dictated by the tail of the replies' latency distribution; replies see latency variations both in the network and compute. Previous work proposes to achieve load-proportional energy by slowing down the computation at lower datacenter loads based directly on response times (i.e., at lower loads, the proposal exploits the average slack in the time budget provisioned for the peak load). In contrast, we propose TimeTrader to reduce energy by exploiting the latency slack in the sub-critical replies which arrive before the deadline (e.g., 80% of replies are 3-4x faster than the tail). This slack is present at all loads and subsumes the previous work's load-related slack. While the previous work shifts the leaves' response time distribution to consume the slack at lower loads, TimeTrader reshapes the distribution at all loads by slowing down individual sub-critical nodes without increasing missed deadlines. TimeTrader exploits slack in both the network and compute budgets. Further, TimeTrader leverages Earliest Deadline First scheduling to largely decouple critical requests from the queuing delays of sub-critical requests which can then be slowed down without hurting critical requests. A combination of real-system measurements and at-scale simulations shows that without adding to missed deadlines, TimeTrader saves 15-19% and 41-49% energy at 90% and 30% loading, respectively, in a datacenter with 512 nodes, whereas previous work saves 0% and 31-37%.
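For readers unfamiliar with Earliest Deadline First scheduling, the generic policy can be sketched with a min-heap keyed on deadline. This is a textbook illustration, not TimeTrader's implementation; the `Request` class, the `edf_order` function, and the millisecond deadlines are hypothetical names and values chosen for the example.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Only the deadline takes part in ordering; the name is a label.
    deadline_ms: float
    name: str = field(compare=False)


def edf_order(requests: list[Request]) -> list[str]:
    """Return request names in the order an EDF scheduler would serve them."""
    heap = list(requests)
    heapq.heapify(heap)  # min-heap: the earliest deadline sits at the root
    served = []
    while heap:
        served.append(heapq.heappop(heap).name)
    return served


queue = [
    Request(30.0, "sub-critical"),
    Request(5.0, "critical"),
    Request(12.0, "moderate"),
]
order = edf_order(queue)
```

Under this policy a request with a near deadline is served ahead of requests with later deadlines regardless of arrival order, which is the property that lets slowed-down sub-critical requests wait without delaying critical ones.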
Submitted 18 March, 2015;
originally announced March 2015.