Search | arXiv e-print repository

Resilience of the Electric Grid through Trustable IoT-Coordinated Assets

Authors: Vineet J. Nair, Venkatesh Venkataramanan, Priyank Srivastava, Partha S. Sarker, Anurag Srivastava, Laurentiu D. Marinovici, Jun Zha, Christopher Irwin, Prateek Mittal, John Williams, H. Vincent Poor, Anuradha M. Annaswamy

Abstract: The electricity grid has evolved from a physical system to a cyber-physical system with digital devices that perform measurement, control, communication, computation, and actuation. The increased penetration of distributed energy resources (DERs) that include renewable generation, flexible loads, and storage provides extraordinary opportunities for improvements in efficiency and sustainability. However, they can introduce new vulnerabilities in the form of cyberattacks, which can cause significant challenges in ensuring grid resilience. We propose a framework in this paper for achieving grid resilience through suitably coordinated assets including a network of Internet of Things (IoT) devices. A local electricity market is proposed to identify trustable assets and carry out this coordination. Situational Awareness (SA) of locally available DERs with the ability to inject power or reduce consumption is enabled by the market, together with a monitoring procedure for their trustability and commitment. With this SA, we show that a variety of cyberattacks can be mitigated using local trustable resources without stressing the bulk grid. The demonstrations are carried out using a variety of platforms with a high-fidelity co-simulation platform, real-time hardware-in-the-loop validation, and a utility-friendly simulator.

Submitted 21 June, 2024; originally announced June 2024.

Comments: Submitted to the Proceedings of the National Academy of Sciences (PNAS), under review

arXiv:2404.14632 [pdf, other]

Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training

Authors: Muhammad Adnan, Amar Phanishayee, Janardhan Kulkarni, Prashant J. Nair, Divya Mahajan

Abstract: In this paper, we present a novel technique to search for hardware architectures of accelerators optimized for end-to-end training of deep neural networks (DNNs). Our approach addresses both single-device and distributed pipeline and tensor model parallel scenarios, the latter being addressed for the first time. The search optimizes accelerators for training-relevant metrics such as throughput/TDP under fixed area and power constraints. However, with the proliferation of specialized architectures and complex distributed training mechanisms, the design space of hardware accelerators is very large. Prior work in this space has tried to tackle this by reducing the search space either to a single accelerator execution, and only for inference, or to tuning the architecture for specific layers (e.g., convolution). Instead, we take a unique heuristic-based, critical-path-based approach to determine the best use of available resources (power and area) either for a set of DNN workloads or for each workload individually. First, we perform a local search to determine the architecture for each pipeline and tensor model stage. Specifically, the system iteratively generates architectural configurations and tunes the design using a novel heuristic-based approach that prioritizes accelerator resources and scheduling for critical operators in a machine learning workload. Second, to address the complexities of distributed training, the local search selects multiple (k) designs per stage. A global search then identifies an accelerator from the top-k sets to optimize training throughput across the stages. We evaluate this work on 11 different DNN models. Compared to Spotlight, a recent inference-only work, our method converges to a design in, on average, 31x less time and offers 12x higher throughput. Moreover, designs generated using our method achieve a 12% throughput improvement over the TPU architecture.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.04270 [pdf, other]

Accelerating Recommender Model Training by Dynamically Skipping Stale Embeddings

Authors: Yassaman Ebrahimzadeh Maboud, Muhammad Adnan, Divya Mahajan, Prashant J. Nair

Abstract: Training recommendation models poses significant challenges regarding resource utilization and performance. Prior research has proposed an approach that categorizes embeddings into popular and non-popular classes to reduce the training time for recommendation models. We observe that, even among the popular embeddings, certain embeddings undergo rapid training and exhibit minimal subsequent variation, resulting in saturation. Consequently, updates to these embeddings do not contribute to model quality. This paper presents Slipstream, a software framework that identifies stale embeddings on the fly and skips their updates to enhance performance. This capability enables Slipstream to achieve substantial speedup, optimize CPU-GPU bandwidth usage, and eliminate unnecessary memory access. Slipstream showcases training time reductions of 2x, 2.4x, 1.2x, and 1.175x across real-world datasets and configurations, compared to Baseline XDL, Intel-optimized DLRM, FAE, and Hotline, respectively.

Submitted 21 March, 2024; originally announced April 2024.

arXiv:2403.17480 [pdf, other]

Capacity Provisioning Motivated Online Non-Convex Optimization Problem with Memory and Switching Cost

Authors: Rahul Vaze, Jayakrishnan Nair

Abstract: An online non-convex optimization problem is considered where the goal is to minimize the flow time (total delay) of a set of jobs by modulating the number of active servers, but with a switching cost associated with changing the number of active servers over time. Each job can be processed by at most one fixed speed server at any time. Compared to the usual online convex optimization (OCO) problem with switching cost, the objective function considered is non-convex and more importantly, at each time, it depends on all past decisions and not just the present one. Both worst-case and stochastic inputs are considered; for both cases, competitive algorithms are derived.

Submitted 1 July, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.09054 [pdf, other]

Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference

Authors: Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath

Abstract: Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces "Keyformer", an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. Keyformer's reduction of KV cache reduces inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model's accuracy.

Submitted 5 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

MSC Class: 68U35

ACM Class: I.2.7; C.0

Journal ref: Proceedings of the 7th Annual Conference on Machine Learning and Systems (MLSys), 2024

arXiv:2309.12687 [pdf, other]

Fixed confidence community mode estimation

Authors: Meera Pai, Nikhil Karamchandani, Jayakrishnan Nair

Abstract: Our aim is to estimate the largest community (a.k.a., mode) in a population composed of multiple disjoint communities. This estimation is performed in a fixed confidence setting via sequential sampling of individuals with replacement. We consider two sampling models: (i) an identityless model, wherein only the community of each sampled individual is revealed, and (ii) an identity-based model, wherein the learner is able to discern whether or not each sampled individual has been sampled before, in addition to the community of that individual. The former model corresponds to the classical problem of identifying the mode of a discrete distribution, whereas the latter seeks to capture the utility of identity information in mode estimation. For each of these models, we establish information theoretic lower bounds on the expected number of samples needed to meet the prescribed confidence level, and propose sound algorithms with a sample complexity that is provably asymptotically optimal. Our analysis highlights that identity information can indeed be utilized to improve the efficiency of community mode estimation.

Submitted 22 September, 2023; originally announced September 2023.

Comments: To appear in Performance Evaluation

arXiv:2308.14902 [pdf, other]

Ad-Rec: Advanced Feature Interactions to Address Covariate-Shifts in Recommendation Networks

Authors: Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, Prashant J. Nair

Abstract: Recommendation models are vital in delivering personalized user experiences by leveraging the correlation between multiple input features. However, deep learning-based recommendation models often face challenges due to evolving user behaviour and item features, leading to covariate shifts. Effective cross-feature learning is crucial for handling data distribution drift and adapting to changing user behaviour. Traditional feature interaction techniques have limitations in achieving optimal performance in this context. This work introduces Ad-Rec, an advanced network that leverages feature interaction techniques to address covariate shifts. This helps eliminate irrelevant interactions in recommendation tasks. Ad-Rec leverages masked transformers to enable the learning of higher-order cross-features while mitigating the impact of data distribution drift. Our approach improves model quality, accelerates convergence, and reduces training time, as measured by the Area Under Curve (AUC) metric. We demonstrate the scalability of Ad-Rec and its ability to achieve superior model quality through comprehensive ablation studies.

Submitted 28 August, 2023; originally announced August 2023.

arXiv:2307.02623 [pdf, other]

FLuID: Mitigating Stragglers in Federated Learning using Invariant Dropout

Authors: Irene Wang, Prashant J. Nair, Divya Mahajan

Abstract: Federated Learning (FL) allows machine learning models to train locally on individual mobile devices, synchronizing model updates via a shared server. This approach safeguards user privacy; however, it also generates a heterogeneous training environment due to the varying performance capabilities across devices. As a result, straggler devices with lower performance often dictate the overall training time in FL. In this work, we aim to alleviate this performance bottleneck due to stragglers by dynamically balancing the training load across the system. We introduce Invariant Dropout, a method that extracts a sub-model based on the weight update threshold, thereby minimizing potential impacts on accuracy. Building on this dropout technique, we develop an adaptive training framework, Federated Learning using Invariant Dropout (FLuID). FLuID offers a lightweight sub-model extraction to regulate computational intensity, thereby reducing the load on straggler devices without affecting model quality. Our method leverages neuron updates from non-straggler devices to construct a tailored sub-model for each straggler based on client performance profiling. Furthermore, FLuID can dynamically adapt to changes in stragglers as runtime conditions shift. We evaluate FLuID using five real-world mobile clients. The evaluations show that Invariant Dropout maintains baseline model efficiency while alleviating the performance bottleneck of stragglers through a dynamic, runtime approach.

Submitted 26 September, 2023; v1 submitted 5 July, 2023; originally announced July 2023.

Comments: Accepted at the 37th Conference on Neural Information Processing Systems (NeurIPS), 2023

arXiv:2305.06082 [pdf, ps, other]

Best Arm Identification in Bandits with Limited Precision Sampling

Authors: Kota Srinivas Reddy, P. N. Karthik, Nikhil Karamchandani, Jayakrishnan Nair

Abstract: We study best arm identification in a variant of the multi-armed bandit problem where the learner has limited precision in arm selection. The learner can only sample arms via certain exploration bundles, which we refer to as boxes. In particular, at each sampling epoch, the learner selects a box, which in turn causes an arm to get pulled as per a box-specific probability distribution. The pulled arm and its instantaneous reward are revealed to the learner, whose goal is to find the best arm by minimising the expected stopping time, subject to an upper bound on the error probability. We present an asymptotic lower bound on the expected stopping time, which holds as the error probability vanishes. We show that the optimal allocation suggested by the lower bound is, in general, non-unique and therefore challenging to track. We propose a modified tracking-based algorithm to handle non-unique optimal allocations, and demonstrate that it is asymptotically optimal. We also present non-asymptotic lower and upper bounds on the stopping time in the simpler setting when the arms accessible from one box do not overlap with those of others.

Submitted 10 May, 2023; originally announced May 2023.

Comments: ISIT 2023

arXiv:2304.12902 [pdf, other]

On the ubiquity of duopolies in constant sum congestion games

Authors: Shiksha Singhal, Veeraruna Kavitha, Jayakrishnan Nair

Abstract: We analyse a coalition formation game between strategic service providers of a congestible service. The key novelty of our formulation is that it is a constant sum game, i.e., the total payoff across all service providers (or coalitions of providers) is fixed, and dictated by the size of the market. The game thus captures the tension between resource pooling (to benefit from the resulting statistical economies of scale) and competition between coalitions over market share. In a departure from the prior literature on resource pooling for congestible services, we show that the grand coalition is in general not stable, once we allow for competition over market share. In fact, under classical notions of stability (defined via blocking by any coalition), we show that no partition is stable. This motivates us to introduce more restricted (and relevant) notions of blocking; interestingly, we find that the stable configurations under these novel notions of stability are duopolies, where the dominant coalition exploits its economies of scale to corner a disproportionate market share. Furthermore, we completely characterise the stable duopolies in heavy and light traffic regimes.

Submitted 25 April, 2023; originally announced April 2023.

Comments: arXiv admin note: text overlap with arXiv:2109.12840

arXiv:2302.06591 [pdf, other]

doi 10.1109/CCTA54093.2023.10253334

Local retail electricity markets for distribution grid services

Authors: Vineet Jagadeesan Nair, Anuradha Annaswamy

Abstract: We propose a hierarchical local electricity market (LEM) at the primary and secondary feeder levels in a distribution grid, to optimally coordinate and schedule distributed energy resources (DERs) and provide valuable grid services like voltage control. At the primary level, we use a current injection-based model that is valid for both radial and meshed, balanced and unbalanced, multi-phase systems. The primary and secondary markets leverage the flexibility offered by DERs to optimize grid operation and maximize social welfare. Numerical simulations on an IEEE 123-bus system modified to include DERs show that the LEM successfully achieves voltage control and reduces overall network costs, while also allowing us to decompose the price and value associated with different grid services so as to accurately compensate DERs.

Submitted 11 July, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

Comments: 9 pages, 13 figures, Accepted to the 7th IEEE Conference on Control Technology and Applications (CCTA) 2023

arXiv:2212.12613 [pdf, other]

doi 10.1109/HPCA56546.2023.10070999

Scalable and Secure Row-Swap: Efficient and Safe Row Hammer Mitigation in Memory Systems

Authors: Jeonghyun Woo, Gururaj Saileshwar, Prashant J. Nair

Abstract: As Dynamic Random Access Memories (DRAM) scale, they are becoming increasingly susceptible to Row Hammer. By rapidly activating rows of DRAM cells (aggressor rows), attackers can exploit inter-cell interference through Row Hammer to flip bits in neighboring rows (victim rows). A recent work, called Randomized Row-Swap (RRS), proposed proactively swapping aggressor rows with randomly selected rows before an aggressor row can cause Row Hammer. Our paper observes that RRS is neither secure nor scalable. We first propose the 'Juggernaut attack pattern' that breaks RRS in under 1 day. Juggernaut exploits the fact that the mitigative action of RRS, a swap operation, can itself induce additional target row activations, defeating such a defense. Second, this paper proposes a new defense, the Secure Row-Swap mechanism, that avoids the additional activations from swap (and unswap) operations and protects against Juggernaut. Furthermore, this paper extends Secure Row-Swap with attack detection to defend against even future attacks. While this provides better security, it also allows for securely reducing the frequency of swaps, thereby enabling Scalable and Secure Row-Swap. The Scalable and Secure Row-Swap mechanism provides years of Row Hammer protection with 3.3X lower storage overheads as compared to the RRS design. It incurs only a 0.7% slowdown as compared to a not-secure baseline for a Row Hammer threshold of 1200.

Submitted 23 December, 2022; originally announced December 2022.

Journal ref: The 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA 2023)

arXiv:2211.14768 [pdf, ps, other]

Constrained Pure Exploration Multi-Armed Bandits with a Fixed Budget

Authors: Fathima Zarin Faizal, Jayakrishnan Nair

Abstract: We consider a constrained, pure exploration, stochastic multi-armed bandit formulation under a fixed budget. Each arm is associated with an unknown, possibly multi-dimensional distribution and is described by multiple attributes that are a function of this distribution. The aim is to optimize a particular attribute subject to user-defined constraints on the other attributes. This framework models applications such as financial portfolio optimization, where it is natural to perform risk-constrained maximization of mean return. We assume that the attributes can be estimated using samples from the arms' distributions and that these estimators satisfy suitable concentration inequalities. We propose an algorithm called Constrained-SR based on the Successive Rejects framework, which recommends an optimal arm and flags the instance as being feasible or infeasible. A key feature of this algorithm is that it is designed on the basis of an information theoretic lower bound for two-armed instances. We characterize an instance-dependent upper bound on the probability of error under Constrained-SR, that decays exponentially with respect to the budget. We further show that the associated decay rate is nearly optimal relative to an information theoretic lower bound in certain special cases.

Submitted 27 November, 2022; originally announced November 2022.

Comments: 14 pages

arXiv:2207.11656 [pdf, other]

Non-asymptotic near optimal algorithms for two sided matchings

Authors: Rahul Vaze, Jayakrishnan Nair

Abstract: A two-sided matching system is considered, where servers are assumed to arrive at a fixed rate, while the arrival rate of customers is modulated via a price-control mechanism. We analyse a loss model, wherein customers who are not served immediately upon arrival get blocked, as well as a queueing model, wherein customers wait in a queue until they receive service. The objective is to maximize the platform profit generated from matching servers and customers, subject to quality of service constraints, such as the expected wait time of servers in the loss system model, and the stability of the customer queue in the queuing model. For the loss system, subject to a certain relaxation, we show that the optimal policy has a bang-bang structure. We also derive approximation guarantees for simple pricing policies. For the queueing system, we propose a simple bi-modal matching strategy and show that it achieves near optimal profit.

Submitted 24 July, 2022; originally announced July 2022.

arXiv:2207.10793 [pdf, other]

doi 10.1145/3630614.3630616

The Dirty Secret of SSDs: Embodied Carbon

Authors: Swamit Tannu, Prashant J. Nair

Abstract: Scalable Solid-State Drives (SSDs) have ushered in a transformative era in data storage and accessibility, spanning both data centers and portable devices. However, the strides made in scaling this technology can bear significant environmental consequences. On a global scale, a notable portion of semiconductor manufacturing relies on electricity derived from coal and natural gas sources. A striking example of this is the manufacturing process for a single Gigabyte of Flash memory, which emits approximately 0.16 kg of CO2 - a considerable fraction of the total carbon emissions attributed to the system. Remarkably, the manufacturing of storage devices alone contributed to an estimated 20 million metric tonnes of CO2 emissions in the year 2021. In light of these environmental concerns, this paper delves into an analysis of the sustainability trade-offs inherent in Solid-State Drives (SSDs) when compared to traditional Hard Disk Drives (HDDs). Moreover, this study proposes methodologies to gauge the embodied carbon costs associated with storage systems effectively. The research encompasses four key strategies to enhance the sustainability of storage systems. In summation, this paper critically addresses the embodied carbon issues associated with SSDs, comparing them with HDDs, and proposes a comprehensive framework of strategies to enhance the sustainability of storage systems.

Submitted 28 September, 2023; v1 submitted 8 July, 2022; originally announced July 2022.

Journal ref: Energy Informatics Review (Volume 3 Issue 3, October 2023)

arXiv:2207.09372 [pdf, other]

On Decentralizing Federated Reinforcement Learning in Multi-Robot Scenarios

Authors: Jayprakash S. Nair, Divya D. Kulkarni, Ajitem Joshi, Sruthy Suresh

Abstract: Federated Learning (FL) allows for collaboratively aggregating learned information across several computing devices and sharing it amongst them, thereby tackling issues of privacy and the need for large bandwidth. FL techniques generally use a central server or cloud for aggregating the models received from the devices. Such centralized FL techniques suffer from inherent problems such as failure of the central node and bottlenecks in channel bandwidth. When FL is used in conjunction with connected robots serving as devices, a failure of the central controlling entity can lead to a chaotic situation. This paper describes a mobile agent based paradigm to decentralize FL in multi-robot scenarios. Using Webots, a popular free open-source robot simulator, and Tartarus, a mobile agent platform, we present a methodology to decentralize federated learning in a set of connected robots. With Webots running on different connected computing systems, we show how mobile agents can perform the task of Decentralized Federated Reinforcement Learning (dFRL). Results obtained from experiments carried out using Q-learning and SARSA, by aggregating their corresponding Q-tables, show the viability of using decentralized FL in the domain of robotics. Since the proposed work can be used in conjunction with other learning algorithms and also real robots, it can act as a vital tool for the study of decentralized FL using heterogeneous learning algorithms concurrently in multi-robot scenarios.

Submitted 7 September, 2022; v1 submitted 19 July, 2022; originally announced July 2022.

Comments: Submitted to SEEDA 2022. This arxiv is a preprint and NOT the final version

arXiv:2207.01988 [pdf, other]

Unsupervised Crowdsourcing with Accuracy and Cost Guarantees

Authors: Yashvardhan Didwania, Jayakrishnan Nair, N. Hemachandra

Abstract: We consider the problem of cost-optimal utilization of a crowdsourcing platform for binary, unsupervised classification of a collection of items, given a prescribed error threshold. Workers on the crowdsourcing platform are assumed to be divided into multiple classes, based on their skill, experience, and/or past performance. We model each worker class via an unknown confusion matrix, and a (known) price to be paid per label prediction. For this setting, we propose algorithms for acquiring label predictions from workers, and for inferring the true labels of items. We prove that if the number of (unlabeled) items available is large enough, our algorithms satisfy the prescribed error thresholds, incurring a cost that is near-optimal. Finally, we validate our algorithms, and some heuristics inspired by them, through an extensive case study.

Submitted 5 July, 2022; originally announced July 2022.

Comments: To be presented at WiOpt 2022

arXiv:2204.05436 [pdf, other]

Heterogeneous Acceleration Pipeline for Recommendation System Training

Authors: Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, Prashant J. Nair

Abstract: Recommendation models rely on deep learning networks and large embedding tables, resulting in computationally and memory-intensive processes. These models are typically trained using hybrid CPU-GPU or GPU-only configurations. The hybrid mode combines the GPU's neural network acceleration with the CPUs' memory storage and supply for embedding tables but may incur significant CPU-to-GPU transfer time. In contrast, the GPU-only mode utilizes High Bandwidth Memory (HBM) across multiple GPUs for storing embedding tables. However, this approach is expensive and presents scaling concerns. This paper introduces Hotline, a heterogeneous acceleration pipeline that addresses these concerns. Hotline develops a data-aware and model-aware scheduling pipeline by leveraging the insight that only a few embedding entries are frequently accessed (popular). This approach utilizes CPU main memory for non-popular embeddings and GPUs' HBM for popular embeddings. To achieve this, Hotline accelerator fragments a mini-batch into popular and non-popular micro-batches. It gathers the necessary working parameters for non-popular micro-batches from the CPU, while GPUs execute popular micro-batches. The hardware accelerator dynamically coordinates the execution of popular embeddings on GPUs and non-popular embeddings from the CPU's main memory. Real-world datasets and models confirm Hotline's effectiveness, reducing average end-to-end training time by 2.2x compared to Intel-optimized CPU-GPU DLRM baseline.

Submitted 28 April, 2024; v1 submitted 11 April, 2022; originally announced April 2022.

Comments: Accepted at The International Symposium on Computer Architecture (ISCA), 2024

arXiv:2111.08535 [pdf, other]

Sequential Community Mode Estimation

Authors: Shubham Anand Jain, Shreyas Goenka, Divyam Bapna, Nikhil Karamchandani, Jayakrishnan Nair

Abstract: We consider a population, partitioned into a set of communities, and study the problem of identifying the largest community within the population via sequential, random sampling of individuals. There are multiple sampling domains, referred to as \emph{boxes}, which also partition the population. Each box may consist of individuals of different communities, and each community may in turn be spread across multiple boxes. The learning agent can, at any time, sample (with replacement) a random individual from any chosen box; when this is done, the agent learns the community the sampled individual belongs to, and also whether or not this individual has been sampled before. The goal of the agent is to minimize the probability of mis-identifying the largest community in a \emph{fixed budget} setting, by optimizing both the sampling strategy as well as the decision rule. We propose and analyse novel algorithms for this problem, and also establish information theoretic lower bounds on the probability of error under any algorithm. In several cases of interest, the exponential decay rates of the probability of error under our algorithms are shown to be optimal up to constant factors. The proposed algorithms are further validated via simulations on real-world datasets.

Submitted 16 November, 2021; originally announced November 2021.

Comments: Presented in part at Performance'21. Full version in Elsevier Performance Evaluation, Dec. 21

arXiv:2109.12840 [pdf, other]

Coalition Formation in Constant Sum Queueing Games

Authors: Shiksha Singhal, Veeraruna Kavitha, Jayakrishnan Nair

Abstract: We analyse a coalition formation game between strategic service providers of a congestible service. The key novelty of our formulation is that it is a constant sum game, i.e., the total payoff across all service providers (or coalitions of providers) is fixed, and dictated by the total size of the market. The game thus captures the tension between resource pooling (to benefit from the resulting statistical economies of scale) and competition between coalitions over market share. In a departure from the prior literature on resource pooling for congestible services, we show that the grand coalition is in general not stable, once we allow for competition over market share. Instead, the stable configurations are duopolies, where the dominant coalition exploits its economies of scale to corner a disproportionate market share. We analyse the stable duopolies that emerge from this interaction, and also study a dynamic variant of this game.

Submitted 27 September, 2021; originally announced September 2021.

Comments: 15 pages, 3 figures

arXiv:2109.07137 [pdf, other]

Optimal Cycling of a Heterogenous Battery Bank via Reinforcement Learning

Authors: Vivek Deulkar, Jayakrishnan Nair

Abstract: We consider the problem of optimal charging/discharging of a bank of heterogenous battery units, driven by stochastic electricity generation and demand processes. The batteries in the battery bank may differ with respect to their capacities, ramp constraints, losses, as well as cycling costs. The goal is to minimize the degradation costs associated with battery cycling in the long run; this is posed formally as a Markov decision process. We propose a linear function approximation based Q-learning algorithm for learning the optimal solution, using a specially designed class of kernel functions that approximate the structure of the value functions associated with the MDP. The proposed algorithm is validated via an extensive case study.

Submitted 15 September, 2021; originally announced September 2021.

Comments: Appeared at the IEEE SmartGridComm 2021 conference

arXiv:2108.06935 [pdf, ps, other]

Speed Scaling with Multiple Servers Under A Sum Power Constraint

Authors: Rahul Vaze, Jayakrishnan Nair

Abstract: The problem of scheduling jobs and choosing their respective speeds with multiple servers under a sum power constraint to minimize the flow time + energy is considered. This problem is a generalization of the flow time minimization problem with multiple unit-speed servers, when jobs can be parallelized, however, with a sub-linear, concave speedup function $k^{1/α}, α>1$ when allocated $k$ servers, i.e., jobs experience diminishing returns from being allocated additional servers. When all jobs are available at time $0$, we show that a very simple algorithm EQUI, that processes all available jobs at the same speed is $\left(2-\frac{1}α\right) \frac{2}{\left(1-\left(\frac{1}α\right)\right)}$-competitive, while in the general case, when jobs arrive over time, an LCFS based algorithm is shown to have a constant (dependent only on $α$) competitive ratio.

Submitted 18 August, 2021; v1 submitted 16 August, 2021; originally announced August 2021.

Comments: To appear in Performance 2021
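
As a quick numerical reading of the EQUI bound quoted in the abstract above, the following minimal sketch evaluates the stated expression for a few values of α. The function name `equi_ratio` is hypothetical and simply transcribes the formula from the abstract; it is not code from the paper.

```python
def equi_ratio(alpha: float) -> float:
    """Evaluate (2 - 1/alpha) * 2 / (1 - 1/alpha), as quoted in the abstract.

    Hypothetical helper for illustration only; the abstract assumes alpha > 1.
    """
    if alpha <= 1:
        raise ValueError("the abstract assumes alpha > 1")
    inv = 1.0 / alpha
    return (2.0 - inv) * 2.0 / (1.0 - inv)


if __name__ == "__main__":
    # Print the bound for a few illustrative exponents.
    for alpha in (1.5, 2.0, 3.0):
        print(f"alpha={alpha}: ratio={equi_ratio(alpha):.3f}")
```

The expression grows without bound as α approaches 1 and tends to 4 as α becomes large.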

arXiv:2105.08967 [pdf, ps, other]

Speed Scaling On Parallel Servers with MapReduce Type Precedence Constraints

Authors: Rahul Vaze, Jayakrishnan Nair

Abstract: A multiple server setting is considered, where each server has tunable speed, and increasing the speed incurs an energy cost. Jobs arrive to a single queue, and each job has two types of sub-tasks, map and reduce, and a {\bf precedence} constraint among them: any reduce task of a job can only be processed once all the map tasks of the job have been completed. In addition to the scheduling problem, i.e., which task to execute on which server, with tunable speed, an additional decision variable is the choice of speed for each server, so as to minimize a linear combination of the sum of the flow times of jobs/tasks and the total energy cost. The precedence constraints present new challenges for the speed scaling problem with multiple servers, namely that the number of tasks that can be executed at any time may be small but the total number of outstanding tasks might be quite large. We present simple speed scaling algorithms that are shown to have competitive ratios, that depend on the power cost function, and/or the ratio of the size of the largest task and the shortest reduce task, but not on the number of jobs, or the number of servers.

Submitted 19 May, 2021; originally announced May 2021.

arXiv:2103.00686 [pdf, other]

doi 10.14778/3485450.3485462

Accelerating Recommendation System Training by Leveraging Popular Choices

Authors: Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, Prashant J. Nair

Abstract: Recommender models are commonly used to suggest relevant items to a user for e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representation of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models is evolving to require increasing data and compute resources. The highly parallel neural networks portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper deep dives into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed with a few embedding entries being accessed up to 10000x more. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reducing the data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the executions of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces the overall training time by 2.3x and 1.52x in comparison to XDL CPU-only and XDL CPU-GPU execution while maintaining baseline accuracy.

Submitted 28 September, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

ACM Class: I.2.6; C.5.0

Journal ref: Proceedings of the VLDB Endowment, 2022

arXiv:2008.13629 [pdf, other]

Statistically Robust, Risk-Averse Best Arm Identification in Multi-Armed Bandits

Authors: Anmol Kagrecha, Jayakrishnan Nair, Krishna Jagannathan

Abstract: Traditional multi-armed bandit (MAB) formulations usually make certain assumptions about the underlying arms' distributions, such as bounds on the support or their tail behaviour. Moreover, such parametric information is usually 'baked' into the algorithms. In this paper, we show that specialized algorithms that exploit such parametric information are prone to inconsistent learning performance when the parameter is misspecified. Our key contributions are twofold: (i) We establish fundamental performance limits of statistically robust MAB algorithms under the fixed-budget pure exploration setting, and (ii) We propose two classes of algorithms that are asymptotically near-optimal. Additionally, we consider a risk-aware criterion for best arm identification, where the objective associated with each arm is a linear combination of the mean and the conditional value at risk (CVaR). Throughout, we make a very mild 'bounded moment' assumption, which lets us work with both light-tailed and heavy-tailed distributions within a unified framework.

Submitted 27 March, 2022; v1 submitted 28 August, 2020; originally announced August 2020.

Comments: 21 pages. Preliminary version appeared in NeurIPS 2019. Accepted for publication at IEEE Transactions on Information Theory. arXiv admin note: text overlap with arXiv:1906.00569

arXiv:2006.12038 [pdf, ps, other]

Bandit algorithms: Letting go of logarithmic regret for statistical robustness

Authors: Kumar Ashutosh, Jayakrishnan Nair, Anmol Kagrecha, Krishna Jagannathan

Abstract: We study regret minimization in a stochastic multi-armed bandit setting and establish a fundamental trade-off between the regret suffered under an algorithm, and its statistical robustness. Considering broad classes of underlying arms' distributions, we show that bandit learning algorithms with logarithmic regret are always inconsistent and that consistent learning algorithms always suffer a super-logarithmic regret. This result highlights the inevitable statistical fragility of all `logarithmic regret' bandit algorithms available in the literature---for instance, if a UCB algorithm designed for $σ$-subGaussian distributions is used in a subGaussian setting with a mismatched variance parameter, the learning performance could be inconsistent. Next, we show a positive result: statistically robust and consistent learning performance is attainable if we allow the regret to be slightly worse than logarithmic. Specifically, we propose three classes of distribution oblivious algorithms that achieve an asymptotic regret that is arbitrarily close to logarithmic.

Submitted 22 June, 2020; originally announced June 2020.

arXiv:2006.09649 [pdf, other]

Constrained regret minimization for multi-criterion multi-armed bandits

Authors: Anmol Kagrecha, Jayakrishnan Nair, Krishna Jagannathan

Abstract: We consider a stochastic multi-armed bandit setting and study the problem of constrained regret minimization over a given time horizon. Each arm is associated with an unknown, possibly multi-dimensional distribution, and the merit of an arm is determined by several, possibly conflicting attributes. The aim is to optimize a 'primary' attribute subject to user-provided constraints on other 'secondary' attributes. We assume that the attributes can be estimated using samples from the arms' distributions, and that the estimators enjoy suitable concentration properties. We propose an algorithm called Con-LCB that guarantees a logarithmic regret, i.e., the average number of plays of all non-optimal arms is at most logarithmic in the horizon. The algorithm also outputs a Boolean flag that correctly identifies, with high probability, whether the given instance is feasible/infeasible with respect to the constraints. We also show that Con-LCB is optimal within a universal constant, i.e., that more sophisticated algorithms cannot do much better universally. Finally, we establish a fundamental trade-off between regret minimization and feasibility identification. Our framework finds natural applications, for instance, in financial portfolio optimization, where risk constrained maximization of expected return is meaningful.

Submitted 3 January, 2023; v1 submitted 17 June, 2020; originally announced June 2020.

Comments: 26 pages

arXiv:1912.10619 [pdf, other]

Adaptive flow-level scheduling for the IoT MAC

Authors: Pragya Sharma, Jayakrishnan Nair, Raman Singh

Abstract: Over the past decade, distributed CSMA, which forms the basis for WiFi, has been deployed ubiquitously to provide seamless and high-speed mobile internet access. However, distributed CSMA might not be ideal for future IoT/M2M applications, where the density of connected devices/sensors/controllers is expected to be orders of magnitude higher than that in present wireless networks. In such high-density networks, the overhead associated with completely distributed MAC protocols will become a bottleneck. Moreover, IoT communications are likely to have strict QoS requirements, for which the `best-effort' scheduling by present WiFi networks may be unsuitable. This calls for a clean-slate redesign of the wireless MAC taking into account the requirements for future IoT/M2M networks. In this paper, we propose a reservation-based (for minimal overhead) wireless MAC designed specifically with IoT/M2M applications in mind. The key features include: (i) flow-level, rather than packet level contention to minimize overhead, (ii) deadline aware, reservation based scheduling, and (iii) the ability to dynamically adapt the MAC parameters with changing workload.

Submitted 23 December, 2019; originally announced December 2019.

arXiv:1910.12894 [pdf, other]

Please come back later: Benefiting from deferrals in service systems

Authors: Anmol Kagrecha, Jayakrishnan Nair

Abstract: The performance evaluation of loss service systems, where customers who cannot be served upon arrival get dropped, has a long history going back to the classical Erlang B model. In this paper, we consider the performance benefits arising from the possibility of deferring customers who cannot be served upon arrival. Specifically, we consider an Erlang B type loss system where the system operator can, subject to certain constraints, ask a customer arriving when all servers are busy, to come back at a specified time in the future. If the system is still fully loaded when the deferred customer returns, she gets dropped for good. For such a system, we ask: How should the system operator determine the rearrival times of the deferred customers based on the state of the system (which includes those customers already deferred and yet to arrive)? How does one quantify the performance benefit of such a deferral policy? Our contributions are as follows. We propose a simple state-dependent policy for determining the rearrival times of deferred customers. For this policy, we characterize the long run fraction of customers dropped. We also analyse a relaxation where the deferral times are bounded in expectation. Via extensive numerical evaluations, we demonstrate the superiority of the proposed state-dependent policies over naive state-independent deferral policies.

Submitted 28 October, 2019; originally announced October 2019.
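
For context on the baseline named in the abstract above, the classical Erlang B model gives the long-run fraction of customers dropped when no deferrals are allowed. The sketch below computes it with the standard recursion B(0) = 1, B(n) = a·B(n-1) / (n + a·B(n-1)), where a is the offered load in Erlangs. It covers only the no-deferral baseline, not the state-dependent deferral policy proposed in the paper, and the function and parameter names are illustrative assumptions.

```python
def erlang_b(offered_load: float, servers: int) -> float:
    """Blocking probability of the classical Erlang B (M/M/N/N) loss model.

    offered_load: arrival rate times mean service time, in Erlangs (assumed name).
    servers: number of servers N (assumed name).
    """
    if offered_load < 0 or servers < 0:
        raise ValueError("offered_load and servers must be non-negative")
    blocking = 1.0  # B(0): with zero servers every arrival is dropped
    for n in range(1, servers + 1):
        # Standard recursion, numerically stable compared with the factorial form.
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


if __name__ == "__main__":
    # Fraction of customers dropped at 8 Erlangs of load for a few system sizes.
    for n_servers in (5, 10, 15):
        print(n_servers, erlang_b(8.0, n_servers))
```

Any deferral policy can be compared against this value as the drop fraction of the plain loss system at the same load and number of servers.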

arXiv:1909.00553 [pdf, ps, other]

doi 10.1145/3352460.3358281

Touché: Towards Ideal and Efficient Cache Compression By Mitigating Tag Area Overheads

Authors: Seokin Hong, Bulent Abali, Alper Buyuktosunoglu, Michael B. Healy, Prashant J. Nair

Abstract: Compression is seen as a simple technique to increase the effective cache capacity. Unfortunately, compression techniques either incur tag area overheads or restrict data placement to only include neighboring compressed cache blocks to mitigate tag area overheads. Ideally, we should be able to place arbitrary compressed cache blocks without any placement restrictions and tag area overheads. This paper proposes Touché, a framework that enables storing multiple arbitrary compressed cache blocks within a physical cacheline without any tag area overheads. The Touché framework consists of three components. The first component, called the ``Signature'' (SIGN) engine, creates shortened signatures from the tag addresses of compressed blocks. Due to this, the SIGN engine can store multiple signatures in each tag entry. On a cache access, the physical cacheline is accessed only if there is a signature match (which has a negligible probability of false positive). The second component, called the ``Tag Appended Data'' (TADA) mechanism, stores the full tag addresses with data. TADA enables Touché to detect false positive signature matches by ensuring that the actual tag address is available for comparison. The third component, called the ``Superblock Marker'' (SMARK) mechanism, uses a unique marker in the tag entry to indicate the occurrence of compressed cache blocks from neighboring physical addresses in the same cacheline. Touché is completely hardware-based and achieves an average speedup of 12\% (ideal 13\%) when compared to an uncompressed baseline.

Submitted 2 September, 2019; originally announced September 2019.

Comments: Keywords: Compression, Caches, Tag Array, Data Array, Hashing

Journal ref: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, October 2019, Pages 453-465

arXiv:1908.09580 [pdf, other]

Revenue Sharing in the Internet: A Moral Hazard Approach and a Net-neutrality Perspective

Authors: Fehmina Malik, Manjesh K. Hanawal, Yezekael Hayel, Jayakrishnan Nair

Abstract: Revenue sharing contracts between Content Providers (CPs) and Internet Service Providers (ISPs) can act as leverage for enhancing the infrastructure of the Internet. ISPs can be incentivized to make investments in network infrastructure that improve Quality of Service (QoS) for users if attractive contracts are negotiated between them and CPs. The idea here is that part of the net profit gained by CPs is given to ISPs to invest in the network. The Moral Hazard economic framework is used to model such an interaction, in which a principal determines a contract, and an agent reacts by adapting her effort. In our setting, several competitive CPs interact through one common ISP. Two cases are studied: (i) the ISP differentiates between the CPs and makes a (potentially) different investment to improve the QoS of each CP, and (ii) the ISP does not differentiate between CPs and makes a common investment for both. The last scenario can be viewed as \emph{network neutral behavior} on the part of the ISP. We analyse the optimal contracts and show that the CP that can better monetize its demand always prefers the non-neutral regime. Interestingly, ISP revenue, as well as social utility, are also found to be higher under the non-neutral regime.

Submitted 26 August, 2019; originally announced August 2019.

arXiv:1907.09049 [pdf, ps, other]

Multiple Server SRPT with speed scaling is competitive

Authors: Rahul Vaze, Jayakrishnan Nair

Abstract: Can the popular shortest remaining processing time (SRPT) algorithm achieve a constant competitive ratio on multiple servers when server speeds are adjustable (speed scaling) with respect to the flow time plus energy consumption metric? This question has remained open for a while, where a negative result in the absence of speed scaling is well known. The main result of this paper is to show that multi-server SRPT can be constant competitive, with a competitive ratio that only depends on the power-usage function of the servers, but not on the number of jobs/servers or the job sizes (unlike when speed scaling is not allowed). When all job sizes are unity, we show that round-robin routing is optimal and can achieve the same competitive ratio as the best known algorithm for the single server problem. Finally, we show that a class of greedy dispatch policies, including policies that route to the least loaded or the shortest queue, do not admit a constant competitive ratio. When job arrivals are stochastic, with Poisson arrivals and i.i.d. job sizes, we show that random routing and a simple gated-static speed scaling algorithm achieve a constant competitive ratio.

Submitted 5 May, 2020; v1 submitted 21 July, 2019; originally announced July 2019.

Comments: To appear in IEEE/ACM Transactions on Networking

arXiv:1907.04498 [pdf, other]

Speed Scaling with Tandem Servers

Authors: Rahul Vaze, Jayakrishnan Nair

Abstract: Speed scaling for a tandem server setting is considered, where there is a series of servers, and each job has to be processed by each of the servers in sequence. Servers have a variable speed, their power consumption being a convex increasing function of the speed. We consider the worst case setting as well as the stochastic setting. In the worst case setting, the jobs are assumed to be of unit size with arbitrary (possibly adversarially determined) arrival instants. For this problem, we devise an online speed scaling algorithm that is constant competitive with respect to the optimal offline algorithm that has non-causal information. The proposed algorithm, at all times, uses the same speed on all active servers, such that the total power consumption equals the number of outstanding jobs. In the stochastic setting, we consider a more general tandem network, with a parallel bank of servers at each stage. In this setting, we show that random routing with a simple gated static speed selection is constant competitive. In both cases, the competitive ratio depends only on the power functions, and is independent of the workload and the number of servers.

Submitted 9 July, 2019; originally announced July 2019.

arXiv:1906.03587 [pdf, other]

Partial Server Pooling in Redundancy Systems

Authors: Akshay Mete, D. Manjunath, Jayakrishnan Nair, Balakrishna Prabhu

Abstract: Partial sharing allows providers to possibly pool a fraction of their resources when full pooling is not beneficial to them. Recent work in systems without sharing has shown that redundancy can improve performance considerably. In this paper, we combine partial sharing and redundancy by developing partial sharing models for providers operating multi-server systems with redundancy. Two M/M/N queues with redundant service models are considered. Copies of an arriving job are placed in the queues of servers that can serve the job. Partial sharing models for cancel-on-complete and cancel-on-start redundancy models are developed. For cancel-on-complete, it is shown that the Pareto efficient region is the full pooling configuration. For a cancel-on-start policy, we conjecture that the Pareto frontier is always non-empty and is such that at least one of the two providers is sharing all of its resources. For this system, using bargaining theory the sharing configuration that the providers may use is determined. Mean response time and probability of waiting are the performance metrics considered.

Submitted 9 June, 2019; originally announced June 2019.

arXiv:1906.00581 [pdf, ps, other]

Sponsored data with ISP competition

Authors: Pooja Vyavahare, D. Manjunath, Jayakrishnan Nair

Abstract: We analyze the effect of sponsored data platforms when Internet service providers (ISPs) compete for subscribers and content providers (CPs) compete for a share of the bandwidth usage by the customers. Our analytical model is of a full information, leader-follower game. ISPs lead and set prices for sponsorship. CPs then make the binary decision of sponsoring or not sponsoring their content on the ISPs. Lastly, based on both of these, users make a two-part decision of choosing the ISP to which they subscribe, and the amount of data to consume from each of the CPs through the chosen ISP. User consumption is determined by a utility maximization framework, the sponsorship decision is determined by a non-cooperative game between the CPs, and the ISPs set their prices to maximize their profit in response to the prices set by the competing ISP. We analyze the pricing dynamics of the prices set by the ISPs, the sponsorship decisions that the CPs make and the market structure therein, and the surpluses of the ISPs, CPs, and users. This is the first analysis of the effect of sponsored data platforms in the presence of ISP competition. We show that inter-ISP competition does not inhibit ISPs from extracting a significant fraction of the CP surplus. Moreover, the ISPs often have an incentive to significantly skew the CP marketplace in favor of the most profitable CP.

Submitted 3 June, 2019; originally announced June 2019.

arXiv:1906.00569 [pdf, other]

Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards

Authors: Anmol Kagrecha, Jayakrishnan Nair, Krishna Jagannathan

Abstract: Classical multi-armed bandit problems use the expected value of an arm as a metric to evaluate its goodness. However, the expected value is a risk-neutral metric. In many applications like finance, one is interested in balancing the expected return of an arm (or portfolio) with the risk associated with that return. In this paper, we consider the problem of selecting the arm that optimizes a linear combination of the expected reward and the associated Conditional Value at Risk (CVaR) in a fixed budget best-arm identification framework. We allow the reward distributions to be unbounded or even heavy-tailed. For this problem, our goal is to devise algorithms that are entirely distribution oblivious, i.e., the algorithm is not aware of any information on the reward distributions, including bounds on the moments/tails, or the suboptimality gaps across arms. In this paper, we provide a class of such algorithms with provable upper bounds on the probability of incorrect identification. In the process, we develop a novel estimator for the CVaR of unbounded (including heavy-tailed) random variables and prove a concentration inequality for the same, which could be of independent interest. We also compare the error bounds for our distribution oblivious algorithms with those corresponding to standard non-oblivious algorithms. Finally, numerical experiments reveal that our algorithms perform competitively when compared with non-oblivious algorithms, suggesting that distribution obliviousness can be realised in practice without incurring a significant loss of performance.

Submitted 3 June, 2019; originally announced June 2019.

arXiv:1904.06480 [pdf, other]

Dynamic scheduling in a partially fluid, partially lossy queueing system

Authors: Kiran Chaudhary, Veeraruna Kavitha, Jayakrishnan Nair

Abstract: We consider a single server queueing system with two classes of jobs: eager jobs with small sizes that require service to begin almost immediately upon arrival, and tolerant jobs with larger sizes that can wait for service. While blocking probability is the relevant performance metric for the eager class, the tolerant class seeks to minimize its mean sojourn time. In this paper, we discuss the performance of each class under dynamic scheduling policies, where the scheduling of both classes depends on the instantaneous state of the system. This analysis is carried out under a certain fluid limit, where the arrival rate and service rate of the eager class are scaled to infinity, holding the offered load constant. Our performance characterizations reveal a (dynamic) pseudo-conservation law that ties the performance of both the classes to the standalone blocking probabilities of the eager class. Further, the performance is robust to other specifics of the scheduling policies. We also characterize the Pareto frontier of the achievable region of performance vectors under the same fluid limit, and identify a (two-parameter) class of Pareto-complete scheduling policies.

Submitted 23 December, 2021; v1 submitted 13 April, 2019; originally announced April 2019.

arXiv:1808.06175 [pdf, other]

doi 10.1109/TNET.2019.2918164

Sharing within limits: Partial resource pooling in loss systems

Authors: Anvitha Nandigam, Suraj Jog, D. Manjunath, Jayakrishnan Nair, B. J. Prabhu

Abstract: Fragmentation of expensive resources, e.g., spectrum for wireless services, between providers can introduce inefficiencies in resource utilisation and worsen overall system performance. In such cases, resource pooling between independent service providers can be used to improve performance. However, for providers to agree to pool their resources, the arrangement has to be mutually beneficial. The traditional notion of resource pooling, which implies complete sharing, need not have this property. For example, under full pooling, one of the providers may be worse off and hence have no incentive to participate. In this paper, we propose partial resource sharing models as a generalization of full pooling, which can be configured to be beneficial to all participants. We formally define and analyze two partial sharing models between two service providers, each of which is an Erlang-B loss system with the blocking probabilities as the performance measure. We show that there always exist partial sharing configurations that are beneficial to both providers, irrespective of the load and the number of circuits of each of the providers. A key result is that the Pareto frontier has at least one of the providers sharing all its resources with the other. Furthermore, full pooling may not lie inside this Pareto set. The choice of the sharing configurations within the Pareto set is formalized based on bargaining theory. Finally, large system approximations of the blocking probabilities in the quality-efficiency-driven regime are presented.

Submitted 19 August, 2018; originally announced August 2018.

arXiv:1805.10831 [pdf, ps, other]

On QoS-Compliant Telehaptic Communication over Shared Networks

Authors: Vineet Gokhale, Jayakrishnan Nair, Subhasis Chaudhuri, Jan Fesl

Abstract: The development of communication protocols for teleoperation with force feedback (generally known as telehaptics) has gained widespread interest over the past decade. Several protocols have been proposed for performing telehaptic interaction over shared networks. However, a comprehensive analysis of the impact of network cross-traffic on telehaptic streams, and the feasibility of Quality of Service (QoS) compliance is lacking in the literature. In this paper, we seek to fill this gap. Specifically, we explore the QoS experienced by two classes of telehaptic protocols on shared networks - Constant Bitrate (CBR) protocols and adaptive sampling based protocols, accounting for CBR as well as TCP cross-traffic. Our treatment of CBR-based telehaptic protocols is based on a micro-analysis of the interplay between TCP and CBR flows on a shared bottleneck link, which is broadly applicable for performance evaluation of CBR-based media streaming applications. Based on our analytical characterization of telehaptic QoS, and via extensive simulations and real network experiments, we formulate a set of sufficient conditions for telehaptic QoS-compliance. These conditions provide guidelines for designers of telehaptic protocols, and for network administrators to configure their networks for guaranteeing QoS-compliant telehaptic communication.

Submitted 28 May, 2018; originally announced May 2018.

Comments: 28 pages, 11 figures

arXiv:1805.03184 [pdf, other]

LISA: Increasing Internal Connectivity in DRAM for Fast Data Movement and Low Latency

Authors: Kevin K. Chang, Prashant J. Nair, Saugata Ghose, Donghyuk Lee, Moinuddin K. Qureshi, Onur Mutlu

Abstract: This paper summarizes the idea of Low-Cost Interlinked Subarrays (LISA), which was published in HPCA 2016, and examines the work's significance and future potential. Contemporary systems perform bulk data movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel results in high latency and energy consumption. Prior work proposes to avoid these high costs by exploiting the existing wide internal DRAM bandwidth for bulk data movement, but the limited connectivity of wires within DRAM allows fast data movement within only a single DRAM subarray. Each subarray is only a few megabytes in size, greatly restricting the range over which fast bulk data movement can happen within DRAM. Our HPCA 2016 paper proposes a new DRAM substrate, Low-Cost Inter-Linked Subarrays (LISA), whose goal is to enable fast and efficient data movement across a large range of memory at low cost. LISA adds low-cost connections between adjacent subarrays. By using these connections to interconnect the existing internal wires (bitlines) of adjacent subarrays, LISA enables wide-bandwidth data transfer across multiple subarrays with little (only 0.8%) DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling a variety of new applications. We describe and evaluate three such applications in detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a DRAM architecture whose rows have heterogeneous access latencies, and (3) accelerated bitline precharging by linking multiple precharge units together. Our extensive evaluations show that each of LISA's three applications significantly improves performance and memory energy efficiency on a variety of workloads and system configurations.

Submitted 8 May, 2018; originally announced May 2018.

arXiv:1704.03991 [pdf, ps, other]

Architectural Techniques to Enable Reliable and Scalable Memory Systems

Authors: Prashant J. Nair

Abstract: High capacity and scalable memory systems play a vital role in enabling our desktops, smartphones, and pervasive technologies like Internet of Things (IoT). Unfortunately, memory systems are becoming increasingly prone to faults. This is because we rely on technology scaling to improve memory density, and at small feature sizes, memory cells tend to break easily. Today, memory reliability is seen as the key impediment towards using high-density devices, adopting new technologies, and even building the next Exascale supercomputer. To ensure even a bare-minimum level of reliability, present-day solutions tend to have high performance, power and area overheads. Ideally, we would like memory systems to remain robust, scalable, and implementable while keeping the overheads to a minimum. This dissertation describes how simple cross-layer architectural techniques can provide orders of magnitude higher reliability and enable seamless scalability for memory systems while incurring negligible overheads.

Submitted 13 April, 2017; originally announced April 2017.

Comments: PhD thesis, Georgia Institute of Technology (May 2017)

arXiv:1610.00609 [pdf, ps, other]

Congestion Control for Network-Aware Telehaptic Communication

Authors: Vineet Gokhale, Jayakrishnan Nair, Subhasis Chaudhuri

Abstract: Telehaptic applications involve delay-sensitive multimedia communication between remote locations with distinct Quality of Service (QoS) requirements for different media components. These QoS constraints pose a variety of challenges, especially when the communication occurs over a shared network, with unknown and time-varying cross-traffic. In this work, we propose a transport layer congestion control protocol for telehaptic applications operating over shared networks, termed as dynamic packetization module (DPM). DPM is a lossless, network-aware protocol which tunes the telehaptic packetization rate based on the level of congestion in the network. To monitor the network congestion, we devise a novel network feedback module, which communicates the end-to-end delays encountered by the telehaptic packets to the respective transmitters with negligible overhead. Via extensive simulations, we show that DPM meets the QoS requirements of telehaptic applications over a wide range of network cross-traffic conditions. We also report qualitative results of a real-time telepottery experiment with several human subjects, which reveal that DPM preserves the quality of telehaptic activity even under heavily congested network scenarios. Finally, we compare the performance of DPM with several previously proposed telehaptic communication protocols and demonstrate that DPM outperforms these protocols.

Submitted 11 January, 2017; v1 submitted 3 October, 2016; originally announced October 2016.

Comments: 25 pages, 19 figures

arXiv:1412.7532 [pdf]

Toward Refactoring of DMARF and GIPSY Case Studies -- A Team XI SOEN6471-S14 Project Report

Authors: Zinia Das, Mohammad Iftekharul Hoque, Renuka Milkoori, Jithin Nair, Rohan Nayak, Swamy Yogya Reddy, Dhana Shree Sankini, Arslan Zaffar

Abstract: This report focuses on improving the internal structure of the Distributed Modular Audio recognition Framework (DMARF) and the General Intensional Programming System (GIPSY) case studies without affecting their original behavior. At first, the general principles and the working of DMARF and GIPSY are understood by mainly stressing the architecture of the systems, looking at their frameworks and running them in the Eclipse environment. To improve the quality of the structure of the code, a further understanding of the architecture of the case studies is needed, and this is achieved by analyzing the design patterns present in the code. The improvement is done by the identification and removal of code smells in the code of the case studies. Code smells are identified by analyzing the source code by using Logiscope and JDeodorant. Some refactoring techniques are suggested, out of which the best suited ones are implemented to improve the code. Finally, test cases are implemented to check if the behavior of the code has changed or not.

Submitted 23 December, 2014; originally announced December 2014.

Comments: 84 pages, 69 figures, 7 tables

ACM Class: D.2; K.6; H.5.2

Showing 1–43 of 43 results for author: Nair, J