Search | arXiv e-print repository

arXiv:1909.10802

Gap Aware Mitigation of Gradient Staleness

Authors: Saar Barkai, Ido Hakimi, Assaf Schuster

Abstract: Cloud computing is becoming increasingly popular as a platform for distributed training of deep neural networks. Synchronous stochastic gradient descent (SSGD) suffers from substantial slowdowns due to stragglers if the environment is non-dedicated, as is common in cloud computing. Asynchronous SGD (ASGD) methods are immune to these slowdowns but are scarcely used due to gradient staleness, which encumbers the convergence process. Recent techniques have had limited success mitigating the gradient staleness when scaling up to many workers (computing nodes). In this paper we define the Gap as a measure of gradient staleness and propose Gap-Aware (GA), a novel asynchronous-distributed method that penalizes stale gradients linearly to the Gap and performs well even when scaling to large numbers of workers. Our evaluation on the CIFAR, ImageNet, and WikiText-103 datasets shows that GA outperforms the currently acceptable gradient penalization method, in final test accuracy. We also provide convergence rate proof for GA. Despite prior beliefs, we show that if GA is applied, momentum becomes beneficial in asynchronous environments, even when the number of workers scales up.

Submitted 3 February, 2020; v1 submitted 24 September, 2019; originally announced September 2019.

Comments: Published as a conference paper at ICLR 2020
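
The abstract above describes penalizing stale gradients linearly in the Gap but does not define the Gap itself. The following is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes the Gap is the Euclidean distance between the master's current parameters and the parameters the worker used, divided by a hypothetical `reference_step` (a typical update size). The names `gap` and `penalized_update` are illustrative only.

```python
import math


def gap(master_params, worker_params, reference_step):
    """Hypothetical Gap: distance between the master's parameters and the
    (possibly stale) parameters the worker computed its gradient on,
    measured in units of an assumed typical step size."""
    distance = math.sqrt(
        sum((m - w) ** 2 for m, w in zip(master_params, worker_params))
    )
    return distance / reference_step


def penalized_update(master_params, worker_params, gradient, lr, reference_step):
    """Apply one SGD step on the master, dividing the gradient by the Gap.

    The penalty is clamped at 1 so that a fresh gradient (Gap below 1)
    is applied unchanged; this clamp is an assumption of the sketch.
    """
    penalty = max(1.0, gap(master_params, worker_params, reference_step))
    return [m - lr * g / penalty for m, g in zip(master_params, gradient)]


# A fresh gradient is applied in full; a stale one is scaled down by the Gap.
fresh = penalized_update([1.0, 2.0], [1.0, 2.0], [1.0, 1.0], lr=0.5, reference_step=1.0)
stale = penalized_update([3.0, 4.0], [0.0, 0.0], [5.0, 10.0], lr=1.0, reference_step=1.0)
```

In this sketch the penalty grows linearly with how far the master has moved since the worker read its parameters, so a worker that falls far behind contributes proportionally less to each update.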

arXiv:1907.11612

Taming Momentum in a Distributed Asynchronous Environment

Authors: Ido Hakimi, Saar Barkai, Moshe Gabel, Assaf Schuster

Abstract: Although distributed computing can significantly reduce the training time of deep neural networks, scaling the training process while maintaining high efficiency and final accuracy is challenging. Distributed asynchronous training enjoys near-linear speedup, but asynchrony causes gradient staleness - the main difficulty in scaling stochastic gradient descent to large clusters. Momentum, which is often used to accelerate convergence and escape local minima, exacerbates the gradient staleness, thereby hindering convergence. We propose DANA: a novel technique for asynchronous distributed SGD with momentum that mitigates gradient staleness by computing the gradient on an estimated future position of the model's parameters. Thereby, we show for the first time that momentum can be fully incorporated in asynchronous training with almost no ramifications to final accuracy. Our evaluation on the CIFAR and ImageNet datasets shows that DANA outperforms existing methods, in both final accuracy and convergence speed while scaling up to a total batch size of 16K on 64 asynchronous workers.

Submitted 14 October, 2020; v1 submitted 26 July, 2019; originally announced July 2019.
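
The abstract above says the gradient is computed on an estimated future position of the parameters, without giving the estimator. The following is a minimal single-worker sketch of that general idea in the style of a Nesterov look-ahead, not DANA itself: it assumes the future position can be approximated by moving the current parameters along the momentum vector. The helper names `look_ahead` and `momentum_step` and the toy loss are hypothetical.

```python
def look_ahead(params, velocity, momentum):
    """Estimate where the parameters will be after the pending momentum
    step (an assumed estimator; the abstract does not specify one)."""
    return [p + momentum * v for p, v in zip(params, velocity)]


def momentum_step(params, velocity, gradient, lr, momentum):
    """Classical momentum update, returning new parameters and velocity."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, gradient)]
    new_params = [p + nv for p, nv in zip(params, new_velocity)]
    return new_params, new_velocity


def grad_quadratic(params):
    """Gradient of the toy loss f(x) = 0.5 * sum(x_i^2)."""
    return list(params)


# One step: the gradient is evaluated at the estimated future position
# rather than at the current parameters.
params, velocity = [1.0], [-0.5]
future = look_ahead(params, velocity, momentum=0.9)
params, velocity = momentum_step(
    params, velocity, grad_quadratic(future), lr=0.1, momentum=0.9
)
```

Evaluating the gradient at the look-ahead point means the gradient better matches the parameters it will eventually be applied to, which is the intuition the abstract gives for why momentum stops amplifying staleness.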

arXiv:1803.06294

SDN for End-Nodes: Scenario Analysis and Architectural Guidelines

Authors: Alberto Rodriguez-Natal, Vina Ermagan, Kien Nguyen, Sharon Barkai, Yusheng Ji, Fabio Maino, Albert Cabellos-Aparicio

Abstract: The advent of SDN has brought a plethora of new architectures and controller designs for many use-cases and scenarios. Existing SDN deployments focus on campus, datacenter and WAN networks. However, little research efforts have been devoted to the scenario of effectively controlling a full deployment of end-nodes (e.g. smartphones) that are transient and scattered across the Internet. In this paper, we present a rigorous analysis of the challenges associated with an SDN architecture for end-nodes, show that such challenges are not found in existing SDN scenarios, and provide practical design guidelines to address them. Then, and following these guidelines we present a reference architecture based on a decentralized, distributed and symmetric controller with a connectionless pull-oriented southbound and an intent-driven northbound. Finally, we measure a proof-of-concept deployment to assess the validity of the analysis as well as the architecture.

Submitted 16 March, 2018; originally announced March 2018.

arXiv:1606.06222

doi 10.1145/3138808.3138810

Knowledge-Defined Networking

Authors: Albert Mestres, Alberto Rodriguez-Natal, Josep Carner, Pere Barlet-Ros, Eduard Alarcón, Marc Solé, Victor Muntés, David Meyer, Sharon Barkai, Mike J Hibbett, Giovani Estrada, Khaldun Ma`ruf, Florin Coras, Vina Ermagan, Hugo Latapie, Chris Cassar, John Evans, Fabio Maino, Jean Walrand, Albert Cabellos

Abstract: The research community has considered in the past the application of Artificial Intelligence (AI) techniques to control and operate networks. A notable example is the Knowledge Plane proposed by D.Clark et al. However, such techniques have not been extensively prototyped or deployed in the field yet. In this paper, we explore the reasons for the lack of adoption and posit that the rise of two recent paradigms: Software-Defined Networking (SDN) and Network Analytics (NA), will facilitate the adoption of AI techniques in the context of network operation and control. We describe a new paradigm that accommodates and exploits SDN, NA and AI, and provide use cases that illustrate its applicability and benefits. We also present simple experimental results that support its feasibility. We refer to this new paradigm as Knowledge-Defined Networking (KDN).

Submitted 23 June, 2016; v1 submitted 20 June, 2016; originally announced June 2016.

Comments: 8 pages, 22 references, 6 figures and 1 table

Journal ref: ACM SIGCOMM Computer Communication Review, Volume 47, Issue 3, July 2017

Showing 1–4 of 4 results for author: Barkai, S