-
Gap Aware Mitigation of Gradient Staleness
Authors:
Saar Barkai,
Ido Hakimi,
Assaf Schuster
Abstract:
Cloud computing is becoming increasingly popular as a platform for distributed training of deep neural networks. Synchronous stochastic gradient descent (SSGD) suffers from substantial slowdowns due to stragglers if the environment is non-dedicated, as is common in cloud computing. Asynchronous SGD (ASGD) methods are immune to these slowdowns but are scarcely used due to gradient staleness, which…
▽ More
Cloud computing is becoming increasingly popular as a platform for distributed training of deep neural networks. Synchronous stochastic gradient descent (SSGD) suffers from substantial slowdowns due to stragglers if the environment is non-dedicated, as is common in cloud computing. Asynchronous SGD (ASGD) methods are immune to these slowdowns but are scarcely used due to gradient staleness, which encumbers the convergence process. Recent techniques have had limited success mitigating the gradient staleness when scaling up to many workers (computing nodes). In this paper we define the Gap as a measure of gradient staleness and propose Gap-Aware (GA), a novel asynchronous-distributed method that penalizes stale gradients linearly to the Gap and performs well even when scaling to large numbers of workers. Our evaluation on the CIFAR, ImageNet, and WikiText-103 datasets shows that GA outperforms the currently acceptable gradient penalization method, in final test accuracy. We also provide convergence rate proof for GA. Despite prior beliefs, we show that if GA is applied, momentum becomes beneficial in asynchronous environments, even when the number of workers scales up.
△ Less
Submitted 3 February, 2020; v1 submitted 24 September, 2019;
originally announced September 2019.
-
Taming Momentum in a Distributed Asynchronous Environment
Authors:
Ido Hakimi,
Saar Barkai,
Moshe Gabel,
Assaf Schuster
Abstract:
Although distributed computing can significantly reduce the training time of deep neural networks, scaling the training process while maintaining high efficiency and final accuracy is challenging. Distributed asynchronous training enjoys near-linear speedup, but asynchrony causes gradient staleness - the main difficulty in scaling stochastic gradient descent to large clusters. Momentum, which is o…
▽ More
Although distributed computing can significantly reduce the training time of deep neural networks, scaling the training process while maintaining high efficiency and final accuracy is challenging. Distributed asynchronous training enjoys near-linear speedup, but asynchrony causes gradient staleness - the main difficulty in scaling stochastic gradient descent to large clusters. Momentum, which is often used to accelerate convergence and escape local minima, exacerbates the gradient staleness, thereby hindering convergence. We propose DANA: a novel technique for asynchronous distributed SGD with momentum that mitigates gradient staleness by computing the gradient on an estimated future position of the model's parameters. Thereby, we show for the first time that momentum can be fully incorporated in asynchronous training with almost no ramifications to final accuracy. Our evaluation on the CIFAR and ImageNet datasets shows that DANA outperforms existing methods, in both final accuracy and convergence speed while scaling up to a total batch size of 16K on 64 asynchronous workers.
△ Less
Submitted 14 October, 2020; v1 submitted 26 July, 2019;
originally announced July 2019.
-
SDN for End-Nodes: Scenario Analysis and Architectural Guidelines
Authors:
Alberto Rodriguez-Natal,
Vina Ermagan,
Kien Nguyen,
Sharon Barkai,
Yusheng Ji,
Fabio Maino,
Albert Cabellos-Aparicio
Abstract:
The advent of SDN has brought a plethora of new architectures and controller designs for many use-cases and scenarios. Existing SDN deployments focus on campus, datacenter and WAN networks. However, little research efforts have been devoted to the scenario of effectively controlling a full deployment of end-nodes (e.g. smartphones) that are transient and scattered across the Internet. In this pape…
▽ More
The advent of SDN has brought a plethora of new architectures and controller designs for many use-cases and scenarios. Existing SDN deployments focus on campus, datacenter and WAN networks. However, little research efforts have been devoted to the scenario of effectively controlling a full deployment of end-nodes (e.g. smartphones) that are transient and scattered across the Internet. In this paper, we present a rigorous analysis of the challenges associated with an SDN architecture for end-nodes, show that such challenges are not found in existing SDN scenarios, and provide practical design guidelines to address them. Then, and following these guidelines we present a reference architecture based on a decentralized, distributed and symmetric controller with a connectionless pull-oriented southbound and an intent-driven northbound. Finally, we measure a proof-of-concept deployment to assess the validity of the analysis as well as the architecture.
△ Less
Submitted 16 March, 2018;
originally announced March 2018.
-
Knowledge-Defined Networking
Authors:
Albert Mestres,
Alberto Rodriguez-Natal,
Josep Carner,
Pere Barlet-Ros,
Eduard Alarcón,
Marc Solé,
Victor Muntés,
David Meyer,
Sharon Barkai,
Mike J Hibbett,
Giovani Estrada,
Khaldun Ma`ruf,
Florin Coras,
Vina Ermagan,
Hugo Latapie,
Chris Cassar,
John Evans,
Fabio Maino,
Jean Walrand,
Albert Cabellos
Abstract:
The research community has considered in the past the application of Artificial Intelligence (AI) techniques to control and operate networks. A notable example is the Knowledge Plane proposed by D.Clark et al. However, such techniques have not been extensively prototyped or deployed in the field yet. In this paper, we explore the reasons for the lack of adoption and posit that the rise of two rece…
▽ More
The research community has considered in the past the application of Artificial Intelligence (AI) techniques to control and operate networks. A notable example is the Knowledge Plane proposed by D.Clark et al. However, such techniques have not been extensively prototyped or deployed in the field yet. In this paper, we explore the reasons for the lack of adoption and posit that the rise of two recent paradigms: Software-Defined Networking (SDN) and Network Analytics (NA), will facilitate the adoption of AI techniques in the context of network operation and control. We describe a new paradigm that accommodates and exploits SDN, NA and AI, and provide use cases that illustrate its applicability and benefits. We also present simple experimental results that support its feasibility. We refer to this new paradigm as Knowledge-Defined Networking (KDN).
△ Less
Submitted 23 June, 2016; v1 submitted 20 June, 2016;
originally announced June 2016.