-
Exploring Low Rank Training of Deep Neural Networks
Authors:
Siddhartha Rao Kamalakara,
Acyr Locatelli,
Bharat Venkitesh,
Jimmy Ba,
Yarin Gal,
Aidan N. Gomez
Abstract:
Training deep neural networks in low rank, i.e. with factorised layers, is of particular interest to the community: it offers efficiency over unfactorised training in terms of both memory consumption and training time. Prior work has focused on low rank approximations of pre-trained networks and training in low rank space with additional objectives, offering various ad hoc explanations for chosen practice. We analyse techniques that work well in practice, and through extensive ablations on models such as GPT2 we provide evidence falsifying common beliefs in the field, hinting in the process at exciting research opportunities that still need answering.
Submitted 27 September, 2022;
originally announced September 2022.
-
Scalable Training of Language Models using JAX pjit and TPUv4
Authors:
Joanna Yoo,
Kuba Perlin,
Siddhartha Rao Kamalakara,
João G. M. Araújo
Abstract:
Modern large language models require distributed training strategies due to their size. The challenges of efficiently and robustly training them are met with rapid developments on both software and hardware frontiers. In this technical report, we explore challenges and design decisions associated with developing a scalable training framework, and present a quantitative analysis of efficiency improvements coming from adopting new software and hardware solutions.
Submitted 13 April, 2022;
originally announced April 2022.
-
Learning Sparse Networks Using Targeted Dropout
Authors:
Aidan N. Gomez,
Ivan Zhang,
Siddhartha Rao Kamalakara,
Divyam Madaan,
Kevin Swersky,
Yarin Gal,
Geoffrey E. Hinton
Abstract:
Neural networks are easier to optimise when they have many more weights than are required for modelling the mapping from inputs to outputs. This suggests a two-stage learning procedure that first learns a large net and then prunes away connections or hidden units. But standard training does not necessarily encourage nets to be amenable to pruning. We introduce targeted dropout, a method for training a neural network so that it is robust to subsequent pruning. Before computing the gradients for each weight update, targeted dropout stochastically selects a set of units or weights to be dropped using a simple self-reinforcing sparsity criterion and then computes the gradients for the remaining weights. The resulting network is robust to post hoc pruning of weights or units that frequently occur in the dropped sets. The method improves upon more complicated sparsifying regularisers while being simple to implement and easy to tune.
Submitted 9 September, 2019; v1 submitted 31 May, 2019;
originally announced May 2019.