doi 10.1007/s10994-023-06480-0

Better Schedules for Low Precision Training of Deep Neural Networks

Authors: Cameron R. Wolfe, Anastasios Kyrillidis

Abstract: Low precision training can significantly reduce the computational overhead of training deep neural networks (DNNs). Though many such techniques exist, cyclic precision training (CPT), which dynamically adjusts precision throughout training according to a cyclic schedule, achieves particularly impressive improvements in training efficiency, while actually improving DNN performance. Existing CPT implementations take common learning rate schedules (e.g., cyclical cosine schedules) and use them for low precision training without adequate comparisons to alternative scheduling options. We define a diverse suite of CPT schedules and analyze their performance across a variety of DNN training regimes, some of which are unexplored in the low precision training literature (e.g., node classification with graph neural networks). From these experiments, we discover alternative CPT schedules that offer further improvements in training efficiency and model performance, as well as derive a set of best practices for choosing CPT schedules. Going further, we find that a correlation exists between model performance and training cost, and that changing the underlying CPT schedule can control the tradeoff between these two variables. To explain the direct correlation between model performance and training cost, we draw a connection between quantized training and critical learning periods, suggesting that aggressive quantization is a form of learning impairment that can permanently damage model performance.

Submitted 4 March, 2024; originally announced March 2024.

Comments: 20 pages, 8 figures, 1 table, ACML 2023

ACM Class: I.2.6; I.2.10; I.4.0

Journal ref: Machine Learning (2024): 1-19

arXiv:2211.04624

Cold Start Streaming Learning for Deep Networks

Authors: Cameron R. Wolfe, Anastasios Kyrillidis

Abstract: The ability to dynamically adapt neural networks to newly-available data without performance deterioration would revolutionize deep learning applications. Streaming learning (i.e., learning from one data example at a time) has the potential to enable such real-time adaptation, but current approaches i) freeze a majority of network parameters during streaming and ii) are dependent upon offline, base initialization procedures over large subsets of data, which damages performance and limits applicability. To mitigate these shortcomings, we propose Cold Start Streaming Learning (CSSL), a simple, end-to-end approach for streaming learning with deep networks that uses a combination of replay and data augmentation to avoid catastrophic forgetting. Because CSSL updates all model parameters during streaming, the algorithm is capable of beginning streaming from a random initialization, making base initialization optional. Going further, the algorithm's simplicity allows theoretical convergence guarantees to be derived using analysis of the Neural Tangent Random Feature (NTRF). In experiments, we find that CSSL outperforms existing baselines for streaming learning in experiments on CIFAR100, ImageNet, and Core50 datasets. Additionally, we propose a novel multi-task streaming learning setting and show that CSSL performs favorably in this domain. Put simply, CSSL performs well and demonstrates that the complicated, multi-step training pipelines adopted by most streaming methodologies can be replaced with a simple, end-to-end learning approach without sacrificing performance.

Submitted 8 November, 2022; originally announced November 2022.

Comments: 52 pages, 7 figures, pre-print

MSC Class: 68T07 ACM Class: I.2.6; I.2.10; I.4.0

arXiv:2205.12484

GisPy: A Tool for Measuring Gist Inference Score in Text

Authors: Pedram Hosseini, Christopher R. Wolfe, Mona Diab, David A. Broniatowski

Abstract: Decision making theories such as Fuzzy-Trace Theory (FTT) suggest that individuals tend to rely on gist, or bottom-line meaning, in the text when making decisions. In this work, we delineate the process of developing GisPy, an open-source tool in Python for measuring the Gist Inference Score (GIS) in text. Evaluation of GisPy on documents in three benchmarks from the news and scientific text domains demonstrates that scores generated by our tool significantly distinguish low vs. high gist documents. Our tool is publicly available to use at: https://github.com/phosseini/GisPy.

Submitted 25 May, 2022; originally announced May 2022.

Comments: Accepted to the 4th Workshop on Narrative Understanding @ NAACL 2022

arXiv:2203.10428

PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication

Authors: Cheng Wan, Youjie Li, Cameron R. Wolfe, Anastasios Kyrillidis, Nam Sung Kim, Yingyan Lin

Abstract: Graph Convolutional Networks (GCNs) is the state-of-the-art method for learning graph-structured data, and training large-scale GCNs requires distributed training across multiple accelerators such that each accelerator is able to hold a partitioned subgraph. However, distributed GCN training incurs prohibitive overhead of communicating node features and feature gradients among partitions for every GCN layer during each training iteration, limiting the achievable training efficiency and model scalability. To this end, we propose PipeGCN, a simple yet effective scheme that hides the communication overhead by pipelining inter-partition communication with intra-partition computation. It is non-trivial to pipeline for efficient GCN training, as communicated node features/gradients will become stale and thus can harm the convergence, negating the pipeline benefit. Notably, little is known regarding the convergence rate of GCN training with both stale features and stale feature gradients. This work not only provides a theoretical convergence analysis but also finds the convergence rate of PipeGCN to be close to that of the vanilla distributed GCN training without any staleness. Furthermore, we develop a smoothing method to further improve PipeGCN's convergence. Extensive experiments show that PipeGCN can largely boost the training throughput (1.7x~28.5x) while achieving the same accuracy as its vanilla counterpart and existing full-graph training methods. The code is available at https://github.com/RICE-EIC/PipeGCN.

Submitted 19 March, 2022; originally announced March 2022.

Comments: ICLR 2022

arXiv:2112.04905

i-SpaSP: Structured Neural Pruning via Sparse Signal Recovery

Authors: Cameron R. Wolfe, Anastasios Kyrillidis

Abstract: We propose a novel, structured pruning algorithm for neural networks -- the iterative, Sparse Structured Pruning algorithm, dubbed as i-SpaSP. Inspired by ideas from sparse signal recovery, i-SpaSP operates by iteratively identifying a larger set of important parameter groups (e.g., filters or neurons) within a network that contribute most to the residual between pruned and dense network output, then thresholding these groups based on a smaller, pre-defined pruning ratio. For both two-layer and multi-layer network architectures with ReLU activations, we show the error induced by pruning with i-SpaSP decays polynomially, where the degree of this polynomial becomes arbitrarily large based on the sparsity of the dense network's hidden representations. In our experiments, i-SpaSP is evaluated across a variety of datasets (i.e., MNIST, ImageNet, and XNLI) and architectures (i.e., feed forward networks, ResNet34, MobileNetV2, and BERT), where it is shown to discover high-performing sub-networks and improve upon the pruning efficiency of provable baseline methodologies by several orders of magnitude. Put simply, i-SpaSP is easy to implement with automatic differentiation, achieves strong empirical results, comes with theoretical convergence guarantees, and is efficient, thus distinguishing itself as one of the few computationally efficient, practical, and provable pruning algorithms.

Submitted 29 March, 2022; v1 submitted 7 December, 2021; originally announced December 2021.

Comments: 29 pages, 4 figures, 4th Annual Conference on Learning for Dynamics and Control

MSC Class: 68T07 ACM Class: I.2.6; I.2.10; I.4.0

arXiv:2108.00259

How much pre-training is enough to discover a good subnetwork?

Authors: Cameron R. Wolfe, Fangshuo Liao, Qihan Wang, Junhyung Lyle Kim, Anastasios Kyrillidis

Abstract: Neural network pruning is useful for discovering efficient, high-performing subnetworks within pre-trained, dense network architectures. More often than not, it involves a three-step process -- pre-training, pruning, and re-training -- that is computationally expensive, as the dense model must be fully pre-trained. While previous work has revealed through experiments the relationship between the amount of pre-training and the performance of the pruned network, a theoretical characterization of such dependency is still missing. Aiming to mathematically analyze the amount of dense network pre-training needed for a pruned network to perform well, we discover a simple theoretical bound in the number of gradient descent pre-training iterations on a two-layer, fully-connected network, beyond which pruning via greedy forward selection [61] yields a subnetwork that achieves good training error. Interestingly, this threshold is shown to be logarithmically dependent upon the size of the dataset, meaning that experiments with larger datasets require more pre-training for subnetworks obtained via pruning to perform well. Lastly, we empirically validate our theoretical results on a multi-layer perceptron trained on MNIST.

Submitted 22 August, 2023; v1 submitted 31 July, 2021; originally announced August 2021.

Comments: 29 pages

MSC Class: 68T07 ACM Class: I.2.6; I.2.10; I.4.0

arXiv:2107.13054

Exceeding the Limits of Visual-Linguistic Multi-Task Learning

Authors: Cameron R. Wolfe, Keld T. Lundgaard

Abstract: By leveraging large amounts of product data collected across hundreds of live e-commerce websites, we construct 1000 unique classification tasks that share similarly-structured input data, comprised of both text and images. These classification tasks focus on learning the product hierarchy of different e-commerce websites, causing many of them to be correlated. Adopting a multi-modal transformer model, we solve these tasks in unison using multi-task learning (MTL). Extensive experiments are presented over an initial 100-task dataset to reveal best practices for "large-scale MTL" (i.e., MTL with more than 100 tasks). From these experiments, a final, unified methodology is derived, which is composed of both best practices and new proposals such as DyPa, a simple heuristic for automatically allocating task-specific parameters to tasks that could benefit from extra capacity. Using our large-scale MTL methodology, we successfully train a single model across all 1000 tasks in our dataset while using minimal task specific parameters, thereby showing that it is possible to extend several orders of magnitude beyond current efforts in MTL.

Submitted 27 July, 2021; originally announced July 2021.

Comments: 10 pages, 7 figures

MSC Class: 68T07 ACM Class: I.2.6; I.2.7; I.2.10

arXiv:2107.00961

ResIST: Layer-Wise Decomposition of ResNets for Distributed Training

Authors: Chen Dun, Cameron R. Wolfe, Christopher M. Jermaine, Anastasios Kyrillidis

Abstract: We propose ResIST, a novel distributed training protocol for Residual Networks (ResNets). ResIST randomly decomposes a global ResNet into several shallow sub-ResNets that are trained independently in a distributed manner for several local iterations, before having their updates synchronized and aggregated into the global model. In the next round, new sub-ResNets are randomly generated and the process repeats until convergence. By construction, per iteration, ResIST communicates only a small portion of network parameters to each machine and never uses the full model during training. Thus, ResIST reduces the per-iteration communication, memory, and time requirements of ResNet training to only a fraction of the requirements of full-model training. In comparison to common protocols, like data-parallel training and data-parallel training with local SGD, ResIST yields a decrease in communication and compute requirements, while being competitive with respect to model performance.

Submitted 14 March, 2022; v1 submitted 2 July, 2021; originally announced July 2021.

Comments: 26 pages, 8 figures, pre-print under review

arXiv:2102.10424

GIST: Distributed Training for Large-Scale Graph Convolutional Networks

Authors: Cameron R. Wolfe, Jingkang Yang, Arindam Chowdhury, Chen Dun, Artun Bayer, Santiago Segarra, Anastasios Kyrillidis

Abstract: The graph convolutional network (GCN) is a go-to solution for machine learning on graphs, but its training is notoriously difficult to scale both in terms of graph size and the number of model parameters. Although some work has explored training on large-scale graphs (e.g., GraphSAGE, ClusterGCN, etc.), we pioneer efficient training of large-scale GCN models (i.e., ultra-wide, overparameterized models) with the proposal of a novel, distributed training framework. Our proposed training methodology, called GIST, disjointly partitions the parameters of a GCN model into several, smaller sub-GCNs that are trained independently and in parallel. In addition to being compatible with all GCN architectures and existing sampling techniques for efficient GCN training, GIST i) improves model performance, ii) scales to training on arbitrarily large graphs, iii) decreases wall-clock training time, and iv) enables the training of markedly overparameterized GCN models. Remarkably, with GIST, we train an astonishingly-wide 32,768-dimensional GraphSAGE model, which exceeds the capacity of a single GPU by a factor of 8x, to SOTA performance on the Amazon2M dataset.

Submitted 14 March, 2022; v1 submitted 20 February, 2021; originally announced February 2021.

Comments: 28 pages, 5 figures, pre-print under review

ACM Class: I.2.4

arXiv:1912.00772

E-Stitchup: Data Augmentation for Pre-Trained Embeddings

Authors: Cameron R. Wolfe, Keld T. Lundgaard

Abstract: In this work, we propose data augmentation methods for embeddings from pre-trained deep learning models that take a weighted combination of a pair of input embeddings, as inspired by Mixup, and combine such augmentation with extra label softening. These methods are shown to significantly increase classification accuracy, reduce training time, and improve confidence calibration of a downstream model that is trained with them. As a result of such improved confidence calibration, the model output can be more intuitively interpreted and used to accurately identify out-of-distribution data by applying an appropriate confidence threshold to model predictions. The identified out-of-distribution data can then be prioritized for labeling, thus focusing labeling effort on data that is more likely to boost model performance. These findings, we believe, lay a solid foundation for improving the classification performance and calibration of models that use pre-trained embeddings as input and provide several benefits that prove extremely useful in a production-level deep learning system.

Submitted 6 October, 2020; v1 submitted 27 November, 2019; originally announced December 2019.

Comments: 11 pages, 7 figures

arXiv:1910.02120

Distributed Learning of Deep Neural Networks using Independent Subnet Training

Authors: Binhang Yuan, Cameron R. Wolfe, Chen Dun, Yuxin Tang, Anastasios Kyrillidis, Christopher M. Jermaine

Abstract: Distributed machine learning (ML) can bring more computational resources to bear than single-machine learning, thus enabling reductions in training time. Distributed learning partitions models and data over many machines, allowing model and dataset sizes beyond the available compute power and memory of a single machine. In practice though, distributed ML is challenging when distribution is mandatory, rather than chosen by the practitioner. In such scenarios, data could unavoidably be separated among workers due to limited memory capacity per worker or even because of data privacy issues. There, existing distributed methods will utterly fail due to dominant transfer costs across workers, or do not even apply. We propose a new approach to distributed fully connected neural network learning, called independent subnet training (IST), to handle these cases. In IST, the original network is decomposed into a set of narrow subnetworks with the same depth. These subnetworks are then trained locally before parameters are exchanged to produce new subnets and the training cycle repeats. Such a naturally "model parallel" approach limits memory usage by storing only a portion of network parameters on each device. Additionally, no requirements exist for sharing data between workers (i.e., subnet training is local and independent) and communication volume and frequency are reduced by decomposing the original network into independent subnets. These properties of IST can cope with issues due to distributed data, slow interconnects, or limited device memory, making IST a suitable approach for cases of mandatory distribution. We show experimentally that IST results in training times that are much lower than common distributed learning approaches.

Submitted 18 April, 2022; v1 submitted 4 October, 2019; originally announced October 2019.

arXiv:1903.10103

Functional Generative Design of Mechanisms with Recurrent Neural Networks and Novelty Search

Authors: Cameron R. Wolfe, Cem C. Tutum, Risto Miikkulainen

Abstract: Consumer-grade 3D printers have made it easier to fabricate aesthetic objects and static assemblies, opening the door to automated design of such objects. However, while static designs are easily produced with 3D printing, functional designs with moving parts are more difficult to generate: The search space is too high-dimensional, the resolution of the 3D-printed parts is not adequate, and it is difficult to predict the physical behavior of imperfect 3D-printed mechanisms. An example challenge is to produce a diverse set of reliable and effective gear mechanisms that could be used after production without extensive post-processing. To meet this challenge, an indirect encoding based on a Recurrent Neural Network (RNN) is created and evolved using novelty search. The elite solutions of each generation are 3D printed to evaluate their functional performance on a physical test platform. The system is able to discover sequential design rules that are difficult to discover with other methods. Compared to direct encoding evolved with Genetic Algorithms (GAs), its designs are geometrically more diverse and functionally more effective. It therefore forms a promising foundation for the generative design of 3D-printed, functional mechanisms.

Submitted 24 March, 2019; originally announced March 2019.

Comments: 7 pages, GECCO 2019

Showing 1–12 of 12 results for author: Wolfe, C R