-
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Authors:
Shaden Smith,
Mostofa Patwary,
Brandon Norick,
Patrick LeGresley,
Samyam Rajbhandari,
Jared Casper,
Zhun Liu,
Shrimai Prabhumoye,
George Zerveas,
Vijay Korthikanti,
Elton Zhang,
Rewon Child,
Reza Yazdani Aminabadi,
Julie Bernauer,
Xia Song,
Mohammad Shoeybi,
Yuxiong He,
Michael Houston,
Saurabh Tiwary,
Bryan Catanzaro
Abstract:
Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe are key ingredients in the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG. We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results. We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generation.
Submitted 4 February, 2022; v1 submitted 28 January, 2022;
originally announced January 2022.
-
Strategies for Maximizing Detection Rate in Radio SETI
Authors:
Kenneth M. Houston,
Andrew P. V. Siemion,
Steve Croft
Abstract:
The Search for Extraterrestrial Intelligence (SETI) is a scientific and cultural effort seeking evidence of intelligent life beyond Earth. Radio SETI observes the radio spectrum for "technosignatures" that could be produced by an advanced ET society. This work models radio SETI as an end-to-end system, and focuses on narrow-band intentional transmissions. We look at strategies to maximize the expected number of detections per year (DPY) of search. Assuming that ET civilizations will be associated with star systems, we want to maximize the number of stars that may be observed at one time. Assuming a representative star density, this requires maximizing the search volume in a cone defined by the detection range and field of view (FOV). The parameter trades are modified from the case where one simply maximizes signal-to-noise ratio. Instead, a joint optimization between FOV and sensitivity is needed. Some implications: 1) Instead of focusing on the terrestrial microwave window of 1-10 GHz, frequencies below 1 GHz may be optimal for detection rate due to the larger field of view; 2) Arrays of smaller dishes should be favored compared to a single dish of equivalent area; 3) Aperture arrays are desirable due to their large potential FOV. Many radio telescopes under development will provide both high sensitivity and large FOV, and should offer much improved SETI detection rates. Still higher DPY is needed, however, to achieve results in reasonable time horizons, which should be possible by greatly expanding computation capability to the next-generation wide-FOV antenna arrays.
Submitted 11 June, 2021;
originally announced June 2021.
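The trade between field of view and sensitivity described in this abstract can be illustrated with a rough sketch. Everything below is an assumption made for illustration and is not taken from the paper: the FOV solid angle is approximated as (wavelength / dish diameter)^2, the minimum detectable flux is taken to be inversely proportional to total collecting area through a hypothetical constant `k_w`, and the search volume is the spherical-sector volume `omega * r**3 / 3`. Frequency-dependent effects such as sky noise are ignored.

```python
import math


def search_volume(wavelength_m, dish_diameter_m, n_dishes, eirp_w, k_w):
    """Rough search volume (m^3) of one pointing, under the assumptions above.

    k_w is a hypothetical receiver constant: minimum detectable flux is
    modelled as k_w / total_collecting_area (an illustrative simplification).
    """
    # Total collecting area of the array (sets sensitivity).
    area_m2 = n_dishes * math.pi * (dish_diameter_m / 2.0) ** 2
    min_flux_w_m2 = k_w / area_m2
    # Range at which an isotropic-equivalent transmitter drops to min_flux.
    detection_range_m = math.sqrt(eirp_w / (4.0 * math.pi * min_flux_w_m2))
    # FOV is set by the individual dish size, not by the total area.
    fov_sr = (wavelength_m / dish_diameter_m) ** 2
    # Volume of a cone-like spherical sector with this solid angle and range.
    return fov_sr * detection_range_m ** 3 / 3.0


# Same total collecting area: one 100 m dish versus one hundred 10 m dishes.
single_dish = search_volume(0.21, 100.0, 1, 1e13, 1e-20)
dish_array = search_volume(0.21, 10.0, 100, 1e13, 1e-20)
array_gain = dish_array / single_dish

# Doubling the wavelength (halving the frequency) with the same single dish.
low_freq = search_volume(0.42, 100.0, 1, 1e13, 1e-20)
wavelength_gain = low_freq / single_dish
```

Under these simplified assumptions the array keeps the same detection range but sees a wider cone, so the expected star count (star density times volume) grows with the number of dishes. A longer wavelength likewise widens the cone.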
-
Highly-scalable, physics-informed GANs for learning solutions of stochastic PDEs
Authors:
Liu Yang,
Sean Treichler,
Thorsten Kurth,
Keno Fischer,
David Barajas-Solano,
Josh Romero,
Valentin Churavy,
Alexandre Tartakovsky,
Michael Houston,
Prabhat,
George Karniadakis
Abstract:
Uncertainty quantification for forward and inverse problems is a central challenge across physical and biomedical disciplines. We address this challenge for the problem of modeling subsurface flow at the Hanford Site by combining stochastic computational models with observational data using physics-informed GAN models. The geographic extent, spatial heterogeneity, and multiple correlation length scales of the Hanford Site require training a computationally intensive GAN model to thousands of dimensions. We develop a hierarchical scheme for exploiting domain parallelism, map discriminators and generators to multiple GPUs, and employ efficient communication schemes to ensure training stability and convergence. We developed a highly optimized implementation of this scheme that scales to 27,500 NVIDIA Volta GPUs and 4584 nodes on the Summit supercomputer with a 93.1% scaling efficiency, achieving peak and sustained half-precision rates of 1228 PF/s and 1207 PF/s.
Submitted 28 October, 2019;
originally announced October 2019.
-
Exascale Deep Learning for Climate Analytics
Authors:
Thorsten Kurth,
Sean Treichler,
Joshua Romero,
Mayur Mudigonda,
Nathan Luehr,
Everett Phillips,
Ankur Mahesh,
Michael Matheson,
Jack Deslippe,
Massimiliano Fatica,
Prabhat,
Michael Houston
Abstract:
We extract pixel-level masks of extreme weather patterns using variants of Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and the network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput of 21.0 PF/s and parallel efficiency of 79.0%. DeepLabv3+ scales up to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor Cores, a half-precision version of the DeepLabv3+ network achieves a peak and sustained throughput of 1.13 EF/s and 999.0 PF/s respectively.
Submitted 3 October, 2018;
originally announced October 2018.
-
Mixed Precision Training
Authors:
Paulius Micikevicius,
Sharan Narang,
Jonah Alben,
Gregory Diamos,
Erich Elsen,
David Garcia,
Boris Ginsburg,
Michael Houston,
Oleksii Kuchaiev,
Ganesh Venkatesh,
Hao Wu
Abstract:
Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increase. We introduce a technique to train deep neural networks using half-precision floating-point numbers. In our technique, weights, activations and gradients are stored in IEEE half-precision format. Half-precision floating-point numbers have limited numerical range compared to single-precision numbers. We propose two techniques to handle this loss of information. Firstly, we recommend maintaining a single-precision copy of the weights that accumulates the gradients after each optimizer step. This single-precision copy is rounded to half-precision format during training. Secondly, we propose scaling the loss appropriately to handle the loss of information with half-precision gradients. We demonstrate that this approach works for a wide variety of models including convolutional neural networks, recurrent neural networks and generative adversarial networks. This technique works for large-scale models with more than 100 million parameters trained on large datasets. Using this approach, we can reduce the memory consumption of deep learning models by nearly 2x. In future processors, we can also expect a significant computation speedup using half-precision hardware units.
Submitted 15 February, 2018; v1 submitted 10 October, 2017;
originally announced October 2017.
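The two techniques named in this abstract, a higher-precision master copy of the weights and loss scaling, can be illustrated with a toy sketch. It is not the authors' implementation. It assumes scalar "weights" and "gradients", uses the standard-library `struct` format `"e"` to round values to IEEE half precision, and lets an ordinary Python float (double precision) stand in for the single-precision master copy. The scale factor 1024 and the update size are arbitrary illustrative values.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


# 1) Loss scaling: a tiny gradient underflows to zero in half precision,
#    but survives if the loss (and hence the gradient) is scaled up first.
tiny_grad = 1e-8
unscaled_half = to_half(tiny_grad)            # below the smallest half value
LOSS_SCALE = 1024.0                           # illustrative, not from the paper
scaled_half = to_half(tiny_grad * LOSS_SCALE)
recovered_grad = scaled_half / LOSS_SCALE     # unscale in higher precision


# 2) Master weights: small updates vanish when added directly to a
#    half-precision weight, but accumulate in a higher-precision copy.
def train(steps: int, update: float, use_master_copy: bool) -> float:
    master = 1.0                  # stand-in for the single-precision copy
    weight_half = to_half(master)
    for _ in range(steps):
        if use_master_copy:
            master += update                  # accumulate in high precision
            weight_half = to_half(master)     # round for the next forward pass
        else:
            weight_half = to_half(weight_half + update)
    return weight_half


without_master = train(100, 1e-4, use_master_copy=False)
with_master = train(100, 1e-4, use_master_copy=True)
```

Near 1.0 the spacing between adjacent half-precision values is about 0.001, so each 1e-4 update is rounded away when applied to the half-precision weight directly. The master copy accumulates the updates until they become visible after rounding.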
-
A Review of Mathematical Models for Muscular Dystrophy: A Systems Biology Approach
Authors:
Amanda N. Cameron,
Matthew T. Houston,
Juan B. Gutierrez
Abstract:
Muscular dystrophy (MD) describes generalized progressive muscular weakness due to the wasting of muscle fibers. The progression of the disease is affected by known immunological and mechanical factors, and possibly other unknown mechanisms. These dynamics have begun to be elucidated in the last two decades. This article reviews mathematical models of MD that characterize molecular and cellular components implicated in MD progression. A biological background for these processes is also presented. Molecular effectors that contribute to MD include mitochondrial bioenergetics and genetic factors; both drive cellular metabolism, communication and signaling. These molecular events leave cells vulnerable to mechanical stress which can activate an immunological cascade that weakens cells and surrounding tissues. This review article lays the foundation for a systems biology approach to study MD progression.
Submitted 28 October, 2016; v1 submitted 11 October, 2016;
originally announced October 2016.
-
N-Body Simulations on GPUs
Authors:
Erich Elsen,
V. Vishal,
Mike Houston,
Vijay Pande,
Pat Hanrahan,
Eric Darve
Abstract:
Commercial graphics processors (GPUs) have high compute capacity at very low cost, which makes them attractive for general-purpose scientific computing. In this paper we show how graphics processors can be used for N-body simulations to obtain improvements in performance over current-generation CPUs. We have developed a highly optimized algorithm for performing the O(N^2) force calculations that constitute the major part of stellar and molecular dynamics simulations. In some of the calculations, we achieve sustained performance of nearly 100 GFlops on an ATI X1900XTX. The performance on GPUs is comparable to specialized processors such as GRAPE-6A and MDGRAPE-3, but at a fraction of the cost. Furthermore, the wide availability of GPUs has significant implications for cluster computing and distributed computing efforts like Folding@Home.
Submitted 20 June, 2007;
originally announced June 2007.
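For readers unfamiliar with the O(N^2) force calculation this abstract refers to, a minimal CPU reference sketch follows. It is not the paper's GPU algorithm. It assumes Newtonian gravity with a small softening term to avoid division by zero, and the function name, the unit gravitational constant, and the softening value are all illustrative choices.

```python
def accelerations(positions, masses, g_const=1.0, softening=1e-3):
    """Direct-sum gravitational accelerations: every body against every other.

    positions: list of (x, y, z) tuples; masses: list of floats.
    The double loop over all pairs is what makes the cost O(N^2).
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Separation vector pointing from body i to body j.
            dx = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(d * d for d in dx) + softening ** 2
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += g_const * masses[j] * dx[k] * inv_r3
    return acc


# Two equal masses one unit apart attract each other with opposite accelerations.
pair_acc = accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0])
```

Each of the N bodies accumulates a contribution from the other N - 1, and the inner loop iterations are independent of one another, which is the property that makes this computation a natural fit for data-parallel hardware.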