Search | arXiv e-print repository

Training and inference of large language models using 8-bit floating point

Authors: Sergio P. Perez, Yan Zhang, James Briggs, Charlie Blake, Josh Levy-Kramer, Paul Balanca, Carlo Luschi, Stephen Barlow, Andrew William Fitzgibbon

Abstract: FP8 formats are gaining popularity to boost the computational efficiency for training and inference of large deep learning models. Their main challenge is that a careful choice of scaling is needed to prevent degradation due to the reduced dynamic range compared to higher-precision formats. Although there exists ample literature about selecting such scalings for INT formats, this critical aspect h… ▽ More FP8 formats are gaining popularity to boost the computational efficiency for training and inference of large deep learning models. Their main challenge is that a careful choice of scaling is needed to prevent degradation due to the reduced dynamic range compared to higher-precision formats. Although there exists ample literature about selecting such scalings for INT formats, this critical aspect has yet to be addressed for FP8. This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations. We apply this methodology to train and validate large language models of the type of GPT and Llama 2 using FP8, for model sizes ranging from 111M to 70B. To facilitate the understanding of the FP8 dynamics, our results are accompanied by plots of the per-tensor scale distribution for weights, activations and gradients during both training and inference. △ Less

Submitted 29 September, 2023; originally announced September 2023.

ACM Class: I.2.7; B.2.4

arXiv:2107.02027 [pdf, other]

Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance

Authors: Mario Michael Krell, Matej Kosec, Sergio P. Perez, Andrew Fitzgibbon

Abstract: Effective training of today's large language models (LLMs) depends on large batches and long sequences for throughput and accuracy. To handle variable-length sequences on hardware accelerators, it is common practice to introduce padding tokens, so that all sequences in a batch have the same length. We show in this paper that the variation in sequence lengths in common NLP datasets is such that up… ▽ More Effective training of today's large language models (LLMs) depends on large batches and long sequences for throughput and accuracy. To handle variable-length sequences on hardware accelerators, it is common practice to introduce padding tokens, so that all sequences in a batch have the same length. We show in this paper that the variation in sequence lengths in common NLP datasets is such that up to 50% of all tokens can be padding. In less common, but not extreme, cases (e.g. GLUE-cola with sequence length 128), the ratio is up to 89%. Existing methods to address the resulting inefficiency are complicated by the need to avoid cross-contamination in self-attention, by a reduction in accuracy when sequence ordering information is lost, or by customized kernel implementations only valid for specific accelerators. This paper introduces a new formalization of sequence packing in the context of the well-studied bin packing problem, and presents new algorithms based on this formulation which, for example, confer a 2x speedup for phase 2 pre-training in BERT. We show how existing models can be adapted to ensure mathematical equivalence between the original and packed models, meaning that packed models can be trained with existing pre-training and fine-tuning practices. △ Less

Submitted 5 October, 2022; v1 submitted 29 June, 2021; originally announced July 2021.

Comments: Significantly new version with different authors and much more content. Much larger variety in experiments and exhaustive SOTA analysis

MSC Class: 05-08 ACM Class: I.2.7; G.2.1

arXiv:2007.10753 [pdf, other]

Enhancement of damaged-image prediction through Cahn-Hilliard Image Inpainting

Authors: José A. Carrillo, Serafim Kalliadasis, Fuyue Liang, Sergio P. Perez

Abstract: We assess the benefit of including an image inpainting filter before passing damaged images into a classification neural network. For this we employ a modified Cahn-Hilliard equation as an image inpainting filter, which is solved via a finite volume scheme with reduced computational cost and adequate properties for energy stability and boundedness. The benchmark dataset employed here is MNIST, whi… ▽ More We assess the benefit of including an image inpainting filter before passing damaged images into a classification neural network. For this we employ a modified Cahn-Hilliard equation as an image inpainting filter, which is solved via a finite volume scheme with reduced computational cost and adequate properties for energy stability and boundedness. The benchmark dataset employed here is MNIST, which consists of binary images of handwritten digits and is a standard dataset to validate image-processing methodologies. We train a neural network based of dense layers with the training set of MNIST, and subsequently we contaminate the test set with damage of different types and intensities. We then compare the prediction accuracy of the neural network with and without applying the Cahn-Hilliard filter to the damaged images test. Our results quantify the significant improvement of damaged-image prediction due to applying the Cahn-Hilliard filter, which for specific damages can increase up to 50% and is in general advantageous for low to moderate damage. △ Less

Submitted 15 March, 2021; v1 submitted 21 July, 2020; originally announced July 2020.

Comments: An interactive jupyter notebook with the code of this work is available at https://github.com/sergiopperez/Image_Inpainting. The MNIST dataset employed in this work can be downloaded from http://yann.lecun.com/exdb/mnist/

MSC Class: 68U10; 94A08; 65M22; 76M25; 76M12

Showing 1–3 of 3 results for author: Perez, S P