
Showing 1–19 of 19 results for author: Dettmers, T

  1. arXiv:2312.08361  [pdf, other]

    cs.LG cs.DC

    Distributed Inference and Fine-tuning of Large Language Models Over The Internet

    Authors: Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel

    Abstract: Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategie…

    Submitted 13 December, 2023; originally announced December 2023.

    Comments: Accepted to Conference on Neural Information Processing Systems (NeurIPS) 2023. 20 pages, 3 figures

  2. arXiv:2310.07707  [pdf, other]

    cs.LG cs.CL cs.CV

    MatFormer: Nested Transformer for Elastic Inference

    Authors: Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain

    Abstract: Transformer models are deployed in a wide range of settings, from multi-accelerator clusters to standalone mobile phones. The diverse inference constraints in these scenarios necessitate practitioners to train foundation models such as PaLM 2, Llama, & ViTs as a series of models of varying sizes. Due to significant training costs, only a select few model sizes are trained and supported, limiting m…

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: 31 pages, 12 figures, first three authors contributed equally

  3. arXiv:2306.03078  [pdf, other]

    cs.CL cs.LG

    SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

    Authors: Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh

    Abstract: Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especiall…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Extended preprint

  4. arXiv:2305.14314  [pdf, other]

    cs.LG

    QLoRA: Efficient Finetuning of Quantized LLMs

    Authors: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

    Abstract: We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly rel…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: Extended NeurIPS submission

  5. arXiv:2305.13999  [pdf, other]

    cs.CL cs.LG

    Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model

    Authors: Zeyu Leo Liu, Tim Dettmers, Xi Victoria Lin, Veselin Stoyanov, Xian Li

    Abstract: Large and sparse feed-forward layers (S-FFN) such as Mixture-of-Experts (MoE) have proven effective in scaling up Transformers model size for pretraining large language models. By only activating part of the FFN parameters conditioning on input, S-FFN improves generalization performance while keeping training and inference costs (in FLOPs) fixed. In this work, we analyzed two major design…

    Submitted 23 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted to EMNLP 2023

  6. arXiv:2304.13013  [pdf, other]

    cs.LG cs.CV

    Stable and low-precision training for large-scale vision-language models

    Authors: Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, Ludwig Schmidt

    Abstract: We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B parameter CLIP ViT-Huge -- the largest int8 training to date. Our main focus…

    Submitted 16 October, 2023; v1 submitted 25 April, 2023; originally announced April 2023.

    Comments: NeurIPS 2023

  7. arXiv:2301.11913  [pdf, other]

    cs.DC cs.LG

    SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

    Authors: Max Ryabinin, Tim Dettmers, Michael Diskin, Alexander Borzunov

    Abstract: Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel…

    Submitted 29 June, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

    Comments: Accepted to International Conference on Machine Learning (ICML) 2023. 25 pages, 8 figures

  8. arXiv:2212.09720  [pdf, other]

    cs.LG cs.NE

    The case for 4-bit precision: k-bit Inference Scaling Laws

    Authors: Tim Dettmers, Luke Zettlemoyer

    Abstract: Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and inference latencies. However, the final model size depends on both the number of parameters of the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different…

    Submitted 27 February, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

  9. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  10. arXiv:2209.01188  [pdf, other]

    cs.LG cs.DC

    Petals: Collaborative Inference and Fine-tuning of Large Models

    Authors: Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, Colin Raffel

    Abstract: Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion parameters. With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale. Still, using these models requires high-end hardware unavailable to many researchers. In some cases, LLMs can be used more affordably via RAM offloading or hosted APIs. However, these tec…

    Submitted 2 March, 2023; v1 submitted 2 September, 2022; originally announced September 2022.

    Comments: 10 pages, 4 figures. The version 2 updates the benchmarks and the description of the chat application. Source code and docs: https://petals.ml

  11. arXiv:2208.07339  [pdf, other]

    cs.LG cs.AI

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    Authors: Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

    Abstract: Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cut the memory needed for inference by half while retaining full precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8,…

    Submitted 10 November, 2022; v1 submitted 15 August, 2022; originally announced August 2022.

    Comments: Published at NeurIPS 2022. Camera-ready version

  12. arXiv:2208.03306  [pdf, other]

    cs.CL

    Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

    Authors: Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, Luke Zettlemoyer

    Abstract: We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent expert LMs (ELMs), each spec…

    Submitted 5 August, 2022; originally announced August 2022.

  13. arXiv:2207.03481  [pdf, other]

    cs.LG cs.DC

    Training Transformers Together

    Authors: Alexander Borzunov, Max Ryabinin, Tim Dettmers, Quentin Lhoest, Lucile Saulnier, Michael Diskin, Yacine Jernite, Thomas Wolf

    Abstract: The infrastructure necessary for training state-of-the-art models is becoming overly expensive, which makes training such models affordable only to large corporations and institutions. Recent work proposes several methods for training such models collaboratively, i.e., by pooling together hardware from many independent parties and training a shared model over the Internet. In this demonstration, w…

    Submitted 7 July, 2022; originally announced July 2022.

    Comments: Accepted to NeurIPS 2021 Demonstration Track. 10 pages, 2 figures. Link: https://training-transformers-together.github.io

  14. arXiv:2110.02861  [pdf, other]

    cs.LG

    8-bit Optimizers via Block-wise Quantization

    Authors: Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer

    Abstract: Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In t…

    Submitted 20 June, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: ICLR2022 spotlight version

  15. arXiv:2103.16716  [pdf, other]

    cs.CL

    BASE Layers: Simplifying Training of Large, Sparse Models

    Authors: Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer

    Abstract: We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. However, it can be difficult to learn balanced routing fu…

    Submitted 30 March, 2021; originally announced March 2021.

  16. arXiv:1907.04840  [pdf, other]

    cs.LG cs.NE stat.ML

    Sparse Networks from Scratch: Faster Training without Losing Performance

    Authors: Tim Dettmers, Luke Zettlemoyer

    Abstract: We demonstrate the possibility of what we call sparse learning: accelerated training of deep neural networks that maintain sparse weights throughout training while achieving dense performance levels. We accomplish this by developing sparse momentum, an algorithm which uses exponentially smoothed gradients (momentum) to identify layers and weights which reduce the error efficiently. Sparse momentum…

    Submitted 23 August, 2019; v1 submitted 10 July, 2019; originally announced July 2019.

    Comments: 9 page NeurIPS 2019 submission

  17. arXiv:1806.08727  [pdf, ps, other]

    cs.CL cs.LG stat.ML

    Jack the Reader - A Machine Reading Framework

    Authors: Dirk Weissenborn, Pasquale Minervini, Tim Dettmers, Isabelle Augenstein, Johannes Welbl, Tim Rocktäschel, Matko Bošnjak, Jeff Mitchell, Thomas Demeester, Pontus Stenetorp, Sebastian Riedel

    Abstract: Many Machine Reading and Natural Language Understanding tasks require reading supporting text in order to answer questions. For example, in Question Answering, the supporting text can be newswire or Wikipedia articles; in Natural Language Inference, premises can be seen as the supporting text and hypotheses as questions. Providing a set of useful primitives operating in a single framework of relat…

    Submitted 19 June, 2018; originally announced June 2018.

    Comments: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2018), System Demonstrations

  18. arXiv:1707.01476  [pdf, other]

    cs.LG

    Convolutional 2D Knowledge Graph Embeddings

    Authors: Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel

    Abstract: Link prediction for knowledge graphs is the task of predicting missing relationships between entities. Previous work on link prediction has focused on shallow, fast models which can scale to large knowledge graphs. However, these models learn less expressive features than deep, multi-layer models -- which potentially limits performance. In this work, we introduce ConvE, a multi-layer convolutional…

    Submitted 4 July, 2018; v1 submitted 5 July, 2017; originally announced July 2017.

    Comments: Extended AAAI2018 paper

  19. arXiv:1511.04561  [pdf, other]

    cs.NE cs.LG

    8-Bit Approximations for Parallelism in Deep Learning

    Authors: Tim Dettmers

    Abstract: The creation of practical deep learning data-products often requires parallelization across processors and computers to make deep learning feasible on large data sets, but bottlenecks in communication bandwidth make it difficult to attain good speedups through parallelism. Here we develop and test 8-bit approximation algorithms which make better use of the available bandwidth by compressing 32-bit…

    Submitted 19 February, 2016; v1 submitted 14 November, 2015; originally announced November 2015.