Skip to main content

Showing 1–16 of 16 results for author: Anthony, Q

.
  1. arXiv:2406.01981  [pdf, other

    cs.CL cs.AI

    Zyda: A 1.3T Dataset for Open Language Modeling

    Authors: Yury Tokpanov, Beren Millidge, Paolo Glorioso, Jonathan Pilault, Adam Ibrahim, James Whittington, Quentin Anthony

    Abstract: The size of large language models (LLMs) has scaled dramatically in recent years and their computational and data requirements have surged correspondingly. State-of-the-art language models, even at relatively smaller sizes, typically require training on at least a trillion tokens. This rapid advancement has eclipsed the growth of open-source datasets available for large-scale LLM pretraining. In t… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

  2. arXiv:2405.16712  [pdf, other

    cs.LG cs.AI cs.CL

    Zamba: A Compact 7B SSM Hybrid Model

    Authors: Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge

    Abstract: In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which achieves competitive performance against leading open-weight models at a comparable scale. Zamba is trained on 1T tokens from openly available datasets and is the best non-transformer model at this scale. Zamba pioneers a unique architecture combining a Mamba backbone with a single shared attention module, th… ▽ More

    Submitted 26 May, 2024; originally announced May 2024.

  3. arXiv:2404.05892  [pdf, other

    cs.CL cs.AI

    Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

    Authors: Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou, Przemysław Kazienko, Kranthi Kiran GV, Jan Kocoń, Bartłomiej Koptyra, Satyapriya Krishna, Ronald McClelland Jr., Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Stanisław Woźniak, Ruichong Zhang, Bingchen Zhao, Qihang Zhao , et al. (3 additional authors not shown)

    Abstract: We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs. We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokeni… ▽ More

    Submitted 10 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

  4. arXiv:2403.08763  [pdf, other

    cs.LG cs.AI cs.CL

    Simple and Scalable Strategies to Continually Pre-train Large Language Models

    Authors: Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish

    Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptati… ▽ More

    Submitted 26 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  5. arXiv:2402.01771  [pdf, other

    cs.CL cs.AI cs.DC cs.LG

    BlackMamba: Mixture of Experts for State-Space Models

    Authors: Quentin Anthony, Yury Tokpanov, Paolo Glorioso, Beren Millidge

    Abstract: State-space models (SSMs) have recently demonstrated competitive performance to transformers at large-scale language modeling benchmarks while achieving linear time and memory complexity as a function of sequence length. Mamba, a recently released SSM model, shows impressive performance in both language modeling and long sequence processing tasks. Simultaneously, mixture-of-expert (MoE) models hav… ▽ More

    Submitted 1 February, 2024; originally announced February 2024.

  6. arXiv:2402.00691  [pdf, other

    cs.DC

    Comparative Study of Large Language Model Architectures on Frontier

    Authors: Junqi Yin, Avishek Bose, Guo**g Cong, Isaac Lyngaas, Quentin Anthony

    Abstract: Large language models (LLMs) have garnered significant attention in both the AI community and beyond. Among these, the Generative Pre-trained Transformer (GPT) has emerged as the dominant architecture, spawning numerous variants. However, these variants have undergone pre-training under diverse conditions, including variations in input data, data preprocessing, and training methodologies, resultin… ▽ More

    Submitted 1 February, 2024; originally announced February 2024.

  7. arXiv:2401.14489  [pdf, other

    cs.DC cs.AI

    The Case for Co-Designing Model Architectures with Hardware

    Authors: Quentin Anthony, Jacob Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, Aamir Shafi, Hari Subramoni, Dhabaleswar Panda

    Abstract: While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL model to be more amenable to the target hardware can significantly improve the runtime performance of DL training and inference. In this paper, we provide a set… ▽ More

    Submitted 30 January, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

  8. arXiv:2401.08383  [pdf, other

    cs.LG cs.AI cs.DC

    Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

    Authors: **ghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K., Panda

    Abstract: In large language models like the Generative Pre-trained Transformer, the Mixture of Experts paradigm has emerged as a powerful technique for enhancing model expressiveness and accuracy. However, deploying GPT MoE models for parallel inference on distributed systems presents significant challenges, primarily due to the extensive Alltoall communication required for expert routing and aggregation. T… ▽ More

    Submitted 16 January, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

  9. arXiv:2308.04014  [pdf, other

    cs.CL cs.LG

    Continual Pre-Training of Large Language Models: How to (re)warm your model?

    Authors: Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, Timothée Lesort

    Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data t… ▽ More

    Submitted 6 September, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

  10. arXiv:2305.13048  [pdf, other

    cs.CL cs.AI

    RWKV: Reinventing RNNs for the Transformer Era

    Authors: Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Jiaju Lin, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Bolun Wang , et al. (9 additional authors not shown)

    Abstract: Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scala… ▽ More

    Submitted 10 December, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

  11. arXiv:2304.11158  [pdf, other

    cs.CL

    Emergent and Predictable Memorization in Large Language Models

    Authors: Stella Biderman, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, Edward Raff

    Abstract: Memorization, or the tendency of large language models (LLMs) to output entire sequences from their training data verbatim, is a key concern for safely deploying language models. In particular, it is vital to minimize a model's memorization of sensitive datapoints such as those containing personal identifiable information (PII). The prevalence of such undesirable memorization can pose issues for m… ▽ More

    Submitted 31 May, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

  12. arXiv:2304.01373  [pdf, other

    cs.CL

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    Authors: Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal

    Abstract: How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce \textit{Pythia}, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools… ▽ More

    Submitted 31 May, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

    Comments: Code at https://github.com/EleutherAI/pythia

  13. arXiv:2303.08374  [pdf, other

    cs.DC cs.LG

    MCR-DL: Mix-and-Match Communication Runtime for Deep Learning

    Authors: Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda

    Abstract: In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, and necessitated distribution among processors. Training such massive models necessitates advanced parallelism strategies to maintain efficiency. However, such distributed DL parallelism strategies require a varied mixture of co… ▽ More

    Submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted, to be presented at IPDPS 2023

  14. arXiv:2204.06745  [pdf, other

    cs.CL

    GPT-NeoX-20B: An Open-Source Autoregressive Language Model

    Authors: Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach

    Abstract: We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission. In this work, we describe \model{}'s architecture and trainin… ▽ More

    Submitted 14 April, 2022; originally announced April 2022.

    Comments: To appear in the Proceedings of the ACL Workshop on Challenges & Perspectives in Creating Large Language Models

  15. arXiv:2109.08329  [pdf, other

    cs.GR cs.DC cs.PF

    Cross-layer Visualization and Profiling of Network and I/O Communication for HPC Clusters

    Authors: Pouya Kousha, Quentin Anthony, Hari Subramoni, Dhabaleswar K. Panda

    Abstract: Understanding and visualizing the full-stack performance trade-offs and interplay between HPC applications, MPI libraries, the communication fabric, and the file system is a challenging endeavor. Designing a holistic profiling and visualization method for HPC communication networks is challenging since different levels of communication coexist and interact with each other on the communication fabr… ▽ More

    Submitted 16 September, 2021; originally announced September 2021.

    Comments: 11 pages, under submission

  16. arXiv:1911.05146  [pdf, other

    cs.DC cs.AI cs.LG cs.PF

    HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow

    Authors: Ammar Ahmad Awan, Arpan Jain, Quentin Anthony, Hari Subramoni, Dhabaleswar K. Panda

    Abstract: To reduce training time of large-scale DNNs, scientists have started to explore parallelization strategies like data-parallelism, model-parallelism, and hybrid-parallelism. While data-parallelism has been extensively studied and developed, several problems exist in realizing model-parallelism and hybrid-parallelism efficiently. Four major problems we focus on are: 1) defining a notion of a distrib… ▽ More

    Submitted 19 February, 2020; v1 submitted 12 November, 2019; originally announced November 2019.

    Comments: 18 pages, 10 figures, Accepted, to be presented at ISC '20