Showing 1–16 of 16 results for author: Elhoushi, M

Searching in archive cs.
  1. arXiv:2407.00434  [pdf, other]

    cs.CL

    Brevity is the soul of wit: Pruning long files for code generation

    Authors: Aaditya K. Singh, Yu Yang, Kushal Tirumala, Mostafa Elhoushi, Ari S. Morcos

    Abstract: Data curation is commonly considered a "secret-sauce" for LLM training, with higher quality data usually leading to better LLM performance. Given the scale of internet-scraped corpora, data pruning has become a larger and larger focus. Specifically, many have shown that de-duplicating data, or sub-selecting higher quality data, can lead to efficiency or performance improvements. Generally, three t…

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: 15 pages, 5 figures

  2. arXiv:2404.16710  [pdf, other]

    cs.CL cs.AI cs.LG

    LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

    Authors: Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu

    Abstract: We present LayerSkip, an end-to-end solution to speed-up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exi…

    Submitted 29 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Code open sourcing is in progress

  3. arXiv:2403.08058  [pdf, other]

    cs.LG cs.CL

    CHAI: Clustered Head Attention for Efficient LLM Inference

    Authors: Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu

    Abstract: Large Language Models (LLMs) with hundreds of billions of parameters have transformed the field of machine learning. However, serving these models at inference time is both compute and memory intensive, where a single request can require multiple GPUs and tens of Gigabytes of memory. Multi-Head Attention is one of the key components of LLMs, which can account for over 50% of LLMs memory and comput…

    Submitted 27 April, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

  4. arXiv:2403.04814  [pdf, other]

    cs.CL cs.AI cs.LG cs.SE

    Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks

    Authors: Linyuan Gong, Sida Wang, Mostafa Elhoushi, Alvin Cheung

    Abstract: We introduce Syntax-Aware Fill-In-the-Middle (SAFIM), a new benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. This benchmark focuses on syntax-aware completions of program structures such as code blocks and conditional expressions, and includes 17,720 examples from multiple programming languages, sourced from recent code submissions after April 2022 t…

    Submitted 22 June, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: 22 pages; ICML 2024 Oral: https://icml.cc/virtual/2024/oral/35482

  5. arXiv:2401.06145  [pdf, other]

    cs.DC cs.CV cs.LG cs.PF

    Minuet: Accelerating 3D Sparse Convolutions on GPUs

    Authors: Jiacheng Yang, Christina Giannoula, Jun Wu, Mostafa Elhoushi, James Gleeson, Gennady Pekhimenko

    Abstract: Sparse Convolution (SC) is widely used for processing 3D point clouds that are inherently sparse. Different from dense convolution, SC preserves the sparsity of the input point cloud by only allowing outputs to specific locations. To efficiently compute SC, prior SC engines first use hash tables to build a kernel map that stores the necessary General Matrix Multiplication (GEMM) operations to be e…

    Submitted 1 December, 2023; originally announced January 2024.

  6. arXiv:2401.03003  [pdf, other]

    cs.SE cs.CL cs.LG

    AST-T5: Structure-Aware Pretraining for Code Generation and Understanding

    Authors: Linyuan Gong, Mostafa Elhoushi, Alvin Cheung

    Abstract: Large language models (LLMs) have made significant advancements in code-related tasks, yet many LLMs treat code as simple sequences, neglecting its structured nature. We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding. Using dynamic programming, our AST-Aware Segmentation retains code struct…

    Submitted 22 June, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

    Comments: 15 pages; ICML 2024: https://icml.cc/virtual/2024/poster/33601

  7. arXiv:2312.02418  [pdf, other]

    cs.CL cs.AI cs.LG

    Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data

    Authors: Yu Yang, Aaditya K. Singh, Mostafa Elhoushi, Anas Mahmoud, Kushal Tirumala, Fabian Gloeckle, Baptiste Rozière, Carole-Jean Wu, Ari S. Morcos, Newsha Ardalani

    Abstract: Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of Large Language Models (LLMs) optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety,…

    Submitted 4 December, 2023; originally announced December 2023.

    Comments: 12 pages, 4 figures, Oral Presentation at 3rd Workshop on Efficient Natural Language and Speech Processing (ENLSP-III), NeurIPS 2023

  8. arXiv:2310.02110  [pdf, other]

    cs.CV

    Sieve: Multimodal Dataset Pruning Using Image Captioning Models

    Authors: Anas Mahmoud, Mostafa Elhoushi, Amro Abbas, Yu Yang, Newsha Ardalani, Hugh Leather, Ari Morcos

    Abstract: Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets. This underscores the critical need for dataset pruning, as the quality of these datasets is strongly correlated with the performance of VLMs on downstream tasks. Using CLIPScore from a pretrained model to only train models using highly-aligned samples is one of the most successful methods for pruning. W…

    Submitted 10 March, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: Accepted in CVPR 2024

  9. arXiv:2309.07062  [pdf, other]

    cs.PL cs.AI cs.CL cs.LG

    Large Language Models for Compiler Optimization

    Authors: Chris Cummins, Volker Seeker, Dejan Grubisic, Mostafa Elhoushi, Youwei Liang, Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Kim Hazelwood, Gabriel Synnaeve, Hugh Leather

    Abstract: We explore the novel application of Large Language Models to code optimization. We present a 7B-parameter transformer model trained from scratch to optimize LLVM assembly for code size. The model takes as input unoptimized assembly and outputs a list of compiler options to best optimize the program. Crucially, during training, we ask the model to predict the instruction counts before and after opt…

    Submitted 11 September, 2023; originally announced September 2023.

  10. arXiv:2301.05104  [pdf, other]

    cs.PL cs.AI cs.LG

    Learning Compiler Pass Orders using Coreset and Normalized Value Prediction

    Authors: Youwei Liang, Kevin Stone, Ali Shameli, Chris Cummins, Mostafa Elhoushi, Jiadong Guo, Benoit Steiner, Xiaomeng Yang, Pengtao Xie, Hugh Leather, Yuandong Tian

    Abstract: Finding the optimal pass sequence of compilation can lead to a significant reduction in program size and/or improvement in program efficiency. Prior works on compilation pass ordering have two major drawbacks. They either require an excessive budget (in terms of compilation steps) at compile time or fail to generalize to unseen programs. In this paper, for code-size reduction tasks, we propose a n…

    Submitted 27 January, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

  11. arXiv:2210.12924  [pdf, other]

    cs.LG

    OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks

    Authors: Benoit Steiner, Mostafa Elhoushi, Jacob Kahn, James Hegarty

    Abstract: The size of deep neural networks has grown exponentially in recent years. Unfortunately, hardware devices have not kept pace with the rapidly increasing memory requirements. To cope with this, researchers have turned to techniques such as spilling and recomputation, which increase training time, or reduced precision and model pruning, which can affect model accuracy. We present OLLA, an algorithm…

    Submitted 2 November, 2022; v1 submitted 23 October, 2022; originally announced October 2022.

  12. arXiv:2207.08389  [pdf, other]

    cs.PL cs.AI cs.LG cs.NE cs.PF

    MLGOPerf: An ML Guided Inliner to Optimize Performance

    Authors: Amir H. Ashouri, Mostafa Elhoushi, Yuzhe Hua, Xiang Wang, Muhammad Asif Manzoor, Bryan Chan, Yaoqing Gao

    Abstract: For the past 25 years, we have witnessed an extensive application of Machine Learning to the Compiler space; the selection and the phase-ordering problem. However, limited works have been upstreamed into the state-of-the-art compilers, i.e., LLVM, to seamlessly integrate the former into the optimization pipeline of a compiler to be readily deployed by the user. MLGO was among the first of such pro…

    Submitted 19 July, 2022; v1 submitted 18 July, 2022; originally announced July 2022.

    Comments: Version 2: Added the missing Table 6. The short version of this work is accepted at ACM/IEEE CASES 2022

    ACM Class: I.2.5; D.3.0; I.2.6

  13. arXiv:2110.08232  [pdf, other]

    cs.CV cs.LG

    Fire Together Wire Together: A Dynamic Pruning Approach with Self-Supervised Mask Prediction

    Authors: Sara Elkerdawy, Mostafa Elhoushi, Hong Zhang, Nilanjan Ray

    Abstract: Dynamic model pruning is a recent direction that allows for the inference of a different sub-network for each input sample during deployment. However, current dynamic methods rely on learning a continuous channel gating through regularization by inducing sparsity loss. This formulation introduces complexity in balancing different losses (e.g task loss, regularization loss). In addition, regulariza…

    Submitted 28 June, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

  14. arXiv:2007.05667  [pdf, other]

    cs.CV

    To Filter Prune, or to Layer Prune, That Is The Question

    Authors: Sara Elkerdawy, Mostafa Elhoushi, Abhineet Singh, Hong Zhang, Nilanjan Ray

    Abstract: Recent advances in pruning of neural networks have made it possible to remove a large number of filters or weights without any perceptible drop in accuracy. The number of parameters and that of FLOPs are usually the reported metrics to measure the quality of the pruned models. However, the gain in speed for these pruned models is often overlooked in the literature due to the complex nature of late…

    Submitted 8 November, 2020; v1 submitted 10 July, 2020; originally announced July 2020.

  15. arXiv:1909.05675  [pdf, other]

    cs.CV cs.AI

    Accelerating Training using Tensor Decomposition

    Authors: Mostafa Elhoushi, Ye Henry Tian, Zihao Chen, Farhan Shafiq, Joey Yiwei Li

    Abstract: Tensor decomposition is one of the well-known approaches to reduce the latency time and number of parameters of a pre-trained model. However, in this paper, we propose an approach to use tensor decomposition to reduce training time of training a model from scratch. In our approach, we train the model from scratch (i.e., randomly initialized weights) with its original architecture for a small numbe…

    Submitted 10 September, 2019; originally announced September 2019.

    Journal ref: AAAI 2020 Artificial Intelligence of Things Workshop

  16. arXiv:1905.13298  [pdf, other]

    cs.LG cs.NE

    DeepShift: Towards Multiplication-Less Neural Networks

    Authors: Mostafa Elhoushi, Zihao Chen, Farhan Shafiq, Ye Henry Tian, Joey Yiwei Li

    Abstract: The high computation, memory, and power budgets of inferring convolutional neural networks (CNNs) are major bottlenecks of model deployment to edge computing platforms, e.g., mobile devices and IoT. Moreover, training CNNs is time and energy-intensive even on high-grade servers. Convolution layers and fully connected layers, because of their intense use of multiplications, are the dominant contrib…

    Submitted 7 July, 2021; v1 submitted 30 May, 2019; originally announced May 2019.

    Comments: - Added results for 8-bit and 16-bit fixed point activations, as well as 5-bit, 4-bit, 3-bit, and 2-bit weights - Added link to GitHub code - Updated and fixed the training algorithm - Introduced 2 approaches for backward and forward passes - Showed better results for training from scratch on CIFAR10 and Imagenet - Added implementation on NVIDIA's GPU - Accepted in CVPR Mobile AI 2021 Workshop

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021