Showing 1–4 of 4 results for author: Zafrir, O

  1. arXiv:2306.16601  [pdf, other]

    cs.LG cs.AI cs.CL

    An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

    Authors: Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang, Guy Boudoukh, Moshe Wasserblat

    Abstract: In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To mitigate the gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network in…

    Submitted 28 June, 2023; originally announced June 2023.

  2. arXiv:2211.07715  [pdf, other]

    cs.CL cs.AI cs.LG

    Fast DistilBERT on CPUs

    Authors: Haihao Shen, Ofir Zafrir, Bo Dong, Hengyu Meng, Xinyu Ye, Zhe Wang, Yi Ding, Hanwen Chang, Guy Boudoukh, Moshe Wasserblat

    Abstract: Transformer-based language models have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires the maximum throughput to comply with certain latency constraints that prevents Transformer models from being used in production. To address this gap, model compression techniques such as quantization and pruning may be used to improve infere…

    Submitted 6 December, 2022; v1 submitted 27 October, 2022; originally announced November 2022.

    Comments: 9 pages, NeurIPS 2022, ENLSP Workshop

  3. arXiv:2111.05754  [pdf, other]

    cs.CL cs.AI cs.LG

    Prune Once for All: Sparse Pre-Trained Language Models

    Authors: Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat

    Abstract: Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transf…

    Submitted 10 November, 2021; originally announced November 2021.

    Comments: ENLSP NeurIPS Workshop 2021, 12 pages

  4. Q8BERT: Quantized 8Bit BERT

    Authors: Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat

    Abstract: Recently, pre-trained Transformer based language models such as BERT and GPT, have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large amount of parameters. The emergence of even larger and more accurate models such as GPT2 and Megatron, suggest a trend of large pre-trained Transformer models. However, using these large models in productio…

    Submitted 17 October, 2019; v1 submitted 14 October, 2019; originally announced October 2019.

    Comments: 5 Pages, Accepted at the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019