Showing 1–18 of 18 results for author: Wasserblat, M

Searching in archive cs.
  1. arXiv:2405.14105  [pdf, other]

    cs.DC cs.AI cs.CL cs.LG

    Distributed Speculative Inference of Large Language Models

    Authors: Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel

    Abstract: Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI al…

    Submitted 28 June, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

  2. arXiv:2405.04304  [pdf, other]

    cs.CL

    Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models

    Authors: Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz

    Abstract: Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL), the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Op…

    Submitted 23 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

  3. arXiv:2404.10513  [pdf, other]

    cs.CL cs.AI cs.LG

    CoTAR: Chain-of-Thought Attribution Reasoning with Multi-level Granularity

    Authors: Moshe Berchansky, Daniel Fleischer, Moshe Wasserblat, Peter Izsak

    Abstract: State-of-the-art performance in QA tasks is currently achieved by systems employing Large Language Models (LLMs); however, these models tend to hallucinate information in their responses. One approach focuses on enhancing the generation process by incorporating attribution from the given input to the output. However, the challenge of identifying appropriate attributions and verifying their accuracy…

    Submitted 16 April, 2024; originally announced April 2024.

  4. arXiv:2310.13682  [pdf, other]

    cs.CL cs.AI cs.LG

    Optimizing Retrieval-augmented Reader Models via Token Elimination

    Authors: Moshe Berchansky, Peter Izsak, Avi Caciularu, Ido Dagan, Moshe Wasserblat

    Abstract: Fusion-in-Decoder (FiD) is an effective retrieval-augmented language model applied across a variety of open-domain tasks, such as question answering, fact checking, etc. In FiD, supporting passages are first retrieved and then processed using a generative model (Reader), which can cause a significant bottleneck in decoding time, particularly with long outputs. In this work, we analyze the contribu…

    Submitted 5 November, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 Main Conference

  5. arXiv:2306.16601  [pdf, other]

    cs.LG cs.AI cs.CL

    An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

    Authors: Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang, Guy Boudoukh, Moshe Wasserblat

    Abstract: In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To mitigate the gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network in…

    Submitted 28 June, 2023; originally announced June 2023.

  6. arXiv:2211.07715  [pdf, other]

    cs.CL cs.AI cs.LG

    Fast DistilBERT on CPUs

    Authors: Haihao Shen, Ofir Zafrir, Bo Dong, Hengyu Meng, Xinyu Ye, Zhe Wang, Yi Ding, Hanwen Chang, Guy Boudoukh, Moshe Wasserblat

    Abstract: Transformer-based language models have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires the maximum throughput to comply with certain latency constraints that prevent Transformer models from being used in production. To address this gap, model compression techniques such as quantization and pruning may be used to improve infere…

    Submitted 6 December, 2022; v1 submitted 27 October, 2022; originally announced November 2022.

    Comments: 9 pages, NeurIPS 2022, ENLSP Workshop

  7. arXiv:2210.17114  [pdf, other]

    cs.CL

    QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

    Authors: Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen

    Abstract: Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. A knowledge distillation approach addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding. However, the performance of these models drops as we reduce the number of la…

    Submitted 10 May, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: In this version we updated the reference to the source code in the abstract. arXiv admin note: text overlap with arXiv:2111.09645

  8. Cross-Domain Aspect Extraction using Transformers Augmented with Knowledge Graphs

    Authors: Phillip Howard, Arden Ma, Vasudev Lal, Ana Paula Simoes, Daniel Korat, Oren Pereg, Moshe Wasserblat, Gadi Singer

    Abstract: The extraction of aspect terms is a critical step in fine-grained sentiment analysis of text. Existing approaches for this task have yielded impressive results when the training and testing data are from the same domain. However, these methods show a drastic decrease in performance when applied to cross-domain settings where the domain of the testing data differs from that of the training data. To…

    Submitted 18 October, 2022; originally announced October 2022.

    ACM Class: I.2.7

    Journal ref: Proceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM 2022). Association for Computing Machinery, New York, NY, USA, 780-790

  9. arXiv:2209.11055  [pdf, other]

    cs.CL

    Efficient Few-Shot Learning Without Prompts

    Authors: Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, Oren Pereg

    Abstract: Recent few-shot methods, such as parameter-efficient fine-tuning (PEFT) and pattern exploiting training (PET), have achieved impressive results in label-scarce settings. However, they are difficult to employ since they are subject to high variability from manually crafted prompts, and typically require billion-parameter language models to achieve high accuracy. To address these shortcomings, we pr…

    Submitted 22 September, 2022; originally announced September 2022.

  10. arXiv:2204.06271  [pdf, other]

    cs.CL cs.AI

    TangoBERT: Reducing Inference Cost by using Cascaded Architecture

    Authors: Jonathan Mamou, Oren Pereg, Moshe Wasserblat, Roy Schwartz

    Abstract: The remarkable success of large transformer-based models such as BERT, RoBERTa and XLNet in many NLP tasks comes with a large increase in monetary and environmental cost due to their high computational load and energy consumption. In order to reduce this computational load in inference time, we present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient…

    Submitted 13 April, 2022; originally announced April 2022.

  11. arXiv:2111.09645  [pdf, other]

    cs.CL cs.LG

    Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length

    Authors: Shira Guskin, Moshe Wasserblat, Ke Ding, Gyuwan Kim

    Abstract: Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. TinyBERT addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding. However, TinyBERT's performance drops when we reduce the number of layers by 50%, and drops even more…

    Submitted 18 November, 2021; originally announced November 2021.

    Comments: ENLSP NeurIPS Workshop 2021, 7 pages

  12. arXiv:2111.05754  [pdf, other]

    cs.CL cs.AI cs.LG

    Prune Once for All: Sparse Pre-Trained Language Models

    Authors: Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat

    Abstract: Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transf…

    Submitted 10 November, 2021; originally announced November 2021.

    Comments: ENLSP NeurIPS Workshop 2021, 12 pages

  13. arXiv:1910.06294  [pdf, other]

    cs.CL cs.LG

    Training Compact Models for Low Resource Entity Tagging using Pre-trained Language Models

    Authors: Peter Izsak, Shira Guskin, Moshe Wasserblat

    Abstract: Training models on low-resource named entity recognition tasks has been shown to be a challenge, especially in industrial applications where deploying updated models is a continuous effort and crucial for business operations. In such cases there is often an abundance of unlabeled data, while labeled data is scarce or unavailable. Pre-trained language models trained to extract contextual features f…

    Submitted 17 October, 2019; v1 submitted 14 October, 2019; originally announced October 2019.

    Comments: Accepted to the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019

  14. Q8BERT: Quantized 8Bit BERT

    Authors: Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat

    Abstract: Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models such as GPT2 and Megatron suggests a trend of large pre-trained Transformer models. However, using these large models in productio…

    Submitted 17 October, 2019; v1 submitted 14 October, 2019; originally announced October 2019.

    Comments: 5 Pages, Accepted at the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019

  15. arXiv:1909.05608  [pdf, other]

    cs.CL cs.AI

    ABSApp: A Portable Weakly-Supervised Aspect-Based Sentiment Extraction System

    Authors: Oren Pereg, Daniel Korat, Moshe Wasserblat, Jonathan Mamou, Ido Dagan

    Abstract: We present ABSApp, a portable system for weakly-supervised aspect-based sentiment extraction. The system is interpretable and user friendly and does not require labeled training data, hence can be rapidly and cost-effectively used across different domains in applied setups. The system flow includes three stages: First, it generates domain-specific aspect and opinion lexicons based on an unlabeled…

    Submitted 12 September, 2019; originally announced September 2019.

    Comments: 6 pages, demo paper at EMNLP 2019

  16. arXiv:1904.02496  [pdf, ps, other]

    cs.CL cs.IR

    Multi-Context Term Embeddings: the Use Case of Corpus-based Term Set Expansion

    Authors: Jonathan Mamou, Oren Pereg, Moshe Wasserblat, Ido Dagan

    Abstract: In this paper, we present a novel algorithm that combines multi-context term embeddings using a neural classifier and we test this approach on the use case of corpus-based term set expansion. In addition, we present a novel and unique dataset for intrinsic evaluation of corpus-based term set expansion algorithms. We show that, over this dataset, our algorithm provides up to 5 mean average precisio…

    Submitted 10 April, 2019; v1 submitted 4 April, 2019; originally announced April 2019.

    Comments: 6 pages, RepEval 2019 (NAACL-HLT workshop)

  17. arXiv:1808.08953  [pdf, other]

    cs.AI cs.CL

    Term Set Expansion based NLP Architect by Intel AI Lab

    Authors: Jonathan Mamou, Oren Pereg, Moshe Wasserblat, Alon Eirew, Yael Green, Shira Guskin, Peter Izsak, Daniel Korat

    Abstract: We present SetExpander, a corpus-based system for expanding a seed set of terms into a more complete set of terms that belong to the same semantic class. SetExpander implements an iterative end-to-end workflow. It enables users to easily select a seed set of terms, expand it, view the expanded set, validate it, re-expand the validated set and store it, thus simplifying the extraction of domain-spec…

    Submitted 15 October, 2018; v1 submitted 27 August, 2018; originally announced August 2018.

    Comments: EMNLP 2018 System Demonstrations. arXiv admin note: substantial text overlap with arXiv:1807.10104

  18. arXiv:1807.10104  [pdf, other]

    cs.AI cs.CL

    Term Set Expansion based on Multi-Context Term Embeddings: an End-to-end Workflow

    Authors: Jonathan Mamou, Oren Pereg, Moshe Wasserblat, Ido Dagan, Yoav Goldberg, Alon Eirew, Yael Green, Shira Guskin, Peter Izsak, Daniel Korat

    Abstract: We present SetExpander, a corpus-based system for expanding a seed set of terms into a more complete set of terms that belong to the same semantic class. SetExpander implements an iterative end-to-end workflow for term set expansion. It enables users to easily select a seed set of terms, expand it, view the expanded set, validate it, re-expand the validated set and store it, thus simplifying the e…

    Submitted 26 July, 2018; originally announced July 2018.

    Comments: COLING 2018 System Demonstration paper

    MSC Class: 68T50; ACM Class: I.2.7