Showing 1–5 of 5 results for author: Berchansky, M

  1. arXiv:2405.14105  [pdf, other]

    cs.DC cs.AI cs.CL cs.LG

    Distributed Speculative Inference of Large Language Models

    Authors: Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel

    Abstract: Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI al…

    Submitted 28 June, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

  2. arXiv:2405.04304  [pdf, other]

    cs.CL

    Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models

    Authors: Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz

    Abstract: Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL), the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Op…

    Submitted 23 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

  3. arXiv:2404.10513  [pdf, other]

    cs.CL cs.AI cs.LG

    CoTAR: Chain-of-Thought Attribution Reasoning with Multi-level Granularity

    Authors: Moshe Berchansky, Daniel Fleischer, Moshe Wasserblat, Peter Izsak

    Abstract: State-of-the-art performance in QA tasks is currently achieved by systems employing Large Language Models (LLMs), however these models tend to hallucinate information in their responses. One approach focuses on enhancing the generation process by incorporating attribution from the given input to the output. However, the challenge of identifying appropriate attributions and verifying their accuracy…

    Submitted 16 April, 2024; originally announced April 2024.

  4. arXiv:2310.13682  [pdf, other]

    cs.CL cs.AI cs.LG

    Optimizing Retrieval-augmented Reader Models via Token Elimination

    Authors: Moshe Berchansky, Peter Izsak, Avi Caciularu, Ido Dagan, Moshe Wasserblat

    Abstract: Fusion-in-Decoder (FiD) is an effective retrieval-augmented language model applied across a variety of open-domain tasks, such as question answering, fact checking, etc. In FiD, supporting passages are first retrieved and then processed using a generative model (Reader), which can cause a significant bottleneck in decoding time, particularly with long outputs. In this work, we analyze the contribu…

    Submitted 5 November, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 Main Conference

  5. arXiv:2104.07705  [pdf, other]

    cs.CL cs.AI cs.LG

    How to Train BERT with an Academic Budget

    Authors: Peter Izsak, Moshe Berchansky, Omer Levy

    Abstract: While large language models a la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours using a single low-end deep learning server. We demonstrate that through a combination of software optimizati…

    Submitted 9 September, 2021; v1 submitted 15 April, 2021; originally announced April 2021.