Search | arXiv e-print repository

Distributed Speculative Inference of Large Language Models

Authors: Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel

Abstract: Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI al… ▽ More Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups (compared to non-SI) but require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often do not have matching drafters that are sufficiently fast and accurate. We show a gap: SI gets slower than non-SI when using slower or less accurate drafters. We close this gap by proving that DSI is faster than both SI and non-SI given any drafters. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI. △ Less

Submitted 28 June, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.04304 [pdf, other]

Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models

Authors: Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz

Abstract: Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL)-the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Op… ▽ More Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL)-the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL. Our experiments with four datasets show that DISCO reaches an average speedup of 10% compared to the best static SL baseline, while generating the exact same text. △ Less

Submitted 23 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

arXiv:2306.02307 [pdf, other]

Finding the SWEET Spot: Analysis and Improvement of Adaptive Inference in Low Resource Settings

Authors: Daniel Rotem, Michael Hassid, Jonathan Mamou, Roy Schwartz

Abstract: Adaptive inference is a simple method for reducing inference costs. The method works by maintaining multiple classifiers of different capacities, and allocating resources to each test instance according to its difficulty. In this work, we compare the two main approaches for adaptive inference, Early-Exit and Multi-Model, when training data is limited. First, we observe that for models with the sam… ▽ More Adaptive inference is a simple method for reducing inference costs. The method works by maintaining multiple classifiers of different capacities, and allocating resources to each test instance according to its difficulty. In this work, we compare the two main approaches for adaptive inference, Early-Exit and Multi-Model, when training data is limited. First, we observe that for models with the same architecture and size, individual Multi-Model classifiers outperform their Early-Exit counterparts by an average of 2.3%. We show that this gap is caused by Early-Exit classifiers sharing model parameters during training, resulting in conflicting gradient updates of model weights. We find that despite this gap, Early-Exit still provides a better speed-accuracy trade-off due to the overhead of the Multi-Model approach. To address these issues, we propose SWEET (Separating Weights in Early Exit Transformers), an Early-Exit fine-tuning method that assigns each classifier its own set of unique model weights, not updated by other classifiers. We compare SWEET's speed-accuracy curve to standard Early-Exit and Multi-Model baselines and find that it outperforms both methods at fast speeds while maintaining comparable scores to Early-Exit at slow speeds. Moreover, SWEET individual classifiers outperform Early-Exit ones by 1.1% on average. SWEET enjoys the benefits of both methods, paving the way for further reduction of inference costs in NLP. △ Less

Submitted 4 June, 2023; originally announced June 2023.

Comments: Proceedings of ACL 2023

arXiv:2204.06271 [pdf, other]

TangoBERT: Reducing Inference Cost by using Cascaded Architecture

Authors: Jonathan Mamou, Oren Pereg, Moshe Wasserblat, Roy Schwartz

Abstract: The remarkable success of large transformer-based models such as BERT, RoBERTa and XLNet in many NLP tasks comes with a large increase in monetary and environmental cost due to their high computational load and energy consumption. In order to reduce this computational load in inference time, we present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient… ▽ More The remarkable success of large transformer-based models such as BERT, RoBERTa and XLNet in many NLP tasks comes with a large increase in monetary and environmental cost due to their high computational load and energy consumption. In order to reduce this computational load in inference time, we present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient but less accurate first tier model, and only part of those instances are additionally processed by a less efficient but more accurate second tier model. The decision of whether to apply the second tier model is based on a confidence score produced by the first tier model. Our simple method has several appealing practical advantages compared to standard cascading approaches based on multi-layered transformer models. First, it enables higher speedup gains (average lower latency). Second, it takes advantage of batch size optimization for cascading, which increases the relative inference cost reductions. We report TangoBERT inference CPU speedup on four text classification GLUE tasks and on one reading comprehension task. Experimental results show that TangoBERT outperforms efficient early exit baseline models; on the the SST-2 task, it achieves an accuracy of 93.9% with a CPU speedup of 8.2x. △ Less

Submitted 13 April, 2022; originally announced April 2022.

arXiv:2104.07578 [pdf, other]

Syntactic Perturbations Reveal Representational Correlates of Hierarchical Phrase Structure in Pretrained Language Models

Authors: Matteo Alleman, Jonathan Mamou, Miguel A Del Rio, Hanlin Tang, Yoon Kim, SueYeon Chung

Abstract: While vector-based language representations from pretrained language models have set a new standard for many NLP tasks, there is not yet a complete accounting of their inner workings. In particular, it is not entirely clear what aspects of sentence-level syntax are captured by these representations, nor how (if at all) they are built along the stacked layers of the network. In this paper, we aim t… ▽ More While vector-based language representations from pretrained language models have set a new standard for many NLP tasks, there is not yet a complete accounting of their inner workings. In particular, it is not entirely clear what aspects of sentence-level syntax are captured by these representations, nor how (if at all) they are built along the stacked layers of the network. In this paper, we aim to address such questions with a general class of interventional, input perturbation-based analyses of representations from pretrained language models. Importing from computational and cognitive neuroscience the notion of representational invariance, we perform a series of probes designed to test the sensitivity of these representations to several kinds of structure in sentences. Each probe involves swap** words in a sentence and comparing the representations from perturbed sentences against the original. We experiment with three different perturbations: (1) random permutations of n-grams of varying width, to test the scale at which a representation is sensitive to word position; (2) swap** of two spans which do or do not form a syntactic phrase, to test sensitivity to global phrase structure; and (3) swap** of two adjacent words which do or do not break apart a syntactic phrase, to test sensitivity to local phrase structure. Results from these probes collectively suggest that Transformers build sensitivity to larger parts of the sentence along their layers, and that hierarchical phrase structure plays a role in this process. More broadly, our results also indicate that structured input perturbations widens the scope of analyses that can be performed on often-opaque deep learning systems, and can serve as a complement to existing tools (such as supervised linear probes) for interpreting complex black-box models. △ Less

Submitted 15 April, 2021; originally announced April 2021.

Comments: 12 pages, 7 figures

arXiv:2006.01095 [pdf, other]

Emergence of Separable Manifolds in Deep Language Representations

Authors: Jonathan Mamou, Hang Le, Miguel Del Rio, Cory Stephenson, Hanlin Tang, Yoon Kim, SueYeon Chung

Abstract: Deep neural networks (DNNs) have shown much empirical success in solving perceptual tasks across various cognitive modalities. While they are only loosely inspired by the biological brain, recent studies report considerable similarities between representations extracted from task-optimized DNNs and neural populations in the brain. DNNs have subsequently become a popular model class to infer comput… ▽ More Deep neural networks (DNNs) have shown much empirical success in solving perceptual tasks across various cognitive modalities. While they are only loosely inspired by the biological brain, recent studies report considerable similarities between representations extracted from task-optimized DNNs and neural populations in the brain. DNNs have subsequently become a popular model class to infer computational principles underlying complex cognitive functions, and in turn, they have also emerged as a natural testbed for applying methods originally developed to probe information in neural populations. In this work, we utilize mean-field theoretic manifold analysis, a recent technique from computational neuroscience that connects geometry of feature representations with linear separability of classes, to analyze language representations from large-scale contextual embedding models. We explore representations from different model families (BERT, RoBERTa, GPT, etc.) and find evidence for emergence of linguistic manifolds across layer depth (e.g., manifolds for part-of-speech tags), especially in ambiguous data (i.e, words with multiple part-of-speech tags, or part-of-speech classes including many words). In addition, we find that the emergence of linear separability in these manifolds is driven by a combined reduction of manifolds' radius, dimensionality and inter-manifold correlations. △ Less

Submitted 8 July, 2020; v1 submitted 1 June, 2020; originally announced June 2020.

Comments: 9 pages. 10 figures. Accepted to ICML 2020. Included supplemental materials

arXiv:1911.03243 [pdf, ps, other]

Controlled Crowdsourcing for High-Quality QA-SRL Annotation

Authors: Paul Roit, Ayal Klein, Daniela Stepanov, Jonathan Mamou, Julian Michael, Gabriel Stanovsky, Luke Zettlemoyer, Ido Dagan

Abstract: Question-answer driven Semantic Role Labeling (QA-SRL) was proposed as an attractive open and natural flavour of SRL, potentially attainable from laymen. Recently, a large-scale crowdsourced QA-SRL corpus and a trained parser were released. Trying to replicate the QA-SRL annotation for new texts, we found that the resulting annotations were lacking in quality, particularly in coverage, making them… ▽ More Question-answer driven Semantic Role Labeling (QA-SRL) was proposed as an attractive open and natural flavour of SRL, potentially attainable from laymen. Recently, a large-scale crowdsourced QA-SRL corpus and a trained parser were released. Trying to replicate the QA-SRL annotation for new texts, we found that the resulting annotations were lacking in quality, particularly in coverage, making them insufficient for further research and evaluation. In this paper, we present an improved crowdsourcing protocol for complex semantic annotation, involving worker selection and training, and a data consolidation phase. Applying this protocol to QA-SRL yielded high-quality annotation with drastically higher coverage, producing a new gold evaluation dataset. We believe that our annotation protocol and gold standard will facilitate future replicable research of natural semantic annotations. △ Less

Submitted 13 May, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

arXiv:1910.09061 [pdf, other]

Deep Mouse: An End-to-end Auto-context Refinement Framework for Brain Ventricle and Body Segmentation in Embryonic Mice Ultrasound Volumes

Authors: Tongda Xu, Ziming Qiu, William Das, Chuiyu Wang, Jack Langerman, Nitin Nair, Orlando Aristizabal, Jonathan Mamou, Daniel H. Turnbull, Jeffrey A. Ketterling, Yao Wang

Abstract: High-frequency ultrasound (HFU) is well suited for imaging embryonic mice due to its noninvasive and real-time characteristics. However, manual segmentation of the brain ventricles (BVs) and body requires substantial time and expertise. This work proposes a novel deep learning based end-to-end auto-context refinement framework, consisting of two stages. The first stage produces a low resolution se… ▽ More High-frequency ultrasound (HFU) is well suited for imaging embryonic mice due to its noninvasive and real-time characteristics. However, manual segmentation of the brain ventricles (BVs) and body requires substantial time and expertise. This work proposes a novel deep learning based end-to-end auto-context refinement framework, consisting of two stages. The first stage produces a low resolution segmentation of the BV and body simultaneously. The resulting probability map for each object (BV or body) is then used to crop a region of interest (ROI) around the target object in both the original image and the probability map to provide context to the refinement segmentation network. Joint training of the two stages provides significant improvement in Dice Similarity Coefficient (DSC) over using only the first stage (0.818 to 0.906 for the BV, and 0.919 to 0.934 for the body). The proposed method significantly reduces the inference time (102.36 to 0.09 s/volume around 1000x faster) while slightly improves the segmentation accuracy over the previous methods using slide-window approaches. △ Less

Submitted 29 October, 2019; v1 submitted 20 October, 2019; originally announced October 2019.

Comments: Full Paper Submission to ISBI 2020

arXiv:1909.10555 [pdf]

Automatic Mouse Embryo Brain Ventricle & Body Segmentation and Mutant Classification From Ultrasound Data Using Deep Learning

Authors: Ziming Qiu, Nitin Nair, Jack Langerman, Orlando Aristizabal, Jonathan Mamou, Daniel H. Turnbull, Jeffrey A. Ketterling, Yao Wang

Abstract: High-frequency ultrasound (HFU) is well suited for imaging embryonic mice in vivo because it is non-invasive and real-time. Manual segmentation of the brain ventricles (BVs) and whole body from 3D HFU images is time-consuming and requires specialized training. This paper presents a deep-learning-based segmentation pipeline which automates several time-consuming, repetitive tasks currently performe… ▽ More High-frequency ultrasound (HFU) is well suited for imaging embryonic mice in vivo because it is non-invasive and real-time. Manual segmentation of the brain ventricles (BVs) and whole body from 3D HFU images is time-consuming and requires specialized training. This paper presents a deep-learning-based segmentation pipeline which automates several time-consuming, repetitive tasks currently performed to study genetic mutations in develo** mouse embryos. Namely, the pipeline accurately segments the BV and body regions in 3D HFU images of mouse embryos, despite significant challenges due to position and shape variation of the embryos, as well as imaging artifacts. Based on the BV segmentation, a 3D convolutional neural network (CNN) is further trained to detect embryos with the Engrailed-1 (En1) mutation. The algorithms achieve 0.896 and 0.925 Dice Similarity Coefficient (DSC) for BV and body segmentation, respectively, and 95.8% accuracy on mutant classification. Through gradient based interrogation and visualization of the trained classifier, it is demonstrated that the model focuses on the morphological structures known to be affected by the En1 mutation. △ Less

Submitted 23 September, 2019; originally announced September 2019.

Comments: 4 pages, 6 figures, the 2019 IEEE International Ultrasonics Symposium

arXiv:1909.05608 [pdf, other]

ABSApp: A Portable Weakly-Supervised Aspect-Based Sentiment Extraction System

Authors: Oren Pereg, Daniel Korat, Moshe Wasserblat, Jonathan Mamou, Ido Dagan

Abstract: We present ABSApp, a portable system for weakly-supervised aspect-based sentiment extraction. The system is interpretable and user friendly and does not require labeled training data, hence can be rapidly and cost-effectively used across different domains in applied setups. The system flow includes three stages: First, it generates domain-specific aspect and opinion lexicons based on an unlabeled… ▽ More We present ABSApp, a portable system for weakly-supervised aspect-based sentiment extraction. The system is interpretable and user friendly and does not require labeled training data, hence can be rapidly and cost-effectively used across different domains in applied setups. The system flow includes three stages: First, it generates domain-specific aspect and opinion lexicons based on an unlabeled dataset; second, it enables the user to view and edit those lexicons (weak supervision); and finally, it enables the user to select an unlabeled target dataset from the same domain, classify it, and generate an aspect-based sentiment report. ABSApp has been successfully used in a number of real-life use cases, among them movie review analysis and convention impact analysis. △ Less

Submitted 12 September, 2019; originally announced September 2019.

Comments: 6 pages, demo paper at EMNLP 2019

arXiv:1904.02496 [pdf, ps, other]

Multi-Context Term Embeddings: the Use Case of Corpus-based Term Set Expansion

Authors: Jonathan Mamou, Oren Pereg, Moshe Wasserblat, Ido Dagan

Abstract: In this paper, we present a novel algorithm that combines multi-context term embeddings using a neural classifier and we test this approach on the use case of corpus-based term set expansion. In addition, we present a novel and unique dataset for intrinsic evaluation of corpus-based term set expansion algorithms. We show that, over this dataset, our algorithm provides up to 5 mean average precisio… ▽ More In this paper, we present a novel algorithm that combines multi-context term embeddings using a neural classifier and we test this approach on the use case of corpus-based term set expansion. In addition, we present a novel and unique dataset for intrinsic evaluation of corpus-based term set expansion algorithms. We show that, over this dataset, our algorithm provides up to 5 mean average precision points over the best baseline. △ Less

Submitted 10 April, 2019; v1 submitted 4 April, 2019; originally announced April 2019.

Comments: 6 pages, RepEval 2019 (NAACL-HLT workshop)

arXiv:1811.03601 [pdf]

Deep BV: A Fully Automated System for Brain Ventricle Localization and Segmentation in 3D Ultrasound Images of Embryonic Mice

Authors: Ziming Qiu, Jack Langerman, Nitin Nair, Orlando Aristizabal, Jonathan Mamou, Daniel H. Turnbull, Jeffrey Ketterling, Yao Wang

Abstract: Volumetric analysis of brain ventricle (BV) structure is a key tool in the study of central nervous system development in embryonic mice. High-frequency ultrasound (HFU) is the only non-invasive, real-time modality available for rapid volumetric imaging of embryos in utero. However, manual segmentation of the BV from HFU volumes is tedious, time-consuming, and requires specialized expertise. In th… ▽ More Volumetric analysis of brain ventricle (BV) structure is a key tool in the study of central nervous system development in embryonic mice. High-frequency ultrasound (HFU) is the only non-invasive, real-time modality available for rapid volumetric imaging of embryos in utero. However, manual segmentation of the BV from HFU volumes is tedious, time-consuming, and requires specialized expertise. In this paper, we propose a novel deep learning based BV segmentation system for whole-body HFU images of mouse embryos. Our fully automated system consists of two modules: localization and segmentation. It first applies a volumetric convolutional neural network on a 3D sliding window over the entire volume to identify a 3D bounding box containing the entire BV. It then employs a fully convolutional network to segment the detected bounding box into BV and background. The system achieves a Dice Similarity Coefficient (DSC) of 0.8956 for BV segmentation on an unseen 111 HFU volume test set surpassing the previous state-of-the-art method (DSC of 0.7119) by a margin of 25%. △ Less

Submitted 5 November, 2018; originally announced November 2018.

Comments: IEEE Signal Processing in Medicine and Biology Symposium - 2018, 6 pages, 5 figures

ACM Class: I.2.6; I.4.6; I.5.1; I.5.4; I.5.5; J.3

arXiv:1808.08953 [pdf, other]

Term Set Expansion based NLP Architect by Intel AI Lab

Authors: Jonathan Mamou, Oren Pereg, Moshe Wasserblat, Alon Eirew, Yael Green, Shira Guskin, Peter Izsak, Daniel Korat

Abstract: We present SetExpander, a corpus-based system for expanding a seed set of terms into amore complete set of terms that belong to the same semantic class. SetExpander implements an iterative end-to-end workflow. It enables users to easily select a seed set of terms, expand it, view the expanded set, validate it, re-expand the validated set and store it, thus simplifying the extraction of domain-spec… ▽ More We present SetExpander, a corpus-based system for expanding a seed set of terms into amore complete set of terms that belong to the same semantic class. SetExpander implements an iterative end-to-end workflow. It enables users to easily select a seed set of terms, expand it, view the expanded set, validate it, re-expand the validated set and store it, thus simplifying the extraction of domain-specific fine-grained semantic classes.SetExpander has been used successfully in real-life use cases including integration into an automated recruitment system and an issues and defects resolution system. A video demo of SetExpander is available at https://drive.google.com/open?id=1e545bB87Autsch36DjnJHmq3HWfSd1Rv (some images were blurred for privacy reasons) △ Less

Submitted 15 October, 2018; v1 submitted 27 August, 2018; originally announced August 2018.

Comments: EMNLP 2018 System Demonstrations. arXiv admin note: substantial text overlap with arXiv:1807.10104

arXiv:1807.10104 [pdf, other]

Term Set Expansion based on Multi-Context Term Embeddings: an End-to-end Workflow

Authors: Jonathan Mamou, Oren Pereg, Moshe Wasserblat, Ido Dagan, Yoav Goldberg, Alon Eirew, Yael Green, Shira Guskin, Peter Izsak, Daniel Korat

Abstract: We present SetExpander, a corpus-based system for expanding a seed set of terms into a more complete set of terms that belong to the same semantic class. SetExpander implements an iterative end-to end workflow for term set expansion. It enables users to easily select a seed set of terms, expand it, view the expanded set, validate it, re-expand the validated set and store it, thus simplifying the e… ▽ More We present SetExpander, a corpus-based system for expanding a seed set of terms into a more complete set of terms that belong to the same semantic class. SetExpander implements an iterative end-to end workflow for term set expansion. It enables users to easily select a seed set of terms, expand it, view the expanded set, validate it, re-expand the validated set and store it, thus simplifying the extraction of domain-specific fine-grained semantic classes. SetExpander has been used for solving real-life use cases including integration in an automated recruitment system and an issues and defects resolution system. A video demo of SetExpander is available at https://drive.google.com/open?id=1e545bB87Autsch36DjnJHmq3HWfSd1Rv (some images were blurred for privacy reasons). △ Less

Submitted 26 July, 2018; originally announced July 2018.

Comments: COLING 2018 System Demonstration paper

MSC Class: 68T50 ACM Class: I.2.7

arXiv:1705.07015 [pdf, other]

Segmentation of 3D High-frequency Ultrasound Images of Human Lymph Nodes Using Graph Cut with Energy Functional Adapted to Local Intensity Distribution

Authors: Jen-wei Kuo, Jonathan Mamou, Yao Wang, Emi Saegusa-Beecroft, Junji Machi, Ernest J. Feleppa

Abstract: Previous studies by our group have shown that three-dimensional high-frequency quantitative ultrasound methods have the potential to differentiate metastatic lymph nodes from cancer-free lymph nodes dissected from human cancer patients. To successfully perform these methods inside the lymph node parenchyma, an automatic segmentation method is highly desired to exclude the surrounding thin layer of… ▽ More Previous studies by our group have shown that three-dimensional high-frequency quantitative ultrasound methods have the potential to differentiate metastatic lymph nodes from cancer-free lymph nodes dissected from human cancer patients. To successfully perform these methods inside the lymph node parenchyma, an automatic segmentation method is highly desired to exclude the surrounding thin layer of fat from quantitative ultrasound processing and accurately correct for ultrasound attenuation. In high-frequency ultrasound images of lymph nodes, the intensity distribution of lymph node parenchyma and fat varies spatially because of acoustic attenuation and focusing effects. Thus, the intensity contrast between two object regions (e.g., lymph node parenchyma and fat) is also spatially varying. In our previous work, nested graph cut demonstrated its ability to simultaneously segment lymph node parenchyma, fat, and the outer phosphate-buffered saline bath even when some boundaries are lost because of acoustic attenuation and focusing effects. This paper describes a novel approach called graph cut with locally adaptive energy to further deal with spatially varying distributions of lymph node parenchyma and fat caused by inhomogeneous acoustic attenuation. The proposed method achieved Dice similarity coefficients of 0.937+-0.035 when compared to expert manual segmentation on a representative dataset consisting of 115 three-dimensional lymph node images obtained from colorectal cancer patients. △ Less

Submitted 19 May, 2017; originally announced May 2017.

Showing 1–15 of 15 results for author: Mamou, J