Exploring and Improving Drafts in Blockwise Parallel Decoding

Authors: Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton

Abstract: Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. as a method to improve inference speed of language models by simultaneously predicting multiple future tokens, termed block drafts, which are subsequently verified and conditionally accepted by the autoregressive model. This paper contributes to the understanding and improvement of block drafts in two ways. First, we analyze the token distributions produced by multiple prediction heads. Secondly, we leverage this analysis to develop algorithms to improve BPD inference speed by refining the block drafts using n-gram and neural language models. Experiments demonstrate that refined block drafts yield a +5-21% increase in block efficiency (i.e., the number of accepted tokens from the block draft) across diverse datasets.

Submitted 5 June, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

arXiv:2310.13167

Visualizing Causality in Mixed Reality for Manual Task Learning: An Exploratory Study

Authors: Rahul Jain, Jingyu Shi, Andrew Benton, Moiz Rasheed, Hyungjun Doh, Subramanian Chidambaram, Karthik Ramani

Abstract: Mixed Reality (MR) is gaining prominence in manual task skill learning due to its in-situ, embodied, and immersive experience. To teach manual tasks, current methodologies break the task into hierarchies (tasks into subtasks) and visualize the current subtask and future in terms of causality. Existing psychology literature also shows that humans learn tasks by breaking them into hierarchies. In order to understand the design space of information visualized to the learner for better task understanding, we conducted a user study with 48 users. The study was conducted using a complex assembly task, which involves learning of both actions and tool usage. We aim to explore the effect of visualization of causality in the hierarchy for manual task learning in MR by four options: no causality, event level causality, interaction level causality, and gesture level causality. The results show that users understand and perform best when all levels of causality are shown. Based on the results, we further provide design recommendations and in-depth discussions for future manual task learning systems.

Submitted 31 January, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

arXiv:2309.14894

Verifiable Learned Behaviors via Motion Primitive Composition: Applications to Scooping of Granular Media

Authors: Andrew Benton, Eugen Solowjow, Prithvi Akella

Abstract: A robotic behavior model that can reliably generate behaviors from natural language inputs in real time would substantially expedite the adoption of industrial robots due to enhanced system flexibility. To facilitate these efforts, we construct a framework in which learned behaviors, created by a natural language abstractor, are verifiable by construction. Leveraging recent advancements in motion primitives and probabilistic verification, we construct a natural-language behavior abstractor that generates behaviors by synthesizing a directed graph over the provided motion primitives. If these component motion primitives are constructed according to the criteria we specify, the resulting behaviors are probabilistically verifiable. We demonstrate this verifiable behavior generation capacity in both simulation on an exploration task and on hardware with a robot scooping granular media.

Submitted 26 September, 2023; originally announced September 2023.

arXiv:2301.10371

Weakly Supervised Headline Dependency Parsing

Authors: Adrian Benton, Tianze Shi, Ozan İrsoy, Igor Malioutov

Abstract: English news headlines form a register with unique syntactic properties that have been documented in linguistics literature since the 1930s. However, headlines have received surprisingly little attention from the NLP syntactic parsing community. We aim to bridge this gap by providing the first news headline corpus of Universal Dependencies annotated syntactic dependency trees, which enables us to evaluate existing state-of-the-art dependency parsers on news headlines. To improve English news headline parsing accuracies, we develop a projection method to bootstrap silver training data from unlabeled news headline-article lead sentence pairs. Models trained on silver headline parses demonstrate significant improvements in performance over models trained solely on gold-annotated long-form texts. Ultimately, we find that, although projected silver training data improves parser performance across different news outlets, the improvement is moderated by constructions idiosyncratic to outlet.

Submitted 24 January, 2023; originally announced January 2023.

Comments: Findings of EMNLP 2022

ACM Class: I.2.7

Journal ref: In Proceedings of Findings of EMNLP 2022

arXiv:2205.11505

What Makes Data-to-Text Generation Hard for Pretrained Language Models?

Authors: Moniba Keymanesh, Adrian Benton, Mark Dredze

Abstract: Expressing natural language descriptions of structured facts or relations -- data-to-text generation (D2T) -- increases the accessibility of structured knowledge repositories. Previous work shows that pre-trained language models (PLMs) perform remarkably well on this task after fine-tuning on a significant amount of task-specific training data. On the other hand, while auto-regressive PLMs can generalize from a few task examples, their efficacy at D2T is largely unexplored. Furthermore, we have an incomplete understanding of the limits of PLMs on D2T. In this work, we conduct an empirical study of both fine-tuned and auto-regressive PLMs on the DART multi-domain D2T dataset. We consider their performance as a function of the amount of task-specific data and how these data are incorporated into the models: zero and few-shot learning, and fine-tuning of model weights. In addition, we probe the limits of PLMs by measuring performance on subsets of the evaluation data: novel predicates and abstractive test examples. To improve the performance on these subsets, we investigate two techniques: providing predicate descriptions in the context and re-ranking generated candidates by information reflected in the source. Finally, we conduct a human evaluation of model errors and show that D2T generation tasks would benefit from datasets with more careful manual curation.

Submitted 23 May, 2022; originally announced May 2022.

Comments: 15 pages, 5 figures

arXiv:2109.07488

Comparing Euclidean and Hyperbolic Embeddings on the WordNet Nouns Hypernymy Graph

Authors: Sameer Bansal, Adrian Benton

Abstract: Nickel and Kiela (2017) present a new method for embedding tree nodes in the Poincare ball, and suggest that these hyperbolic embeddings are far more effective than Euclidean embeddings at embedding nodes in large, hierarchically structured graphs like the WordNet nouns hypernymy tree. This is especially true in low dimensions (Nickel and Kiela, 2017, Table 1). In this work, we seek to reproduce their experiments on embedding and reconstructing the WordNet nouns hypernymy graph. Counter to what they report, we find that Euclidean embeddings are able to represent this tree at least as well as Poincare embeddings, when allowed at least 50 dimensions. We note that this does not diminish the significance of their work given the impressive performance of hyperbolic embeddings in very low-dimensional settings. However, given the wide influence of their work, our aim here is to present an updated and more accurate comparison between the Euclidean and hyperbolic embeddings.

Submitted 15 September, 2021; originally announced September 2021.

ACM Class: I.2.7

arXiv:2109.07483

Cross-Register Projection for Headline Part of Speech Tagging

Authors: Adrian Benton, Hanyang Li, Igor Malioutov

Abstract: Part of speech (POS) tagging is a familiar NLP task. State of the art taggers routinely achieve token-level accuracies of over 97% on news body text, evidence that the problem is well understood. However, the register of English news headlines, "headlinese", is very different from the register of long-form text, causing POS tagging models to underperform on headlines. In this work, we automatically annotate news headlines with POS tags by projecting predicted tags from corresponding sentences in news bodies. We train a multi-domain POS tagger on both long-form and headline text and show that joint training on both registers improves over training on just one or naively concatenating training sets. We evaluate on a newly-annotated corpus of over 5,248 English news headlines from the Google sentence compression corpus, and show that our model yields a 23% relative error reduction per token and 19% per headline. In addition, we demonstrate that better headline POS tags can improve the performance of a syntax-based open information extraction system. We make POSH, the POS-tagged Headline corpus, available to encourage research in improved NLP models for news headlines.

Submitted 15 September, 2021; originally announced September 2021.

Comments: EMNLP 2021

ACM Class: I.2.7

arXiv:2104.13936

Diversity-Aware Batch Active Learning for Dependency Parsing

Authors: Tianze Shi, Adrian Benton, Igor Malioutov, Ozan İrsoy

Abstract: While the predictive performance of modern statistical dependency parsers relies heavily on the availability of expensive expert-annotated treebank data, not all annotations contribute equally to the training of the parsers. In this paper, we attempt to reduce the number of labeled examples needed to train a strong dependency parser using batch active learning (AL). In particular, we investigate whether enforcing diversity in the sampled batches, using determinantal point processes (DPPs), can improve over their diversity-agnostic counterparts. Simulation experiments on an English newswire corpus show that selecting diverse batches with DPPs is superior to strong selection strategies that do not enforce batch diversity, especially during the initial stages of the learning process. Additionally, our diversity-aware strategy is robust under a corpus duplication setting, where diversity-agnostic sampling strategies exhibit significant degradation.

Submitted 28 April, 2021; originally announced April 2021.

Comments: NAACL 2021

ACM Class: I.2.7

Journal ref: In Proceedings of NAACL 2021

arXiv:2012.15332

Corrected CBOW Performs as well as Skip-gram

Authors: Ozan İrsoy, Adrian Benton, Karl Stratos

Abstract: Mikolov et al. (2013a) observed that continuous bag-of-words (CBOW) word embeddings tend to underperform Skip-gram (SG) embeddings, and this finding has been reported in subsequent works. We find that these observations are driven not by fundamental differences in their training objectives, but more likely by faulty negative sampling CBOW implementations in popular libraries such as the official implementation, word2vec.c, and Gensim. We show that after correcting a bug in the CBOW gradient update, one can learn CBOW word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks, while being many times faster to train.

Submitted 9 November, 2021; v1 submitted 30 December, 2020; originally announced December 2020.

Comments: Presented at WINR at EMNLP 2021, added discussion about FastText, more discussion about findings, additional results on C4 data, wording changes

arXiv:1812.00436

Learning Representations of Social Media Users

Authors: Adrian Benton

Abstract: User representations are routinely used in recommendation systems by platform developers, targeted advertisements by marketers, and by public policy researchers to gauge public opinion across demographic groups. Computer scientists consider the problem of inferring user representations more abstractly; how does one extract a stable user representation - effective for many downstream tasks - from a medium as noisy and complicated as social media? The quality of a user representation is ultimately task-dependent (e.g. does it improve classifier performance, make more accurate recommendations in a recommendation system) but there are proxies that are less sensitive to the specific task. Is the representation predictive of latent properties such as a person's demographic features, socioeconomic class, or mental health state? Is it predictive of the user's future behavior? In this thesis, we begin by showing how user representations can be learned from multiple types of user behavior on social media. We apply several extensions of generalized canonical correlation analysis to learn these representations and evaluate them at three tasks: predicting future hashtag mentions, friending behavior, and demographic features. We then show how user features can be employed as distant supervision to improve topic model fit. Finally, we show how user features can be integrated into and improve existing classifiers in the multitask learning framework.
We treat user representations - ground truth gender and mental health features - as auxiliary tasks to improve mental health state prediction. We also use distributed user representations learned in the first chapter to improve tweet-level stance classifiers, showing that distant user information can inform classification tasks at the granularity of a single message.

Submitted 2 December, 2018; originally announced December 2018.

Comments: PhD thesis

arXiv:1712.03538

Multi-Task Learning for Mental Health using Social Media Text

Authors: Adrian Benton, Margaret Mitchell, Dirk Hovy

Abstract: We introduce initial groundwork for estimating suicide risk and mental health in a deep learning framework. By modeling multiple conditions, the system learns to make predictions about suicide risk and mental health at a low false positive rate. Conditions are modeled as tasks in a multi-task learning (MTL) framework, with gender prediction as an additional auxiliary task. We demonstrate the effectiveness of multi-task learning by comparison to a well-tuned single-task baseline with the same number of parameters. Our best MTL model predicts potential suicide attempt, as well as the presence of atypical mental health, with AUC > 0.8. We also find additional large improvements using multi-task learning on mental health tasks with limited training data.

Submitted 10 December, 2017; originally announced December 2017.

ACM Class: I.2.7

Journal ref: Proceedings of the 15th Conference of the EACL (2017) 152-162

arXiv:1702.02519

Deep Generalized Canonical Correlation Analysis

Authors: Adrian Benton, Huda Khayrallah, Biman Gujral, Dee Ann Reisinger, Sheng Zhang, Raman Arora

Abstract: We present Deep Generalized Canonical Correlation Analysis (DGCCA) -- a method for learning nonlinear transformations of arbitrarily many views of data, such that the resulting transformations are maximally informative of each other. While methods for nonlinear two-view representation learning (Deep CCA, (Andrew et al., 2013)) and linear many-view representation learning (Generalized CCA (Horst, 1961)) exist, DGCCA is the first CCA-style multiview representation learning technique that combines the flexibility of nonlinear (deep) representation learning with the statistical power of incorporating information from many independent sources, or views. We present the DGCCA formulation as well as an efficient stochastic optimization algorithm for solving it. We learn DGCCA representations on two distinct datasets for three downstream tasks: phonetic transcription from acoustic and articulatory measurements, and recommending hashtags and friends on a dataset of Twitter users. We find that DGCCA representations soundly beat existing methods at phonetic transcription and hashtag recommendation, and in general perform no worse than standard linear many-view techniques.

Submitted 14 June, 2017; v1 submitted 8 February, 2017; originally announced February 2017.

Comments: 14 pages, 6 figures

arXiv:1610.02060

After Sandy Hook Elementary: A Year in the Gun Control Debate on Twitter

Authors: Adrian Benton, Braden Hancock, Glen Coppersmith, John W. Ayers, Mark Dredze

Abstract: The mass shooting at Sandy Hook elementary school on December 14, 2012 catalyzed a year of active debate and legislation on gun control in the United States. Social media hosted an active public discussion where people expressed their support and opposition to a variety of issues surrounding gun legislation. In this paper, we show how a content-based analysis of Twitter data can provide insights and understanding into this debate. We estimate the relative support and opposition to gun control measures, along with a topic analysis of each camp by analyzing over 70 million gun-related tweets from 2013. We focus on spikes in conversation surrounding major events related to guns throughout the year. Our general approach can be applied to other important public health and political issues to analyze the prevalence and nature of public opinion.

Submitted 6 October, 2016; originally announced October 2016.

Comments: Presented at the Data For Good Exchange 2016

arXiv:0801.4019

A Class of Convex Polyhedra with Few Edge Unfoldings

Authors: Alex Benton, Joseph O'Rourke

Abstract: We construct a sequence of convex polyhedra on n vertices with the property that, as n -> infinity, the fraction of its edge unfoldings that avoid overlap approaches 0, and so the fraction that overlap approaches 1. Nevertheless, each does have (several) nonoverlapping edge unfoldings.

Submitted 25 January, 2008; originally announced January 2008.

Comments: 12 pages, 9 figures

Report number: Smith Computer Science 088

ACM Class: F.2.2

Showing 1–14 of 14 results for author: Benton, A