-
ChatShop: Interactive Information Seeking with Language Agents
Authors:
Sanxing Chen,
Sam Wiseman,
Bhuwan Dhingra
Abstract:
The desire and ability to seek new information strategically are fundamental to human learning but often overlooked in current language agent evaluation. We analyze a popular web shopping task designed to test language agents' ability to perform strategic exploration and discover that it can be reformulated and solved as a single-turn retrieval task without the need for interactive information seeking. This finding encourages us to rethink realistic constraints on information access that would necessitate strategic information seeking. We then redesign the task to introduce a notion of task ambiguity and the role of a shopper, serving as a dynamic party with whom the agent strategically interacts in an open-ended conversation to make informed decisions. Our experiments demonstrate that the proposed task can effectively evaluate the agent's ability to explore and gradually accumulate information through multi-turn interactions. Additionally, we show that large language model-simulated shoppers serve as a good proxy for real human shoppers, revealing similar error patterns in agents.
Submitted 16 June, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Authors:
Brandon McKinzie,
Zhe Gan,
Jean-Philippe Fauconnier,
Sam Dodge,
Bowen Zhang,
Philipp Dufter,
Dhruti Shah,
Xianzhi Du,
Futang Peng,
Floris Weers,
Anton Belyi,
Haotian Zhang,
Karanjeet Singh,
Doug Kang,
Ankur Jain,
Hongyu Hè,
Max Schwarzer,
Tom Gunter,
Xiang Kong,
Aonan Zhang,
Jianyu Wang,
Chong Wang,
Nan Du,
Tao Lei,
Sam Wiseman
, et al. (7 additional authors not shown)
Abstract:
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
Submitted 18 April, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Seq2seq is All You Need for Coreference Resolution
Authors:
Wenzheng Zhang,
Sam Wiseman,
Karl Stratos
Abstract:
Existing works on coreference resolution suggest that task-specific models are necessary to achieve state-of-the-art performance. In this work, we present compelling evidence that such models are not necessary. We finetune a pretrained seq2seq transformer to map an input document to a tagged sequence encoding the coreference annotation. Despite the extreme simplicity, our model outperforms or closely matches the best coreference systems in the literature on an array of datasets. We also propose an especially simple seq2seq approach that generates only tagged spans rather than the spans interleaved with the original text. Our analysis shows that the model size, the amount of supervision, and the choice of sequence representations are key factors in performance.
Submitted 20 October, 2023;
originally announced October 2023.
-
BM25 Query Augmentation Learned End-to-End
Authors:
Xiaoyin Chen,
Sam Wiseman
Abstract:
Given BM25's enduring competitiveness as an information retrieval baseline, we investigate to what extent it can be even further improved by augmenting and re-weighting its sparse query-vector representation. We propose an approach to learning an augmentation and a re-weighting end-to-end, and we find that our approach improves performance over BM25 while retaining its speed. We furthermore find that the learned augmentations and re-weightings transfer well to unseen datasets.
Submitted 23 May, 2023;
originally announced May 2023.
-
Approximating CKY with Transformers
Authors:
Ghazal Khalighinejad,
Ollie Liu,
Sam Wiseman
Abstract:
We investigate the ability of transformer models to approximate the CKY algorithm, using them to directly predict a sentence's parse and thus avoid the CKY algorithm's cubic dependence on sentence length. We find that on standard constituency parsing benchmarks this approach achieves competitive or better performance than comparable parsers that make use of CKY, while being faster. We also evaluate the viability of this approach for parsing under random PCFGs. Here we find that performance declines as the grammar becomes more ambiguous, suggesting that the transformer is not fully capturing the CKY computation. However, we also find that incorporating additional inductive bias is helpful, and we propose a novel approach that makes use of gradients with respect to chart representations in predicting the parse, in analogy with the CKY algorithm being a subgradient of a partition function variant with respect to the chart.
Submitted 4 November, 2023; v1 submitted 3 May, 2023;
originally announced May 2023.
-
CREATIVESUMM: Shared Task on Automatic Summarization for Creative Writing
Authors:
Divyansh Agarwal,
Alexander R. Fabbri,
Simeng Han,
Wojciech Kryściński,
Faisal Ladhak,
Bryan Li,
Kathleen McKeown,
Dragomir Radev,
Tianyi Zhang,
Sam Wiseman
Abstract:
This paper introduces the shared task of summarizing documents in several creative domains, namely literary texts, movie scripts, and television scripts. Summarizing these creative documents requires making complex literary interpretations, as well as understanding non-trivial temporal dependencies in texts containing varied styles of plot development and narrative structure. This poses unique challenges and is yet underexplored for text summarization systems. In this shared task, we introduce four sub-tasks and their corresponding datasets, focusing on summarizing books, movie scripts, primetime television scripts, and daytime soap opera scripts. We detail the process of curating these datasets for the task, as well as the metrics used for the evaluation of the submissions. As part of the CREATIVESUMM workshop at COLING 2022, the shared task attracted 18 submissions in total. We discuss the submissions and the baselines for each sub-task in this paper, along with directions for facilitating future work in the field.
Submitted 6 December, 2022; v1 submitted 10 November, 2022;
originally announced November 2022.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
On Generalization in Coreference Resolution
Authors:
Shubham Toshniwal,
Patrick Xia,
Sam Wiseman,
Karen Livescu,
Kevin Gimpel
Abstract:
While coreference resolution is defined independently of dataset domain, most models for performing coreference resolution do not transfer well to unseen domains. We consolidate a set of 8 coreference resolution datasets targeting different domains to evaluate the off-the-shelf performance of models. We then mix three datasets for training; even though their domain, annotation guidelines, and metadata differ, we propose a method for jointly training a single model on this heterogeneous data mixture by using data augmentation to account for annotation differences and sampling to balance the data quantities. We find that in a zero-shot setting, models trained on a single dataset transfer poorly while joint training yields improved overall performance, leading to better generalization in coreference resolution models. This work contributes a new benchmark for robust coreference resolution and multiple new state-of-the-art results.
Submitted 20 September, 2021;
originally announced September 2021.
-
SummScreen: A Dataset for Abstractive Screenplay Summarization
Authors:
Mingda Chen,
Zewei Chu,
Sam Wiseman,
Kevin Gimpel
Abstract:
We introduce SummScreen, a summarization dataset comprised of pairs of TV series transcripts and human written recaps. The dataset provides a challenging testbed for abstractive summarization for several reasons. Plot details are often expressed indirectly in character dialogues and may be scattered across the entirety of the transcript. These details must be found and integrated to form the succinct plot descriptions in the recaps. Also, TV scripts contain content that does not directly pertain to the central plot but rather serves to develop characters or provide comic relief. This information is rarely contained in recaps. Since characters are fundamental to TV series, we also propose two entity-centric evaluation metrics. Empirically, we characterize the dataset by evaluating several methods, including neural models and those based on nearest neighbors. An oracle extractive approach outperforms all benchmarked models according to automatic metrics, showing that the neural models are unable to fully exploit the input transcripts. Human evaluation and qualitative analysis reveal that our non-oracle models are competitive with their oracle counterparts in terms of generating faithful plot events and can benefit from better content selectors. Both oracle and non-oracle models generate unfaithful facts, suggesting future research directions.
Submitted 6 June, 2022; v1 submitted 14 April, 2021;
originally announced April 2021.
-
Chess as a Testbed for Language Model State Tracking
Authors:
Shubham Toshniwal,
Sam Wiseman,
Karen Livescu,
Kevin Gimpel
Abstract:
Transformer language models have made tremendous strides in natural language understanding tasks. However, the complexity of natural language makes it challenging to ascertain how accurately these models are tracking the world state underlying the text. Motivated by this issue, we consider the task of language modeling for the game of chess. Unlike natural language, chess notations describe a simple, constrained, and deterministic domain. Moreover, we observe that the appropriate choice of chess notation allows for directly probing the world state, without requiring any additional probing-related machinery. We find that: (a) With enough training data, transformer language models can learn to track pieces and predict legal moves with high accuracy when trained solely on move sequences. (b) For small training sets providing access to board state information during training can yield significant improvements. (c) The success of transformer language models is dependent on access to the entire game history i.e. "full attention". Approximating this full attention results in a significant performance drop. We propose this testbed as a benchmark for future work on the development and analysis of transformer language models.
Submitted 13 May, 2022; v1 submitted 25 February, 2021;
originally announced February 2021.
-
Data-to-text Generation by Splicing Together Nearest Neighbors
Authors:
Sam Wiseman,
Arturs Backurs,
Karl Stratos
Abstract:
We propose to tackle data-to-text generation tasks by directly splicing together retrieved segments of text from "neighbor" source-target pairs. Unlike recent work that conditions on retrieved neighbors but generates text token-by-token, left-to-right, we learn a policy that directly manipulates segments of neighbor text, by inserting or replacing them in partially constructed generations. Standard techniques for training such a policy require an oracle derivation for each generation, and we prove that finding the shortest such derivation can be reduced to parsing under a particular weighted context-free grammar. We find that policies learned in this way perform on par with strong baselines in terms of automatic and human evaluation, but allow for more interpretable and controllable generation.
Submitted 28 October, 2021; v1 submitted 20 January, 2021;
originally announced January 2021.
-
WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections
Authors:
Mingda Chen,
Sam Wiseman,
Kevin Gimpel
Abstract:
Datasets for data-to-text generation typically focus either on multi-domain, single-sentence generation or on single-domain, long-form generation. In this work, we cast generating Wikipedia sections as a data-to-text generation task and create a large-scale dataset, WikiTableT, that pairs Wikipedia sections with their corresponding tabular data and various metadata. WikiTableT contains millions of instances, covering a broad range of topics, as well as a variety of flavors of generation tasks with different levels of flexibility. We benchmark several training and decoding strategies on WikiTableT. Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they struggle with coherence and factuality, showing the potential for our dataset to inspire future work on long-form generation.
Submitted 1 June, 2021; v1 submitted 29 December, 2020;
originally announced December 2020.
-
Exemplar-Controllable Paraphrasing and Translation using Bitext
Authors:
Mingda Chen,
Sam Wiseman,
Kevin Gimpel
Abstract:
Most prior work on exemplar-based syntactically controlled paraphrase generation relies on automatically-constructed large-scale paraphrase datasets, which are costly to create. We sidestep this prerequisite by adapting models from prior work to be able to learn solely from bilingual text (bitext). Despite only using bitext for training, and in near zero-shot conditions, our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions. To evaluate these tasks quantitatively, we create three novel evaluation datasets. Our experimental results show that our models achieve competitive results on controlled paraphrase generation and strong performance on controlled machine translation. Analysis shows that our models learn to disentangle semantics and syntax in their latent representations, but still suffer from semantic drift.
Submitted 17 September, 2021; v1 submitted 12 October, 2020;
originally announced October 2020.
-
Learning to Ignore: Long Document Coreference with Bounded Memory Neural Networks
Authors:
Shubham Toshniwal,
Sam Wiseman,
Allyson Ettinger,
Karen Livescu,
Kevin Gimpel
Abstract:
Long document coreference resolution remains a challenging task due to the large memory and runtime requirements of current models. Recent work doing incremental coreference resolution using just the global representation of entities shows practical benefits but requires keeping all entities in memory, which can be impractical for long documents. We argue that keeping all entities in memory is unnecessary, and we propose a memory-augmented neural network that tracks only a small bounded number of entities at a time, thus guaranteeing a linear runtime in length of document. We show that (a) the model remains competitive with models with high memory and computational requirements on OntoNotes and LitBank, and (b) the model learns an efficient memory management strategy easily outperforming a rule-based strategy.
Submitted 16 November, 2020; v1 submitted 6 October, 2020;
originally announced October 2020.
-
Discrete Latent Variable Representations for Low-Resource Text Classification
Authors:
Shuning Jin,
Sam Wiseman,
Karl Stratos,
Karen Livescu
Abstract:
While much work on deep latent variable models of text uses continuous latent variables, discrete latent variables are interesting because they are more interpretable and typically more space efficient. We consider several approaches to learning discrete latent variable models for text in the case where exact marginalization over these variables is intractable. We compare the performance of the learned representations as features for low-resource document and sentence classification. Our best models outperform the previous best reported results with continuous representations in these low-resource settings, while learning significantly more compressed representations. Interestingly, we find that an amortized variant of Hard EM performs particularly well in the lowest-resource regimes.
Submitted 11 June, 2020;
originally announced June 2020.
-
ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation
Authors:
Lifu Tu,
Richard Yuanzhe Pang,
Sam Wiseman,
Kevin Gimpel
Abstract:
We propose to train a non-autoregressive machine translation model to minimize the energy defined by a pretrained autoregressive model. In particular, we view our non-autoregressive translation system as an inference network (Tu and Gimpel, 2018) trained to minimize the autoregressive teacher energy. This contrasts with the popular approach of training a non-autoregressive model on a distilled corpus consisting of the beam-searched outputs of such a teacher model. Our approach, which we call ENGINE (ENerGy-based Inference NEtworks), achieves state-of-the-art non-autoregressive results on the IWSLT 2014 DE-EN and WMT 2016 RO-EN datasets, approaching the performance of autoregressive models.
Submitted 12 May, 2020; v1 submitted 2 May, 2020;
originally announced May 2020.
-
Learning Discrete Structured Representations by Adversarially Maximizing Mutual Information
Authors:
Karl Stratos,
Sam Wiseman
Abstract:
We propose learning discrete structured representations from unlabeled data by maximizing the mutual information between a structured latent variable and a target variable. Calculating mutual information is intractable in this setting. Our key technical contribution is an adversarial objective that can be used to tractably estimate mutual information assuming only the feasibility of cross entropy calculation. We develop a concrete realization of this general formulation with Markov distributions over binary encodings. We report critical and unexpected findings on practical aspects of the objective such as the choice of variational priors. We apply our model on document hashing and show that it outperforms current best baselines based on discrete and vector quantized variational autoencoders. It also yields highly compressed interpretable representations.
Submitted 15 July, 2020; v1 submitted 8 April, 2020;
originally announced April 2020.
-
Amortized Bethe Free Energy Minimization for Learning MRFs
Authors:
Sam Wiseman,
Yoon Kim
Abstract:
We propose to learn deep undirected graphical models (i.e., MRFs) with a non-ELBO objective for which we can calculate exact gradients. In particular, we optimize a saddle-point objective deriving from the Bethe free energy approximation to the partition function. Unlike much recent work in approximate inference, the derived objective requires no sampling, and can be efficiently computed even for very expressive MRFs. We furthermore amortize this optimization with trained inference networks. Experimentally, we find that the proposed approach compares favorably with loopy belief propagation, but is faster, and it allows for attaining better held out log likelihood than other recent approximate inference schemes.
Submitted 17 November, 2019; v1 submitted 14 June, 2019;
originally announced June 2019.
-
Label-Agnostic Sequence Labeling by Copying Nearest Neighbors
Authors:
Sam Wiseman,
Karl Stratos
Abstract:
Retrieve-and-edit based approaches to structured prediction, where structures associated with retrieved neighbors are edited to form new structures, have recently attracted increased interest. However, much recent work merely conditions on retrieved structures (e.g., in a sequence-to-sequence framework), rather than explicitly manipulating them. We show we can perform accurate sequence labeling by explicitly (and only) copying labels from retrieved neighbors. Moreover, because this copying is label-agnostic, we can achieve impressive performance when transferring to new sequence-labeling tasks without retraining. We additionally consider a dynamic programming approach to sequence labeling in the presence of retrieved neighbors, which allows for controlling the number of distinct (copied) segments used to form a prediction, and leads to both more interpretable and accurate predictions.
Submitted 20 August, 2021; v1 submitted 10 June, 2019;
originally announced June 2019.
-
Controllable Paraphrase Generation with a Syntactic Exemplar
Authors:
Mingda Chen,
Qingming Tang,
Sam Wiseman,
Kevin Gimpel
Abstract:
Prior work on controllable text generation usually assumes that the controlled attribute can take on one of a small set of values known a priori. In this work, we propose a novel task, where the syntax of a generated sentence is controlled rather by a sentential exemplar. To evaluate quantitatively with standard metrics, we create a novel dataset with human annotations. We also develop a variational model with a neural module specifically designed for capturing syntactic knowledge and several multitask training objectives to promote disentangled representation learning. Empirically, the proposed model is observed to achieve improvements over baselines and learn to capture desirable characteristics.
Submitted 3 June, 2019;
originally announced June 2019.
-
A Multi-Task Approach for Disentangling Syntax and Semantics in Sentence Representations
Authors:
Mingda Chen,
Qingming Tang,
Sam Wiseman,
Kevin Gimpel
Abstract:
We propose a generative model for a sentence that uses two latent variables, with one intended to represent the syntax of the sentence and the other to represent its semantics. We show we can achieve better disentanglement between semantic and syntactic representations by training with multiple losses, including losses that exploit aligned paraphrastic sentences and word-order information. We also investigate the effect of moving from bag-of-words to recurrent neural network modules. We evaluate our models as well as several popular pretrained embeddings on standard semantic similarity tasks and novel syntactic similarity tasks. Empirically, we find that the model with the best performing syntactic and semantic representations also gives rise to the most disentangled representations.
Submitted 1 April, 2019;
originally announced April 2019.
-
A Tutorial on Deep Latent Variable Models of Natural Language
Authors:
Yoon Kim,
Sam Wiseman,
Alexander M. Rush
Abstract:
There has been much recent, exciting work on combining the complementary strengths of latent variable models and deep learning. Latent variable modeling makes it easy to explicitly specify model constraints through conditional independence properties, while deep learning makes it possible to parameterize these conditional likelihoods with powerful function approximators. While these "deep latent variable" models provide a rich, flexible framework for modeling many real-world phenomena, difficulties exist: deep parameterizations of conditional likelihoods usually make posterior inference intractable, and latent variable objectives often complicate backpropagation by introducing points of non-differentiability. This tutorial explores these issues in depth through the lens of variational inference.
Submitted 4 August, 2019; v1 submitted 17 December, 2018;
originally announced December 2018.
-
Entity Tracking Improves Cloze-style Reading Comprehension
Authors:
Luong Hoang,
Sam Wiseman,
Alexander M. Rush
Abstract:
Reading comprehension tasks test the ability of models to process long-term context and remember salient information. Recent work has shown that relatively simple neural methods such as the Attention Sum-Reader can perform well on these tasks; however, these systems still significantly trail human performance. Analysis suggests that many of the remaining hard instances are related to the inability to track entity-references throughout documents. This work focuses on these hard entity tracking cases with two extensions: (1) additional entity features, and (2) training with a multi-task tracking objective. We show that these simple modifications improve performance both independently and in combination, and we outperform the previous state of the art on the LAMBADA dataset, particularly on difficult entity examples.
Submitted 5 October, 2018;
originally announced October 2018.
-
Learning Neural Templates for Text Generation
Authors:
Sam Wiseman,
Stuart M. Shieber,
Alexander M. Rush
Abstract:
While neural, encoder-decoder models have had significant empirical success in text generation, there remain several unaddressed problems with this style of generation. Encoder-decoder models are largely (a) uninterpretable, and (b) difficult to control in terms of their phrasing or content. This work proposes a neural generation system using a hidden semi-Markov model (HSMM) decoder, which learns latent, discrete templates jointly with learning to generate. We show that this model learns useful templates, and that these templates make generation both more interpretable and controllable. Furthermore, we show that this approach scales to real data sets and achieves strong performance nearing that of encoder-decoder text generation models.
Submitted 17 June, 2019; v1 submitted 30 August, 2018;
originally announced August 2018.
-
Semi-Amortized Variational Autoencoders
Authors:
Yoon Kim,
Sam Wiseman,
Andrew C. Miller,
David Sontag,
Alexander M. Rush
Abstract:
Amortized variational inference (AVI) replaces instance-specific local inference with a global inference network. While AVI has enabled efficient training of deep generative models such as variational autoencoders (VAE), recent empirical work suggests that inference networks can produce suboptimal variational parameters. We propose a hybrid approach, to use AVI to initialize the variational parameters and run stochastic variational inference (SVI) to refine them. Crucially, the local SVI procedure is itself differentiable, so the inference network and generative model can be trained end-to-end with gradient-based optimization. This semi-amortized approach enables the use of rich generative models without experiencing the posterior-collapse phenomenon common in training VAEs for problems like text generation. Experiments show this approach outperforms strong autoregressive and variational baselines on standard text and image datasets.
Submitted 23 July, 2018; v1 submitted 7 February, 2018;
originally announced February 2018.
-
Challenges in Data-to-Document Generation
Authors:
Sam Wiseman,
Stuart M. Shieber,
Alexander M. Rush
Abstract:
Recent neural models have shown significant progress on the problem of generating short descriptive texts conditioned on a small number of database records. In this work, we suggest a slightly more difficult data-to-text generation task, and investigate how effective current approaches are on this task. In particular, we introduce a new, large-scale corpus of data records paired with descriptive documents, propose a series of extractive evaluation methods for analyzing performance, and obtain baseline results using current neural generation methods. Experiments show that these models produce fluent text, but fail to convincingly approximate human-generated documents. Moreover, even templated baselines exceed the performance of these neural models on some metrics, though copy- and reconstruction-based extensions lead to noticeable improvements.
Submitted 25 July, 2017;
originally announced July 2017.
-
Training Language Models Using Target-Propagation
Authors:
Sam Wiseman,
Sumit Chopra,
Marc'Aurelio Ranzato,
Arthur Szlam,
Ruoyu Sun,
Soumith Chintala,
Nicolas Vasilache
Abstract:
While Truncated Back-Propagation through Time (BPTT) is the most popular approach to training Recurrent Neural Networks (RNNs), it suffers from being inherently sequential (making parallelization difficult) and from truncating gradient flow between distant time-steps. We investigate whether Target Propagation (TPROP) style approaches can address these shortcomings. Unfortunately, extensive experiments suggest that TPROP generally underperforms BPTT, and we end with an analysis of this phenomenon, and suggestions for future work.
Submitted 15 February, 2017;
originally announced February 2017.
-
Free Energy Computation by Monte Carlo Integration
Authors:
Matthew Clark,
Jeffrey S. Wiseman
Abstract:
The principles behind the computation of protein-ligand binding free energies by Monte Carlo integration are described in detail. The simulation provides gas-phase binding free energies that can be converted to aqueous energies by solvation corrections. The direct integration simulation has several characteristics beneficial to free-energy calculations. One is that the number of parameters that must be set for the simulation is small and can be determined objectively, making the outcome more deterministic, with respect to choice of input conditions, as compared to perturbation methods. Second, the simulation is free from assumptions about the starting pose or nature of the binding site. A final benefit is that binding free energies are a direct outcome of the simulation, and little processing is required to determine them.
The well-studied T4 lysozyme experimental free energy data and crystal structures were used to evaluate the method.
Submitted 18 January, 2017;
originally announced January 2017.
-
Sequence-to-Sequence Learning as Beam-Search Optimization
Authors:
Sam Wiseman,
Alexander M. Rush
Abstract:
Sequence-to-Sequence (seq2seq) modeling has rapidly become an important general-purpose NLP tool that has proven effective for many text-generation and sequence-labeling tasks. Seq2seq builds on deep neural language modeling and inherits its remarkable accuracy in estimating local, next-word distributions. In this work, we introduce a model and beam-search training scheme, based on the work of Daume III and Marcu (2005), that extends seq2seq to learn global sequence scores. This structured approach avoids classical biases associated with local training and unifies the training loss with the test-time usage, while preserving the proven model architecture of seq2seq and its efficient training approach. We show that our system outperforms a highly-optimized attention-based seq2seq system and other baselines on three different sequence to sequence tasks: word ordering, parsing, and machine translation.
Submitted 9 November, 2016; v1 submitted 9 June, 2016;
originally announced June 2016.
-
Learning Global Features for Coreference Resolution
Authors:
Sam Wiseman,
Alexander M. Rush,
Stuart M. Shieber
Abstract:
There is compelling evidence that coreference prediction would benefit from modeling global information about entity-clusters. Yet, state-of-the-art performance can be achieved with systems treating each mention prediction independently, which we attribute to the inherent difficulty of crafting informative cluster-level features. We instead propose to use recurrent neural networks (RNNs) to learn latent, global representations of entity clusters directly from their mentions. We show that such representations are especially useful for the prediction of pronominal mentions, and can be incorporated into an end-to-end coreference system that outperforms the state of the art without requiring any additional search.
Submitted 11 April, 2016;
originally announced April 2016.
-
MRO/CRISM Retrieval of Surface Lambert Albedos for Multispectral Mapping of Mars with DISORT-based Rad. Transfer Modeling: Phase 1 - Using Historical Climatology for Temperatures, Aerosol Opacities, & Atmo. Pressures
Authors:
P. C. McGuire,
M. J. Wolff,
M. D. Smith,
R. E. Arvidson,
S. L. Murchie,
R. T. Clancy,
T. L. Roush,
S. C. Cull,
K. A. Lichtenberg,
S. M. Wiseman,
R. O. Green,
T. Z. Martin,
R. E. Milliken,
P. J. Cavender,
D. C. Humm,
F. P. Seelos,
K. D. Seelos,
H. W. Taylor,
B. L. Ehlmann,
J. F. Mustard,
S. M. Pelkey,
T. N. Titus,
C. D. Hash,
E. R. Malaret,
the CRISM Team
Abstract:
We discuss the DISORT-based radiative transfer pipeline ('CRISM_LambertAlb') for atmospheric and thermal correction of MRO/CRISM data acquired in multispectral mapping mode (~200 m/pixel, 72 spectral channels). Currently, in this phase-one version of the system, we use aerosol optical depths, surface temperatures, and lower-atmospheric temperatures, all from climatology derived from Mars Global Surveyor Thermal Emission Spectrometer (MGS-TES) data, and surface altimetry derived from MGS Mars Orbiter Laser Altimeter (MOLA). The DISORT-based model takes as input the dust and ice aerosol optical depths (scaled to the CRISM wavelength range), the surface pressures (computed from MOLA altimetry, MGS-TES lower-atmospheric thermometry, and Viking-based pressure climatology), the surface temperatures, the reconstructed instrumental photometric angles, and the measured I/F spectrum, and then outputs a Lambertian albedo spectrum. The Lambertian albedo spectrum is valuable geologically since it allows the mineralogical composition to be estimated. Here, I/F is defined as the ratio of the radiance measured by CRISM to the solar irradiance at Mars divided by $π$. After discussing the capabilities and limitations of the pipeline software system, we demonstrate its application on several multispectral data cubes: the outer northern ice cap of Mars, Tyrrhena Terra, and near the landing site for the Phoenix mission. For the icy spectra near the northern polar cap, aerosols need to be included in order to properly correct for the CO_2 absorption in the H_{2}O ice bands at wavelengths near 2.0 $μ$m. In future phases of software development, we intend to use CRISM data directly in order to retrieve the spatiotemporal maps of aerosol optical depths, surface pressure and surface temperature.
Submitted 21 March, 2009;
originally announced March 2009.
-
Self-Averaging, Distribution of Pseudo-Critical Temperatures and Finite Size Scaling in Critical Disordered Systems
Authors:
S. Wiseman,
E. Domany
Abstract:
The distributions $P(X)$ of singular thermodynamic quantities in an ensemble of quenched random samples of linear size $l$ at the critical point $T_c$ are studied by Monte Carlo in two models. Our results confirm predictions of Aharony and Harris based on Renormalization group considerations. For an Ashkin-Teller model with strong but irrelevant bond randomness we find that the relative squared width, $R_X$, of $P(X)$ is weakly self averaging. $R_X\sim l^{α/ν}$, where $α$ is the specific heat exponent and $ν$ is the correlation length exponent of the pure model fixed point governing the transition. For the site dilute Ising model on a cubic lattice, known to be governed by a random fixed point, we find that $R_X$ tends to a universal constant independent of the amount of dilution (no self averaging). However this constant is different for canonical and grand canonical disorder. We study the distribution of the pseudo-critical temperatures $T_c(i,l)$ of the ensemble defined as the temperatures of the maximum susceptibility of each sample. We find that its variance scales as $(δT_c(l))^2 \sim l^{-2/ν}$ and NOT as $\sim l^{-d}$. We find that $R_χ$ is reduced by a factor of $\sim 70$ with respect to $R_χ(T_c)$ by measuring $χ$ of each sample at $T_c(i,l)$. We analyze correlations between the magnetization at criticality $m_i(T_c,l)$ and the pseudo-critical temperature $T_c(i,l)$ in terms of a sample independent finite size scaling function of a sample dependent reduced temperature $(T-T_c(i,l))/T_c$. This function is found to be universal and to behave similarly to pure systems.
Submitted 10 February, 1998;
originally announced February 1998.
-
Lack of Self Averaging and Finite Size Scaling in Critical Disordered Systems
Authors:
S. Wiseman,
E. Domany
Abstract:
We simulated site dilute Ising models in $d=3$ dimensions for several lattice sizes $L$. For each $L$ singular thermodynamic quantities $X$ were measured at criticality and their distributions $P(X)$ were determined, for ensembles of several thousand random samples. For $L \to \infty$ the width of $P(X)$ tends to a universal constant, i.e. there is no self averaging. The width of the distribution of the sample dependent pseudocritical temperatures $T_c(i,L)$ scales as $δT_c(L) \sim L^{-1/ν}$ and NOT as $\sim L^{-d/2}$. Finite size scaling holds; the sample dependence of $X_i(T_c)$ enters predominantly through $T_c(i,L)$.
Submitted 10 February, 1998; v1 submitted 9 February, 1998;
originally announced February 1998.
-
Data clustering using a model granular magnet
Authors:
Marcelo Blatt,
Shai Wiseman,
Eytan Domany
Abstract:
We present a new approach to clustering, based on the physical properties of an inhomogeneous ferromagnet. No assumption is made regarding the underlying distribution of the data. We assign a Potts spin to each data point and introduce an interaction between neighboring points, whose strength is a decreasing function of the distance between the neighbors. This magnetic system exhibits three phases. At very low temperatures it is completely ordered; all spins are aligned. At very high temperatures the system does not exhibit any ordering and in an intermediate regime clusters of relatively strongly coupled spins become ordered, whereas different clusters remain uncorrelated. This intermediate phase is identified by a jump in the order parameters. The spin-spin correlation function is used to partition the spins and the corresponding data points into clusters. We demonstrate on three synthetic and three real data sets how the method works. Detailed comparison to the performance of other techniques clearly indicates the relative success of our method.
Submitted 9 February, 1997;
originally announced February 1997.
-
Lack of Self-Averaging in Critical Disordered Systems
Authors:
S. Wiseman,
E. Domany
Abstract:
We consider the sample to sample fluctuations that occur in the value of a thermodynamic quantity $P$ in an ensemble of finite systems with quenched disorder, at equilibrium. The variance of $P$, $V_{P}$, which characterizes these fluctuations is calculated as a function of the systems' linear size $l$, focusing on the behavior at the critical point. The specific model considered is the bond-disordered Ashkin-Teller model on a square lattice. Using Monte Carlo simulations, several bond-disordered Ashkin-Teller models were examined, including the bond-disordered Ising model and the bond-disordered four-state Potts model. It was found that far from criticality the energy, magnetization, specific heat and susceptibility are strongly self averaging, that is $V_{P}\sim l^{-d}$ (where $d=2$ is the dimension). At criticality though, the results indicate that the magnetization $M$ and the susceptibility $χ$ are non self averaging, i.e. $\frac{V_χ}{χ^{2}}, \frac{V_{M}}{M^{2}}\not \rightarrow 0$. The energy $E$ at criticality is weakly self averaging, that is $V_{E}\sim l^{-y_{v}}$ with $0<y_{v}<d$. Less conclusively, and possibly only as a transient behavior, the specific heat too is found to be weakly self averaging. A phenomenological theory of finite size scaling for disordered systems is developed. Its main prediction is that when the specific heat exponent $α<0$ ($α$ of the disordered model) then, for a quantity $P$ which scales as $l^ρ$ at criticality, its variance $V_{P}$ will scale asymptotically as $l^{2ρ+\frac{α}{ν}}$. We found very good agreement between the theory and the data for $V_χ$ and $V_{E}$.
Submitted 23 June, 1995; v1 submitted 22 June, 1995;
originally announced June 1995.
-
Critical behaviour of the Random--Bond Ashkin--Teller Model, a Monte-Carlo study
Authors:
S. Wiseman,
E. Domany
Abstract:
The critical behaviour of a bond-disordered Ashkin-Teller model on a square lattice is investigated by intensive Monte-Carlo simulations. A duality transformation is used to locate a critical plane of the disordered model. This critical plane corresponds to the line of critical points of the pure model, along which critical exponents vary continuously. Along this line the scaling exponent corresponding to randomness $φ=(α/ν)$ varies continuously and is positive so that randomness is relevant and different critical behaviour is expected for the disordered model. We use a cluster algorithm for the Monte Carlo simulations based on the Wolff embedding idea, and perform a finite size scaling study of several critical models, extrapolating between the critical bond-disordered Ising and bond-disordered four state Potts models. The critical behaviour of the disordered model is compared with the critical behaviour of an anisotropic Ashkin-Teller model which is used as a reference pure model. We find no essential change in the order parameters' critical exponents with respect to those of the pure model. The divergence of the specific heat $C$ is changed dramatically. Our results favor a logarithmic type divergence at $T_{c}$, $C\sim \log L$ for the random bond Ashkin-Teller and four state Potts models and $C\sim \log \log L$ for the random bond Ising model.
Submitted 11 November, 1994;
originally announced November 1994.
-
A Cluster Method for the Ashkin--Teller Model
Authors:
S. Wiseman,
E. Domany
Abstract:
A cluster Monte Carlo algorithm for the Ashkin-Teller (AT) model is constructed according to the guidelines of a general scheme for such algorithms. Its dynamical behaviour is tested for the square lattice AT model. We perform simulations on the line of critical points along which the exponents vary continuously, and find that critical slowing down is significantly reduced. We find continuous variation of the dynamical exponent $z$ along the line, following the variation of the ratio $α/ν$, in a manner which satisfies the Li-Sokal bound $z_{cluster}\geq α/ν$, that was so far proved only for Potts models.
Submitted 14 October, 1993;
originally announced October 1993.