Search | arXiv e-print repository

SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation

Authors: Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni, Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan Das, Ankur P. Parikh

Abstract: Reliable automatic evaluation of summarization systems is challenging due to the multifaceted and subjective nature of the task. This is especially the case for languages other than English, where human evaluations are scarce. In this work, we introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation. SEAHORSE consists of 96K summaries with human ratings along 6 dimensi… ▽ More Reliable automatic evaluation of summarization systems is challenging due to the multifaceted and subjective nature of the task. This is especially the case for languages other than English, where human evaluations are scarce. In this work, we introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation. SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality: comprehensibility, repetition, grammar, attribution, main ideas, and conciseness, covering 6 languages, 9 systems and 4 datasets. As a result of its size and scope, SEAHORSE can serve both as a benchmark to evaluate learnt metrics, as well as a large-scale resource for training such metrics. We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE (Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make the SEAHORSE dataset and metrics publicly available for future research on multilingual and multifaceted summarization evaluation. △ Less

Submitted 1 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

arXiv:2303.04562 [pdf, other]

Extrapolative Controlled Sequence Generation via Iterative Refinement

Authors: Vishakh Padmakumar, Richard Yuanzhe Pang, He He, Ankur P. Parikh

Abstract: We study the problem of extrapolative controlled generation, i.e., generating sequences with attribute values beyond the range seen in training. This task is of significant importance in automated design, especially drug discovery, where the goal is to design novel proteins that are \textit{better} (e.g., more stable) than existing sequences. Thus, by definition, the target sequences and their att… ▽ More We study the problem of extrapolative controlled generation, i.e., generating sequences with attribute values beyond the range seen in training. This task is of significant importance in automated design, especially drug discovery, where the goal is to design novel proteins that are \textit{better} (e.g., more stable) than existing sequences. Thus, by definition, the target sequences and their attribute values are out of the training distribution, posing challenges to existing methods that aim to directly generate the target sequence. Instead, in this work, we propose Iterative Controlled Extrapolation (ICE) which iteratively makes local edits to a sequence to enable extrapolation. We train the model on synthetically generated sequence pairs that demonstrate small improvement in the attribute value. Results on one natural language task (sentiment analysis) and two protein engineering tasks (ACE2 stability and AAV fitness) show that ICE considerably outperforms state-of-the-art approaches despite its simplicity. Our code and models are available at: https://github.com/vishakhpk/iter-extrapolation. △ Less

Submitted 7 June, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

Comments: ICML 2023 - Camera Ready Version

arXiv:2211.08714 [pdf, other]

Reward Gaming in Conditional Text Generation

Authors: Richard Yuanzhe Pang, Vishakh Padmakumar, Thibault Sellam, Ankur P. Parikh, He He

Abstract: To align conditional text generation model outputs with desired behaviors, there has been an increasing focus on training the model using reinforcement learning (RL) with reward functions learned from human annotations. Under this framework, we identify three common cases where high rewards are incorrectly assigned to undesirable patterns: noise-induced spurious correlation, naturally occurring sp… ▽ More To align conditional text generation model outputs with desired behaviors, there has been an increasing focus on training the model using reinforcement learning (RL) with reward functions learned from human annotations. Under this framework, we identify three common cases where high rewards are incorrectly assigned to undesirable patterns: noise-induced spurious correlation, naturally occurring spurious correlation, and covariate shift. We show that even though learned metrics achieve high performance on the distribution of the data used to train the reward function, the undesirable patterns may be amplified during RL training of the text generation model. While there has been discussion about reward gaming in the RL or safety community, in this discussion piece, we would like to highlight reward gaming in the natural language generation (NLG) community using concrete conditional text generation examples and discuss potential fixes and areas for future work. △ Less

Submitted 1 June, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

Comments: ACL 2023

arXiv:2210.11693 [pdf, other]

Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale

Authors: Ran Tian, Ankur P. Parikh

Abstract: We present Amos, a stochastic gradient-based optimizer designed for training deep neural networks. It can be viewed as an Adam optimizer with theoretically supported, adaptive learning-rate decay and weight decay. A key insight behind Amos is that it leverages model-specific information to determine the initial learning-rate and decaying schedules. When used for pre-training BERT variants and T5,… ▽ More We present Amos, a stochastic gradient-based optimizer designed for training deep neural networks. It can be viewed as an Adam optimizer with theoretically supported, adaptive learning-rate decay and weight decay. A key insight behind Amos is that it leverages model-specific information to determine the initial learning-rate and decaying schedules. When used for pre-training BERT variants and T5, Amos consistently converges faster than the state-of-the-art settings of AdamW, achieving better validation loss within <=70% training steps and time, while requiring <=51% memory for slot variables. Our code is open-sourced at: https://github.com/google-research/jestimator △ Less

Submitted 21 November, 2022; v1 submitted 20 October, 2022; originally announced October 2022.

arXiv:2210.06324 [pdf, other]

doi 10.1109/ICASSP49357.2023.10094909

SQuId: Measuring Speech Naturalness in Many Languages

Authors: Thibault Sellam, Ankur Bapna, Joshua Camp, Diana Mackinnon, Ankur P. Parikh, Jason Riesa

Abstract: Much of text-to-speech research relies on human evaluation, which incurs heavy costs and slows down the development process. The problem is particularly acute in heavily multilingual applications, where recruiting and polling judges can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 loca… ▽ More Much of text-to-speech research relies on human evaluation, which incurs heavy costs and slows down the development process. The problem is particularly acute in heavily multilingual applications, where recruiting and polling judges can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales-the largest effort of this type to date. The main insight is that training one model on many locales consistently outperforms mono-locale baselines. We present our task, the model, and show that it outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then demonstrate the effectiveness of cross-locale transfer during fine-tuning and highlight its effect on zero-shot locales, i.e., locales for which there is no fine-tuning data. Through a series of analyses, we highlight the role of non-linguistic effects such as sound artifacts in cross-locale transfer. Finally, we present the effect of our design decision, e.g., model size, pre-training diversity, and language rebalancing with several ablation experiments. △ Less

Submitted 1 June, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted at ICASSP 2023, with additional material in the appendix

arXiv:2205.11588 [pdf, other]

Simple Recurrence Improves Masked Language Models

Authors: Tao Lei, Ran Tian, Jasmijn Bastings, Ankur P. Parikh

Abstract: In this work, we explore whether modeling recurrence into the Transformer architecture can both be beneficial and efficient, by building an extremely simple recurrent module into the Transformer. We compare our model to baselines following the training and evaluation recipe of BERT. Our results confirm that recurrence can indeed improve Transformer models by a consistent margin, without requiring… ▽ More In this work, we explore whether modeling recurrence into the Transformer architecture can both be beneficial and efficient, by building an extremely simple recurrent module into the Transformer. We compare our model to baselines following the training and evaluation recipe of BERT. Our results confirm that recurrence can indeed improve Transformer models by a consistent margin, without requiring low-level performance optimizations, and while kee** the number of parameters constant. For example, our base model achieves an absolute improvement of 2.1 points averaged across 10 tasks and also demonstrates increased stability in fine-tuning over a range of learning rates. △ Less

Submitted 23 May, 2022; originally announced May 2022.

arXiv:2110.08467 [pdf, other]

Improving Compositional Generalization with Self-Training for Data-to-Text Generation

Authors: Sanket Vaibhav Mehta, **feng Rao, Yi Tay, Mihir Kale, Ankur P. Parikh, Emma Strubell

Abstract: Data-to-text generation focuses on generating fluent natural language responses from structured meaning representations (MRs). Such representations are compositional and it is costly to collect responses for all possible combinations of atomic meaning schemata, thereby necessitating few-shot generalization to novel MRs. In this work, we systematically study the compositional generalization of the… ▽ More Data-to-text generation focuses on generating fluent natural language responses from structured meaning representations (MRs). Such representations are compositional and it is costly to collect responses for all possible combinations of atomic meaning schemata, thereby necessitating few-shot generalization to novel MRs. In this work, we systematically study the compositional generalization of the state-of-the-art T5 models in few-shot data-to-text tasks. We show that T5 models fail to generalize to unseen MRs, and we propose a template-based input representation that considerably improves the model's generalization capability. To further improve the model's performance, we propose an approach based on self-training using fine-tuned BLEURT for pseudo response selection. On the commonly-used SGD and Weather benchmarks, the proposed self-training approach improves tree accuracy by 46%+ and reduces the slot error rates by 73%+ over the strong T5 baselines in few-shot settings. △ Less

Submitted 11 April, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

Comments: Accepted at ACL 2022 main conference

arXiv:2110.06341 [pdf, other]

Learning Compact Metrics for MT

Authors: Amy Pu, Hyung Won Chung, Ankur P. Parikh, Sebastian Gehrmann, Thibault Sellam

Abstract: Recent developments in machine translation and multilingual text generation have led researchers to adopt trained metrics such as COMET or BLEURT, which treat evaluation as a regression problem and use representations from multilingual pre-trained models such as XLM-RoBERTa or mBERT. Yet studies on related tasks suggest that these models are most efficient when they are large, which is costly and… ▽ More Recent developments in machine translation and multilingual text generation have led researchers to adopt trained metrics such as COMET or BLEURT, which treat evaluation as a regression problem and use representations from multilingual pre-trained models such as XLM-RoBERTa or mBERT. Yet studies on related tasks suggest that these models are most efficient when they are large, which is costly and impractical for evaluation. We investigate the trade-off between multilinguality and model capacity with RemBERT, a state-of-the-art multilingual language model, using data from the WMT Metrics Shared Task. We present a series of experiments which show that model size is indeed a bottleneck for cross-lingual transfer, then demonstrate how distillation can help addressing this bottleneck, by leveraging synthetic data generation and transferring knowledge from one teacher to multiple students trained on related languages. Our method yields up to 10.5% improvement over vanilla fine-tuning and reaches 92.6% of RemBERT's performance using only a third of its parameters. △ Less

Submitted 12 October, 2021; originally announced October 2021.

Comments: Accepted at EMNLP 2021

arXiv:2108.13032 [pdf, other]

Shatter: An Efficient Transformer Encoder with Single-Headed Self-Attention and Relative Sequence Partitioning

Authors: Ran Tian, Joshua Maynez, Ankur P. Parikh

Abstract: The highly popular Transformer architecture, based on self-attention, is the foundation of large pretrained models such as BERT, that have become an enduring paradigm in NLP. While powerful, the computational resources and time required to pretrain such models can be prohibitive. In this work, we present an alternative self-attention architecture, Shatter, that more efficiently encodes sequence in… ▽ More The highly popular Transformer architecture, based on self-attention, is the foundation of large pretrained models such as BERT, that have become an enduring paradigm in NLP. While powerful, the computational resources and time required to pretrain such models can be prohibitive. In this work, we present an alternative self-attention architecture, Shatter, that more efficiently encodes sequence information by softly partitioning the space of relative positions and applying different value matrices to different parts of the sequence. This mechanism further allows us to simplify the multi-headed attention in Transformer to single-headed. We conduct extensive experiments showing that Shatter achieves better performance than BERT, with pretraining being faster per step (15% on TPU), converging in fewer steps, and offering considerable memory savings (>50%). Put together, Shatter can be pretrained on 8 V100 GPUs in 7 days, and match the performance of BERT_Base -- making the cost of pretraining much more affordable. △ Less

Submitted 30 August, 2021; originally announced August 2021.

arXiv:2103.06799 [pdf, other]

Towards Continual Learning for Multilingual Machine Translation via Vocabulary Substitution

Authors: Xavier Garcia, Noah Constant, Ankur P. Parikh, Orhan Firat

Abstract: We propose a straightforward vocabulary adaptation scheme to extend the language capacity of multilingual machine translation models, paving the way towards efficient continual learning for multilingual machine translation. Our approach is suitable for large-scale datasets, applies to distant languages with unseen scripts, incurs only minor degradation on the translation performance for the origin… ▽ More We propose a straightforward vocabulary adaptation scheme to extend the language capacity of multilingual machine translation models, paving the way towards efficient continual learning for multilingual machine translation. Our approach is suitable for large-scale datasets, applies to distant languages with unseen scripts, incurs only minor degradation on the translation performance for the original language pairs and provides competitive performance even in the case where we only possess monolingual data for the new languages. △ Less

Submitted 11 March, 2021; originally announced March 2021.

Comments: Accepted at NAACL 2021

arXiv:2010.04297 [pdf, other]

Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task

Authors: Thibault Sellam, Amy Pu, Hyung Won Chung, Sebastian Gehrmann, Qijun Tan, Markus Freitag, Dipanjan Das, Ankur P. Parikh

Abstract: The quality of machine translation systems has dramatically improved over the last decade, and as a result, evaluation has become an increasingly challenging problem. This paper describes our contribution to the WMT 2020 Metrics Shared Task, the main benchmark for automatic evaluation of translation. We make several submissions based on BLEURT, a previously published metric based on transfer learn… ▽ More The quality of machine translation systems has dramatically improved over the last decade, and as a result, evaluation has become an increasingly challenging problem. This paper describes our contribution to the WMT 2020 Metrics Shared Task, the main benchmark for automatic evaluation of translation. We make several submissions based on BLEURT, a previously published metric based on transfer learning. We extend the metric beyond English and evaluate it on 14 language pairs for which fine-tuning data is available, as well as 4 "zero-shot" language pairs, for which we have no labelled examples. Additionally, we focus on English to German and demonstrate how to combine BLEURT's predictions with those of YiSi and use alternative reference translations to enhance the performance. Empirical results show that the models achieve competitive results on the WMT Metrics 2019 Shared Task, indicating their promise for the 2020 edition. △ Less

Submitted 19 October, 2020; v1 submitted 8 October, 2020; originally announced October 2020.

arXiv:2009.11201 [pdf, other]

Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages

Authors: Xavier Garcia, Aditya Siddhant, Orhan Firat, Ankur P. Parikh

Abstract: Unsupervised translation has reached impressive performance on resource-rich language pairs such as English-French and English-German. However, early studies have shown that in more realistic settings involving low-resource, rare languages, unsupervised translation performs poorly, achieving less than 3.0 BLEU. In this work, we show that multilinguality is critical to making unsupervised systems p… ▽ More Unsupervised translation has reached impressive performance on resource-rich language pairs such as English-French and English-German. However, early studies have shown that in more realistic settings involving low-resource, rare languages, unsupervised translation performs poorly, achieving less than 3.0 BLEU. In this work, we show that multilinguality is critical to making unsupervised systems practical for low-resource settings. In particular, we present a single model for 5 low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish) to and from English directions, which leverages monolingual and auxiliary parallel data from other high-resource language pairs via a three-stage training scheme. We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU. Additionally, we outperform a large collection of supervised WMT submissions for various language pairs as well as match the performance of the current state-of-the-art supervised model for Nepali-English. We conduct a series of ablation studies to establish the robustness of our model under different degrees of data quality, as well as to analyze the factors which led to the superior performance of the proposed approach over traditional unsupervised models. △ Less

Submitted 12 March, 2021; v1 submitted 23 September, 2020; originally announced September 2020.

Comments: Accepted to NAACL 2021

arXiv:2004.14373 [pdf, other]

ToTTo: A Controlled Table-To-Text Generation Dataset

Authors: Ankur P. Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, Dipanjan Das

Abstract: We present ToTTo, an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. To obtain generated targets that are natural but also faithful to the source table, we introduce a dataset construction process where annotators directly revis… ▽ More We present ToTTo, an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. To obtain generated targets that are natural but also faithful to the source table, we introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. We present systematic analyses of our dataset and annotation process as well as results achieved by several state-of-the-art baselines. While usually fluent, existing methods often hallucinate phrases that are not supported by the table, suggesting that this dataset can serve as a useful research benchmark for high-precision conditional text generation. △ Less

Submitted 6 October, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

Comments: Accepted to EMNLP 2020

arXiv:2004.04696 [pdf, other]

BLEURT: Learning Robust Metrics for Text Generation

Authors: Thibault Sellam, Dipanjan Das, Ankur P. Parikh

Abstract: Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-tr… ▽ More Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution. △ Less

Submitted 21 May, 2020; v1 submitted 9 April, 2020; originally announced April 2020.

Comments: Accepted at ACL 2020

arXiv:2002.02955 [pdf, ps, other]

A Multilingual View of Unsupervised Machine Translation

Authors: Xavier Garcia, Pierre Foret, Thibault Sellam, Ankur P. Parikh

Abstract: We present a probabilistic framework for multilingual neural machine translation that encompasses supervised and unsupervised setups, focusing on unsupervised translation. In addition to studying the vanilla case where there is only monolingual data available, we propose a novel setup where one language in the (source, target) pair is not associated with any parallel data, but there may exist auxi… ▽ More We present a probabilistic framework for multilingual neural machine translation that encompasses supervised and unsupervised setups, focusing on unsupervised translation. In addition to studying the vanilla case where there is only monolingual data available, we propose a novel setup where one language in the (source, target) pair is not associated with any parallel data, but there may exist auxiliary parallel data that contains the other. This auxiliary data can naturally be utilized in our probabilistic framework via a novel cross-translation loss term. Empirically, we show that our approach results in higher BLEU scores over state-of-the-art unsupervised models on the WMT'14 English-French, WMT'16 English-German, and WMT'16 English-Romanian datasets in most directions. In particular, we obtain a +1.65 BLEU advantage over the best-performing unsupervised model in the Romanian-English direction. △ Less

Submitted 16 October, 2020; v1 submitted 7 February, 2020; originally announced February 2020.

Comments: Accepted at Findings of EMNLP 2020 [Fixed processing error.]

arXiv:1910.12366 [pdf, other]

Thieves on Sesame Street! Model Extraction of BERT-based APIs

Authors: Kalpesh Krishna, Gaurav Singh Tomar, Ankur P. Parikh, Nicolas Papernot, Mohit Iyyer

Abstract: We study the problem of model extraction in natural language processing, in which an adversary with only query access to a victim model attempts to reconstruct a local copy of that model. Assuming that both the adversary and victim model fine-tune a large pretrained language model such as BERT (Devlin et al. 2019), we show that the adversary does not need any real training data to successfully mou… ▽ More We study the problem of model extraction in natural language processing, in which an adversary with only query access to a victim model attempts to reconstruct a local copy of that model. Assuming that both the adversary and victim model fine-tune a large pretrained language model such as BERT (Devlin et al. 2019), we show that the adversary does not need any real training data to successfully mount the attack. In fact, the attacker need not even use grammatical or semantically meaningful queries: we show that random sequences of words coupled with task-specific heuristics form effective queries for model extraction on a diverse set of NLP tasks, including natural language inference and question answering. Our work thus highlights an exploit only made feasible by the shift towards transfer learning methods within the NLP community: for a query budget of a few hundred dollars, an attacker can extract a model that performs only slightly worse than the victim model. Finally, we study two defense strategies against model extraction---membership classification and API watermarking---which while successful against naive adversaries, are ineffective against more sophisticated ones. △ Less

Submitted 12 October, 2020; v1 submitted 27 October, 2019; originally announced October 2019.

Comments: ICLR 2020 Camera Ready (19 pages)

arXiv:1910.08684 [pdf, other]

Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation

Authors: Ran Tian, Shashi Narayan, Thibault Sellam, Ankur P. Parikh

Abstract: We address the issue of hallucination in data-to-text generation, i.e., reducing the generation of text that is unsupported by the source. We conjecture that hallucination can be caused by an encoder-decoder model generating content phrases without attending to the source; so we propose a confidence score to ensure that the model attends to the source whenever necessary, as well as a variational B… ▽ More We address the issue of hallucination in data-to-text generation, i.e., reducing the generation of text that is unsupported by the source. We conjecture that hallucination can be caused by an encoder-decoder model generating content phrases without attending to the source; so we propose a confidence score to ensure that the model attends to the source whenever necessary, as well as a variational Bayes training framework that can learn the score from data. Experiments on the WikiBio (Lebretet al., 2016) dataset show that our approach is more faithful to the source than existing state-of-the-art approaches, according to both PARENT score (Dhingra et al., 2019) and human evaluation. We also report strong results on the WebNLG (Gardent et al., 2017) dataset. △ Less

Submitted 2 November, 2020; v1 submitted 18 October, 2019; originally announced October 2019.

arXiv:1906.05807 [pdf, other]

Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index

Authors: Minjoon Seo, **hyuk Lee, Tom Kwiatkowski, Ankur P. Parikh, Ali Farhadi, Hannaneh Hajishirzi

Abstract: Existing open-domain question answering (QA) models are not suitable for real-time usage because they need to process several long documents on-demand for every input query. In this paper, we introduce the query-agnostic indexable representation of document phrases that can drastically speed up open-domain QA and also allows us to reach long-tail targets. In particular, our dense-sparse phrase enc… ▽ More Existing open-domain question answering (QA) models are not suitable for real-time usage because they need to process several long documents on-demand for every input query. In this paper, we introduce the query-agnostic indexable representation of document phrases that can drastically speed up open-domain QA and also allows us to reach long-tail targets. In particular, our dense-sparse phrase encoding effectively captures syntactic, semantic, and lexical information of the phrases and eliminates the pipeline filtering of context documents. Leveraging optimization strategies, our model can be trained in a single 4-GPU server and serve entire Wikipedia (up to 60 billion phrases) under 2TB with CPUs only. Our experiments on SQuAD-Open show that our model is more accurate than DrQA (Chen et al., 2017) with 6000x reduced computational cost, which translates into at least 58x faster end-to-end inference benchmark on CPUs. △ Less

Submitted 14 June, 2019; v1 submitted 13 June, 2019; originally announced June 2019.

Comments: ACL 2019; Code & demo available at https://nlp.cs.washington.edu/denspi/ ; Added comparison to Weaver (Raison et al., 2018)

arXiv:1904.04428 [pdf, other]

Text Generation with Exemplar-based Adaptive Decoding

Authors: Hao Peng, Ankur P. Parikh, Manaal Faruqui, Bhuwan Dhingra, Dipanjan Das

Abstract: We propose a novel conditioned text generation model. It draws inspiration from traditional template-based text generation techniques, where the source provides the content (i.e., what to say), and the template influences how to say it. Building on the successful encoder-decoder paradigm, it first encodes the content representation from the given input text; to produce the output, it retrieves exe… ▽ More We propose a novel conditioned text generation model. It draws inspiration from traditional template-based text generation techniques, where the source provides the content (i.e., what to say), and the template influences how to say it. Building on the successful encoder-decoder paradigm, it first encodes the content representation from the given input text; to produce the output, it retrieves exemplar text from the training data as "soft templates," which are then used to construct an exemplar-specific decoder. We evaluate the proposed model on abstractive text summarization and data-to-text generation. Empirical results show that this model achieves strong performance and outperforms comparable baselines. △ Less

Submitted 10 April, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

Comments: NAACL 2019

arXiv:1904.02338 [pdf, other]

Consistency by Agreement in Zero-shot Neural Machine Translation

Authors: Maruan Al-Shedivat, Ankur P. Parikh

Abstract: Generalization and reliability of multilingual translation often highly depend on the amount of available parallel data for each language pair of interest. In this paper, we focus on zero-shot generalization---a challenging setup that tests models on translation directions they have not been optimized for at training time. To solve the problem, we (i) reformulate multilingual translation as probab… ▽ More Generalization and reliability of multilingual translation often highly depend on the amount of available parallel data for each language pair of interest. In this paper, we focus on zero-shot generalization---a challenging setup that tests models on translation directions they have not been optimized for at training time. To solve the problem, we (i) reformulate multilingual translation as probabilistic inference, (ii) define the notion of zero-shot consistency and show why standard training often results in models unsuitable for zero-shot tasks, and (iii) introduce a consistent agreement-based training method that encourages the model to produce equivalent translations of parallel sentences in auxiliary languages. We test our multilingual NMT models on multiple public zero-shot translation benchmarks (IWSLT17, UN corpus, Europarl) and show that agreement-based learning often results in 2-3 BLEU zero-shot improvement over strong baselines without any loss in performance on supervised translation directions. △ Less

Submitted 10 April, 2019; v1 submitted 3 April, 2019; originally announced April 2019.

Comments: NAACL 2019 (14 pages, 5 figures)

arXiv:1808.01687 [pdf, ps, other]

Hybrid Subspace Learning for High-Dimensional Data

Authors: Micol Marchetti-Bowick, Benjamin J. Lengerich, Ankur P. Parikh, Eric P. Xing

Abstract: The high-dimensional data setting, in which p >> n, is a challenging statistical paradigm that appears in many real-world problems. In this setting, learning a compact, low-dimensional representation of the data can substantially help distinguish signal from noise. One way to achieve this goal is to perform subspace learning to estimate a small set of latent features that capture the majority of t… ▽ More The high-dimensional data setting, in which p >> n, is a challenging statistical paradigm that appears in many real-world problems. In this setting, learning a compact, low-dimensional representation of the data can substantially help distinguish signal from noise. One way to achieve this goal is to perform subspace learning to estimate a small set of latent features that capture the majority of the variance in the original data. Most existing subspace learning models, such as PCA, assume that the data can be fully represented by its embedding in one or more latent subspaces. However, in this work, we argue that this assumption is not suitable for many high-dimensional datasets; often only some variables can easily be projected to a low-dimensional space. We propose a hybrid dimensionality reduction technique in which some features are mapped to a low-dimensional subspace while others remain in the original space. Our model leads to more accurate estimation of the latent space and lower reconstruction error. We present a simple optimization procedure for the resulting biconvex problem and show synthetic data results that demonstrate the advantages of our approach over existing methods. Finally, we demonstrate the effectiveness of this method for extracting meaningful features from both gene expression and video background subtraction datasets. △ Less

Submitted 5 August, 2018; originally announced August 2018.

arXiv:1804.07726 [pdf, other]

Phrase-Indexed Question Answering: A New Challenge for Scalable Document Comprehension

Authors: Minjoon Seo, Tom Kwiatkowski, Ankur P. Parikh, Ali Farhadi, Hannaneh Hajishirzi

Abstract: We formalize a new modular variant of current question answering tasks by enforcing complete independence of the document encoder from the question encoder. This formulation addresses a key challenge in machine comprehension by requiring a standalone representation of the document discourse. It additionally leads to a significant scalability advantage since the encoding of the answer candidate phr… ▽ More We formalize a new modular variant of current question answering tasks by enforcing complete independence of the document encoder from the question encoder. This formulation addresses a key challenge in machine comprehension by requiring a standalone representation of the document discourse. It additionally leads to a significant scalability advantage since the encoding of the answer candidate phrases in the document can be pre-computed and indexed offline for efficient retrieval. We experiment with baseline models for the new task, which achieve a reasonable accuracy but significantly underperform unconstrained QA models. We invite the QA research community to engage in Phrase-Indexed Question Answering (PIQA, pika) for closing the gap. The leaderboard is at: nlp.cs.washington.edu/piqa △ Less

Submitted 26 September, 2018; v1 submitted 20 April, 2018; originally announced April 2018.

Comments: EMNLP 2018 short; 6 pages

arXiv:1711.00894 [pdf, other]

Multi-Mention Learning for Reading Comprehension with Neural Cascades

Authors: Swabha Swayamdipta, Ankur P. Parikh, Tom Kwiatkowski

Abstract: Reading comprehension is a challenging task, especially when executed across longer or across multiple evidence documents, where the answer is likely to reoccur. Existing neural architectures typically do not scale to the entire evidence, and hence, resort to selecting a single passage in the document (either via truncation or other means), and carefully searching for the answer within that passag… ▽ More Reading comprehension is a challenging task, especially when executed across longer or across multiple evidence documents, where the answer is likely to reoccur. Existing neural architectures typically do not scale to the entire evidence, and hence, resort to selecting a single passage in the document (either via truncation or other means), and carefully searching for the answer within that passage. However, in some cases, this strategy can be suboptimal, since by focusing on a specific passage, it becomes difficult to leverage multiple mentions of the same answer throughout the document. In this work, we take a different approach by constructing lightweight models that are combined in a cascade to find the answer. Each submodel consists only of feed-forward networks equipped with an attention mechanism, making it trivially parallelizable. We show that our approach can scale to approximately an order of magnitude larger evidence documents and can aggregate information at the representation level from multiple mentions of each answer candidate across the document. Empirically, our approach achieves state-of-the-art performance on both the Wikipedia and web domains of the TriviaQA dataset, outperforming more complex, recurrent architectures. △ Less

Submitted 30 May, 2018; v1 submitted 2 November, 2017; originally announced November 2017.

Comments: Proceedings of ICLR 2018

arXiv:1606.01933 [pdf, other]

A Decomposable Attention Model for Natural Language Inference

Authors: Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit

Abstract: We propose a simple neural architecture for natural language inference. Our approach uses attention to decompose the problem into subproblems that can be solved separately, thus making it trivially parallelizable. On the Stanford Natural Language Inference (SNLI) dataset, we obtain state-of-the-art results with almost an order of magnitude fewer parameters than previous work and without relying on… ▽ More We propose a simple neural architecture for natural language inference. Our approach uses attention to decompose the problem into subproblems that can be solved separately, thus making it trivially parallelizable. On the Stanford Natural Language Inference (SNLI) dataset, we obtain state-of-the-art results with almost an order of magnitude fewer parameters than previous work and without relying on any word-order information. Adding intra-sentence attention that takes a minimum amount of order into account yields further improvements. △ Less

Submitted 25 September, 2016; v1 submitted 6 June, 2016; originally announced June 2016.

Comments: 7 pages, 1 figure, Proceeedings of EMNLP 2016

arXiv:1401.3413 [pdf, other]

Infinite Mixed Membership Matrix Factorization

Authors: Avneesh Saluja, Mahdi Pakdaman, Dongzhen Piao, Ankur P. Parikh

Abstract: Rating and recommendation systems have become a popular application area for applying a suite of machine learning techniques. Current approaches rely primarily on probabilistic interpretations and extensions of matrix factorization, which factorizes a user-item ratings matrix into latent user and item vectors. Most of these methods fail to model significant variations in item ratings from otherwis… ▽ More Rating and recommendation systems have become a popular application area for applying a suite of machine learning techniques. Current approaches rely primarily on probabilistic interpretations and extensions of matrix factorization, which factorizes a user-item ratings matrix into latent user and item vectors. Most of these methods fail to model significant variations in item ratings from otherwise similar users, a phenomenon known as the "Napoleon Dynamite" effect. Recent efforts have addressed this problem by adding a contextual bias term to the rating, which captures the mood under which a user rates an item or the context in which an item is rated by a user. In this work, we extend this model in a nonparametric sense by learning the optimal number of moods or contexts from the data, and derive Gibbs sampling inference procedures for our model. We evaluate our approach on the MovieLens 1M dataset, and show significant improvements over the optimal parametric baseline, more than twice the improvements previously encountered for this task. We also extract and evaluate a DBLP dataset, wherein we predict the number of papers co-authored by two authors, and present improvements over the parametric baseline on this alternative domain as well. △ Less

Submitted 14 January, 2014; originally announced January 2014.

Comments: For ICDM 2013 Workshop Proceedings

arXiv:1312.7077 [pdf, other]

Language Modeling with Power Low Rank Ensembles

Authors: Ankur P. Parikh, Avneesh Saluja, Chris Dyer, Eric P. Xing

Abstract: We present power low rank ensembles (PLRE), a flexible framework for n-gram language modeling where ensembles of low rank matrices and tensors are used to obtain smoothed probability estimates of words in context. Our method can be understood as a generalization of n-gram modeling to non-integer n, and includes standard techniques such as absolute discounting and Kneser-Ney smoothing as special ca… ▽ More We present power low rank ensembles (PLRE), a flexible framework for n-gram language modeling where ensembles of low rank matrices and tensors are used to obtain smoothed probability estimates of words in context. Our method can be understood as a generalization of n-gram modeling to non-integer n, and includes standard techniques such as absolute discounting and Kneser-Ney smoothing as special cases. PLRE training is efficient and our approach outperforms state-of-the-art modified Kneser Ney baselines in terms of perplexity on large corpora as well as on BLEU score in a downstream machine translation task. △ Less

Submitted 3 October, 2014; v1 submitted 26 December, 2013; originally announced December 2013.

arXiv:1210.4884 [pdf]

A Spectral Algorithm for Latent Junction Trees

Authors: Ankur P. Parikh, Le Song, Mariya Ishteva, Gabi Teodoru, Eric P. Xing

Abstract: Latent variable models are an elegant framework for capturing rich probabilistic dependencies in many applications. However, current approaches typically parametrize these models using conditional probability tables, and learning relies predominantly on local search heuristics such as Expectation Maximization. Using tensor algebra, we propose an alternative parameterization of latent variable mode… ▽ More Latent variable models are an elegant framework for capturing rich probabilistic dependencies in many applications. However, current approaches typically parametrize these models using conditional probability tables, and learning relies predominantly on local search heuristics such as Expectation Maximization. Using tensor algebra, we propose an alternative parameterization of latent variable models (where the model structures are junction trees) that still allows for computation of marginals among observed variables. While this novel representation leads to a moderate increase in the number of parameters for junction trees of low treewidth, it lets us design a local-minimum-free algorithm for learning this parameterization. The main computation of the algorithm involves only tensor operations and SVDs which can be orders of magnitude faster than EM algorithms for large datasets. To our knowledge, this is the first provably consistent parameter learning technique for a large class of low-treewidth latent graphical models beyond trees. We demonstrate the advantages of our method on synthetic and real datasets. △ Less

Submitted 16 October, 2012; originally announced October 2012.

Comments: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012)

Report number: UAI-P-2012-PG-675-684

arXiv:1010.1868 [pdf, other]

Infinite Hierarchical MMSB Model for Nested Communities/Groups in Social Networks

Authors: Qirong Ho, Ankur P. Parikh, Le Song, Eric P. Xing

Abstract: Actors in realistic social networks play not one but a number of diverse roles depending on whom they interact with, and a large number of such role-specific interactions collectively determine social communities and their organizations. Methods for analyzing social networks should capture these multi-faceted role-specific interactions, and, more interestingly, discover the latent organization or… ▽ More Actors in realistic social networks play not one but a number of diverse roles depending on whom they interact with, and a large number of such role-specific interactions collectively determine social communities and their organizations. Methods for analyzing social networks should capture these multi-faceted role-specific interactions, and, more interestingly, discover the latent organization or hierarchy of social communities. We propose a hierarchical Mixed Membership Stochastic Blockmodel to model the generation of hierarchies in social communities, selective membership of actors to subsets of these communities, and the resultant networks due to within- and cross-community interactions. Furthermore, to automatically discover these latent structures from social networks, we develop a Gibbs sampling algorithm for our model. We conduct extensive validation of our model using synthetic networks, and demonstrate the utility of our model in real-world datasets such as predator-prey networks and citation networks. △ Less

Submitted 9 October, 2010; originally announced October 2010.

Showing 1–28 of 28 results for author: Parikh, A P