Showing 1–38 of 38 results for author: Wieting, J

Searching in archive cs.
  1. arXiv:2406.14517  [pdf, other]

    cs.LG cs.AI cs.CL cs.CR

    PostMark: A Robust Blackbox Watermark for Large Language Models

    Authors: Yapei Chang, Kalpesh Krishna, Amir Houmansadr, John Wieting, Mohit Iyyer

    Abstract: The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. I…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: preprint; 18 pages, 5 figures

  2. arXiv:2403.05530  [pdf, other]

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  3. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  4. arXiv:2311.05800  [pdf, other]

    cs.IR cs.AI cs.CL

    Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval

    Authors: Nandan Thakur, Jianmo Ni, Gustavo Hernández Ábrego, John Wieting, Jimmy Lin, Daniel Cer

    Abstract: There has been limited success for dense retrieval models in multilingual retrieval, due to uneven and scarce training data available across multiple languages. Synthetic training data generation is promising (e.g., InPars or Promptagator), but has been investigated only for English. Therefore, to study model capabilities across both cross-lingual and monolingual retrieval tasks, we develop SWIM-I…

    Submitted 15 April, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

    Comments: Accepted at NAACL 2024. Data released at https://github.com/google-research-datasets/swim-ir

  5. arXiv:2310.14542  [pdf, other]

    cs.CL

    Evaluating Large Language Models on Controlled Generation Tasks

    Authors: Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Frederick Wieting, Nanyun Peng, Xuezhe Ma

    Abstract: While recent studies have looked into the abilities of large language models in various benchmark tasks, including question generation, reading comprehension, multilingual and etc, there have been few studies looking into the controllability of large language models on generation tasks. We present an extensive analysis of various benchmarks including a sentence planning benchmark with different gr…

    Submitted 22 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023

  6. arXiv:2309.04663  [pdf, other]

    cs.CL cs.AI

    FIAT: Fusing learning paradigms with Instruction-Accelerated Tuning

    Authors: Xinyi Wang, John Wieting, Jonathan H. Clark

    Abstract: Learning paradigms for large language models (LLMs) currently tend to fall within either in-context learning (ICL) or full fine-tuning. Each of these comes with their own trade-offs based on available data, model size, compute cost, ease-of-use, and final quality with neither solution performing well across-the-board. In this article, we first describe ICL and fine-tuning paradigms in a way that h…

    Submitted 12 September, 2023; v1 submitted 8 September, 2023; originally announced September 2023.

  7. arXiv:2305.14332  [pdf, other]

    cs.CL

    Evaluating and Modeling Attribution for Cross-Lingual Question Answering

    Authors: Benjamin Muller, John Wieting, Jonathan H. Clark, Tom Kwiatkowski, Sebastian Ruder, Livio Baldini Soares, Roee Aharoni, Jonathan Herzig, Xinyi Wang

    Abstract: Trustworthy answer content is abundant in many high-resource languages and is instantly accessible through question answering systems, yet this content can be hard to access for those that do not speak these languages. The leap forward in cross-lingual modeling quality offered by generative language models offers much promise, yet their raw generations often fall short in factuality. To improve tr…

    Submitted 15 November, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Published as a long paper at EMNLP 2023

  8. XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages

    Authors: Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel A. Sarr, Xinyi Wang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana L. Dickinson, Brian Roark, Bidisha Samanta, Connie Tao, David I. Adelani, Vera Axelrod, Isaac Caswell, Colin Cherry, Dan Garrette, Reeve Ingle, Melvin Johnson, et al. (2 additional authors not shown)

    Abstract: Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot;…

    Submitted 24 May, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

  9. arXiv:2305.10403  [pdf, other]

    cs.CL cs.AI

    PaLM 2 Technical Report

    Authors: Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, et al. (103 additional authors not shown)

    Abstract: We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on…

    Submitted 13 September, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

  10. arXiv:2303.16750  [pdf, other]

    cs.IR cs.DL cs.LG

    A Gold Standard Dataset for the Reviewer Assignment Problem

    Authors: Ivan Stelmakh, John Wieting, Graham Neubig, Nihar B. Shah

    Abstract: Many peer-review venues are either using or looking to use algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the "similarity score"--a numerical estimate of the expertise of a reviewer in reviewing a paper--and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison,…

    Submitted 23 March, 2023; originally announced March 2023.

  11. arXiv:2303.13408  [pdf, other]

    cs.CL cs.CR cs.LG

    Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense

    Authors: Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, Mohit Iyyer

    Abstract: The rise in malicious usage of large language models, such as fake content creation and academic plagiarism, has motivated the development of approaches that identify AI-generated text, including those based on watermarking or outlier detection. However, the robustness of these detection algorithms to paraphrases of AI-generated text remains unclear. To stress test these detectors, we build a 11B…

    Submitted 17 October, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

    Comments: NeurIPS 2023 camera ready (32 pages). Code, models, data available in https://github.com/martiansideofthemoon/ai-detection-paraphrases

  12. arXiv:2212.10726  [pdf, other]

    cs.CL cs.LG

    Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval

    Authors: John Wieting, Jonathan H. Clark, William W. Cohen, Graham Neubig, Taylor Berg-Kirkpatrick

    Abstract: Contrastive learning has been successfully used for retrieval of semantically aligned sentences, but it often requires large batch sizes or careful engineering to work well. In this paper, we instead propose a generative model for learning multilingual text embeddings which can be used to retrieve or score sentence pairs. Our model operates on parallel data in $N$ languages and, through an approxi…

    Submitted 4 June, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Published as a long paper at ACL 2023

  13. arXiv:2210.14250  [pdf, other]

    cs.CL

    Exploring Document-Level Literary Machine Translation with Parallel Paragraphs from World Literature

    Authors: Katherine Thai, Marzena Karpinska, Kalpesh Krishna, Bill Ray, Moira Inghilleri, John Wieting, Mohit Iyyer

    Abstract: Literary translation is a culturally significant task, but it is bottlenecked by the small number of qualified literary translators relative to the many untranslated works published around the world. Machine translation (MT) holds potential to complement the work of human translators by improving both training procedures and their overall efficiency. Literary translation is less constrained than m…

    Submitted 25 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022

  14. arXiv:2207.00630  [pdf, other]

    cs.AI

    QA Is the New KR: Question-Answer Pairs as Knowledge Bases

    Authors: Wenhu Chen, William W. Cohen, Michiel De Jong, Nitish Gupta, Alessandro Presta, Pat Verga, John Wieting

    Abstract: In this position paper, we propose a new approach to generating a type of knowledge base (KB) from text, based on question generation and entity linking. We argue that the proposed type of KB has many of the key advantages of a traditional symbolic KB: in particular, it consists of small modular components, which can be combined compositionally to answer complex queries, including relational queri…

    Submitted 1 July, 2022; originally announced July 2022.

  15. arXiv:2205.09726  [pdf, other]

    cs.CL cs.LG

    RankGen: Improving Text Generation with Large Ranking Models

    Authors: Kalpesh Krishna, Yapei Chang, John Wieting, Mohit Iyyer

    Abstract: Given an input sequence (or prefix), modern language models often assign high probabilities to output sequences that are repetitive, incoherent, or irrelevant to the prefix; as such, model-generated text also contains such artifacts. To address these issues we present RankGen, a 1.2B parameter encoder model for English that scores model generations given a prefix. RankGen can be flexibly incorpora…

    Submitted 14 November, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

    Comments: EMNLP 2022 (34 pages), model checkpoints available at https://github.com/martiansideofthemoon/rankgen. Added comparisons to newer decoding methods (contrastive search, contrastive decoding, eta sampling)

  16. arXiv:2204.13761  [pdf, other]

    cs.CL

    Faithful to the Document or to the World? Mitigating Hallucinations via Entity-linked Knowledge in Abstractive Summarization

    Authors: Yue Dong, John Wieting, Pat Verga

    Abstract: Despite recent advances in abstractive summarization, current summarization systems still suffer from content hallucinations where models generate text that is either irrelevant or contradictory to the source document. However, prior work has been predicated on the assumption that any generated facts not appearing explicitly in the source are undesired hallucinations. Methods have been proposed to…

    Submitted 28 April, 2022; originally announced April 2022.

    Comments: 12 pages, 5 figures

  17. arXiv:2204.04581  [pdf, other]

    cs.CL cs.AI cs.LG

    Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering

    Authors: Wenhu Chen, Pat Verga, Michiel de Jong, John Wieting, William Cohen

    Abstract: Retrieval augmented language models have recently become the standard for knowledge intensive tasks. Rather than relying purely on latent semantics within the parameters of large neural models, these methods enlist a semi-parametric memory to encode an index of knowledge for the model to retrieve over. Most prior work has employed text passages as the unit of knowledge, which has high coverage at…

    Submitted 23 January, 2023; v1 submitted 9 April, 2022; originally announced April 2022.

    Comments: Accepted by EACL 2023

  18. arXiv:2110.13231  [pdf, other]

    cs.CL

    Improving the Diversity of Unsupervised Paraphrasing with Embedding Outputs

    Authors: Monisha Jegadeesan, Sachin Kumar, John Wieting, Yulia Tsvetkov

    Abstract: We present a novel technique for zero-shot paraphrase generation. The key contribution is an end-to-end multilingual paraphrasing model that is trained using translated parallel corpora to generate paraphrases into "meaning spaces" -- replacing the final softmax layer with word embeddings. This architectural modification, plus a training procedure that incorporates an autoencoding objective, enabl…

    Submitted 25 October, 2021; originally announced October 2021.

  19. arXiv:2110.08381  [pdf, other]

    cs.CL

    On The Ingredients of an Effective Zero-shot Semantic Parser

    Authors: Pengcheng Yin, John Wieting, Avirup Sil, Graham Neubig

    Abstract: Semantic parsers map natural language utterances into meaning representations (e.g., programs). Such models are typically bottlenecked by the paucity of training data due to the required laborious annotation efforts. Recent studies have performed zero-shot learning by synthesizing training examples of canonical utterances and programs from a grammar, and further paraphrasing these utterances to im…

    Submitted 15 October, 2021; originally announced October 2021.

  20. arXiv:2104.15114  [pdf, other]

    cs.CL

    Paraphrastic Representations at Scale

    Authors: John Wieting, Kevin Gimpel, Graham Neubig, Taylor Berg-Kirkpatrick

    Abstract: We present a system that allows users to train their own state-of-the-art paraphrastic sentence representations in a variety of languages. We also release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese. We train these models on large amounts of data, achieving significantly improved performance from the original papers proposing the methods on a suite of…

    Submitted 4 June, 2023; v1 submitted 30 April, 2021; originally announced April 2021.

    Comments: Published as a demo paper at EMNLP 2022

  21. CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

    Authors: Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting

    Abstract: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a m…

    Submitted 18 May, 2022; v1 submitted 11 March, 2021; originally announced March 2021.

    Comments: TACL Final Version

    Journal ref: Transactions of the Association for Computational Linguistics (2022) 10: 73--91

  22. arXiv:2010.12771  [pdf, other]

    cs.CL

    On Learning Text Style Transfer with Direct Rewards

    Authors: Yixin Liu, Graham Neubig, John Wieting

    Abstract: In most cases, the lack of parallel corpora makes it impossible to directly train supervised models for the text style transfer task. In this paper, we explore training algorithms that instead optimize reward functions that explicitly consider different aspects of the style-transferred outputs. In particular, we leverage semantic similarity metrics originally used for fine-tuning neural machine tr…

    Submitted 13 May, 2021; v1 submitted 24 October, 2020; originally announced October 2020.

    Comments: Published as a long paper at NAACL 2021

  23. arXiv:2010.05700  [pdf, other]

    cs.CL

    Reformulating Unsupervised Style Transfer as Paraphrase Generation

    Authors: Kalpesh Krishna, John Wieting, Mohit Iyyer

    Abstract: Modern NLP defines the task of style transfer as modifying the style of a given sentence without appreciably changing its semantics, which implies that the outputs of style transfer systems should be paraphrases of their inputs. However, many existing systems purportedly designed for style transfer inherently warp the input's meaning through attribute transfer, which changes semantic properties su…

    Submitted 12 October, 2020; originally announced October 2020.

    Comments: EMNLP 2020 camera-ready (26 pages)

  24. arXiv:2003.01343  [pdf]

    cs.CL

    Improving Candidate Generation for Low-resource Cross-lingual Entity Linking

    Authors: Shuyan Zhou, Shruti Rijhwani, John Wieting, Jaime Carbonell, Graham Neubig

    Abstract: Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relati…

    Submitted 3 March, 2020; originally announced March 2020.

    Comments: Accepted to TACL 2020

  25. arXiv:1911.03895  [pdf, other]

    cs.CL cs.LG

    A Bilingual Generative Transformer for Semantic Sentence Embedding

    Authors: John Wieting, Graham Neubig, Taylor Berg-Kirkpatrick

    Abstract: Semantic sentence embedding models encode natural language sentences into vectors, such that closeness in embedding space indicates closeness in the semantics between the sentences. Bilingual data offers a useful signal for learning such embeddings: properties shared by both sentences in a translation pair are likely semantic, while divergent properties are likely stylistic or language-specific. W…

    Submitted 19 November, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

    Comments: Published as a long paper at EMNLP 2020

  26. arXiv:1909.13872  [pdf, other]

    cs.CL

    Simple and Effective Paraphrastic Similarity from Parallel Translations

    Authors: John Wieting, Kevin Gimpel, Graham Neubig, Taylor Berg-Kirkpatrick

    Abstract: We present a model and methodology for learning paraphrastic sentence embeddings directly from bitext, removing the time-consuming intermediate step of creating paraphrase corpora. Further, we show that the resulting model can be applied to cross-lingual tasks where it both outperforms and is orders of magnitude faster than more complex state-of-the-art baselines.

    Submitted 30 September, 2019; originally announced September 2019.

    Comments: Published as a short paper at ACL 2019

  27. arXiv:1909.06694  [pdf, other]

    cs.CL

    Beyond BLEU: Training Neural Machine Translation with Semantic Similarity

    Authors: John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, Graham Neubig

    Abstract: While most neural machine translation (NMT) systems are still trained using maximum likelihood estimation, recent work has demonstrated that optimizing systems to directly improve evaluation metrics such as BLEU can substantially improve final translation accuracy. However, training with BLEU has some limitations: it doesn't assign partial credit, it has a limited range of output values, and it ca…

    Submitted 14 September, 2019; originally announced September 2019.

    Comments: Published as a long paper at ACL 2019

  28. arXiv:1903.07926  [pdf, other]

    cs.CL

    compare-mt: A Tool for Holistic Comparison of Language Generation Systems

    Authors: Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, Xinyi Wang, John Wieting

    Abstract: In this paper, we describe compare-mt, a tool for holistic analysis and comparison of the results of systems for language generation tasks such as machine translation. The main goal of the tool is to give the user a high-level and coherent view of the salient differences between systems that can then be used to guide further analysis or system improvement. It implements a number of tools to do so,…

    Submitted 19 September, 2019; v1 submitted 19 March, 2019; originally announced March 2019.

    Comments: Updated and longer version of NAACL 2019 Demo Paper

  29. arXiv:1901.10444  [pdf, other]

    cs.CL

    No Training Required: Exploring Random Encoders for Sentence Classification

    Authors: John Wieting, Douwe Kiela

    Abstract: We explore various methods for computing sentence representations from pre-trained word embeddings without any training, i.e., using nothing but random parameterizations. Our aim is to put sentence embeddings on more solid footing by 1) looking at how much modern sentence embeddings gain over random methods---as it turns out, surprisingly little; and by 2) providing the field with more appropriate…

    Submitted 29 January, 2019; originally announced January 2019.

    Comments: Published as a conference paper at ICLR 2019

  30. arXiv:1804.06059  [pdf, other]

    cs.CL

    Adversarial Example Generation with Syntactically Controlled Paraphrase Networks

    Authors: Mohit Iyyer, John Wieting, Kevin Gimpel, Luke Zettlemoyer

    Abstract: We propose syntactically controlled paraphrase networks (SCPNs) and use them to generate adversarial examples. Given a sentence and a target syntactic form (e.g., a constituency parse), SCPNs are trained to produce a paraphrase of the sentence with the desired syntax. We show it is possible to create training data for this task by first doing backtranslation at a very large scale, and then using a…

    Submitted 17 April, 2018; originally announced April 2018.

    Comments: NAACL 2018

  31. arXiv:1711.05732  [pdf, other]

    cs.CL

    ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations

    Authors: John Wieting, Kevin Gimpel

    Abstract: We describe PARANMT-50M, a dataset of more than 50 million English-English sentential paraphrase pairs. We generated the pairs automatically by using neural machine translation to translate the non-English side of a large parallel corpus, following Wieting et al. (2017). Our hope is that ParaNMT-50M can be a valuable resource for paraphrase generation and can provide a rich source of semantic know…

    Submitted 20 April, 2018; v1 submitted 15 November, 2017; originally announced November 2017.

  32. arXiv:1706.01847  [pdf, other]

    cs.CL

    Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext

    Authors: John Wieting, Jonathan Mallinson, Kevin Gimpel

    Abstract: We consider the problem of learning general-purpose, paraphrastic sentence embeddings in the setting of Wieting et al. (2016b). We use neural machine translation to generate sentential paraphrases via back-translation of bilingual sentence pairs. We evaluate the paraphrase pairs by their ability to serve as training data for learning paraphrastic sentence embeddings. We find that the data quality…

    Submitted 6 June, 2017; originally announced June 2017.

  33. arXiv:1705.00364  [pdf, ps, other]

    cs.CL

    Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings

    Authors: John Wieting, Kevin Gimpel

    Abstract: We consider the problem of learning general-purpose, paraphrastic sentence embeddings, revisiting the setting of Wieting et al. (2016b). While they found LSTM recurrent networks to underperform word averaging, we present several developments that together produce the opposite conclusion. These include training on sentence pairs rather than phrase pairs, averaging states to represent sequences, and…

    Submitted 30 April, 2017; originally announced May 2017.

    Comments: Published as a long paper at ACL 2017

  34. arXiv:1607.02789  [pdf, ps, other]

    cs.CL

    Charagram: Embedding Words and Sentences via Character n-grams

    Authors: John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu

    Abstract: We present Charagram embeddings, a simple approach for learning character-based compositional models to embed textual sequences. A word or sentence is represented using a character n-gram count vector, followed by a single nonlinear transformation to yield a low-dimensional embedding. We use three tasks for evaluation: word similarity, sentence similarity, and part-of-speech tagging. We demonstrat…

    Submitted 10 July, 2016; originally announced July 2016.

  35. arXiv:1511.08198  [pdf, ps, other]

    cs.CL cs.LG

    Towards Universal Paraphrastic Sentence Embeddings

    Authors: John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu

    Abstract: We consider the problem of learning general-purpose, paraphrastic sentence embeddings based on supervision from the Paraphrase Database (Ganitkevitch et al., 2013). We compare six compositional architectures, evaluating them on annotated textual similarity datasets drawn both from the same distribution as the training data and from a wide range of other domains. We find that the most complex archi…

    Submitted 4 March, 2016; v1 submitted 25 November, 2015; originally announced November 2015.

    Comments: Published as a conference paper at ICLR 2016

  36. arXiv:1508.06235  [pdf, other]

    stat.ML cs.AI cs.LG stat.CO

    Clustering With Side Information: From a Probabilistic Model to a Deterministic Algorithm

    Authors: Daniel Khashabi, John Wieting, Jeffrey Yufei Liu, Feng Liang

    Abstract: In this paper, we propose a model-based clustering method (TVClust) that robustly incorporates noisy side information as soft-constraints and aims to seek a consensus between side information and the observed data. Our method is based on a nonparametric Bayesian hierarchical model that combines the probabilistic model for the data instance and the one for the side-information. An efficient Gibbs s…

    Submitted 31 October, 2015; v1 submitted 25 August, 2015; originally announced August 2015.

  37. arXiv:1506.03487  [pdf, ps, other]

    cs.CL

    From Paraphrase Database to Compositional Paraphrase Model and Back

    Authors: John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu, Dan Roth

    Abstract: The Paraphrase Database (PPDB; Ganitkevitch et al., 2013) is an extensive semantic resource, consisting of a list of phrase pairs with (heuristic) confidence estimates. However, it is still unclear how it can best be used, due to the heuristic nature of the confidences and its necessarily incomplete coverage. We propose models to leverage the phrase pairs from the PPDB to build parametric paraphra…

    Submitted 26 August, 2015; v1 submitted 10 June, 2015; originally announced June 2015.

    Comments: 2015 TACL paper updated with an appendix describing new 300 dimensional embeddings. Submitted 1/2015. Accepted 2/2015. Published 6/2015

    Journal ref: TACL Vol 3 (2015) pg 345-358

  38. arXiv:1412.0751  [pdf, other]

    cs.CL

    Tiered Clustering to Improve Lexical Entailment

    Authors: John Wieting

    Abstract: Many tasks in Natural Language Processing involve recognizing lexical entailment. Two different approaches to this problem have been proposed recently that are quite different from each other. The first is an asymmetric similarity measure designed to give high scores when the contexts of the narrower term in the entailment are a subset of those of the broader term. The second is a supervised appro…

    Submitted 1 December, 2014; originally announced December 2014.

    Comments: Paper for course project for Advanced NLP Spring 2013. 8 pages