Showing 1–17 of 17 results for author: Caswell, I

Searching in archive cs.
  1. arXiv:2403.05530  [pdf, other]

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  2. arXiv:2311.06440  [pdf, other]

    cs.CL cs.LG

    Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text

    Authors: Isaac Caswell, Lisa Wang, Isabel Papadimitriou

    Abstract: Data quality is a problem that perpetually resurfaces throughout the field of NLP, regardless of task, domain, or architecture, and remains especially severe for lower-resource languages. A typical and insidious issue, affecting both training data and model output, is data that is repetitive and dominated by linguistically uninteresting boilerplate, such as price catalogs or computer-generated log…

    Submitted 10 November, 2023; originally announced November 2023.

    Comments: Accepted to GEM workshop 2023; 6 pages

  3. arXiv:2309.04662  [pdf, other]

    cs.CL cs.LG

    MADLAD-400: A Multilingual And Document-Level Large Audited Dataset

    Authors: Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat

    Abstract: We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering over 450 languages usi…

    Submitted 8 September, 2023; originally announced September 2023.

    Comments: Preprint

  4. XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages

    Authors: Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel A. Sarr, Xinyi Wang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana L. Dickinson, Brian Roark, Bidisha Samanta, Connie Tao, David I. Adelani, Vera Axelrod, Isaac Caswell, Colin Cherry, Dan Garrette, Reeve Ingle, Melvin Johnson, et al. (2 additional authors not shown)

    Abstract: Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot;…

    Submitted 24 May, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

  5. arXiv:2303.15265  [pdf, other]

    cs.CL cs.AI cs.LG

    Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation

    Authors: Alex Jones, Isaac Caswell, Ishank Saxena, Orhan Firat

    Abstract: Neural machine translation (NMT) has progressed rapidly over the past several years, and modern models are able to achieve relatively high quality using only monolingual text data, an approach dubbed Unsupervised Machine Translation (UNMT). However, these models still struggle in a variety of ways, including aspects of translation that for a human are the easiest - for instance, correctly translat…

    Submitted 27 March, 2023; originally announced March 2023.

    ACM Class: I.2.7

  6. arXiv:2205.03983  [pdf, other]

    cs.CL cs.AI cs.LG

    Building Machine Translation Systems for the Next Thousand Languages

    Authors: Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang, Alexander Gutkin, Apurva Shah, Yanping Huang, Zhifeng Chen, Yonghui Wu, Macduff Hughes

    Abstract: In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing…

    Submitted 6 July, 2022; v1 submitted 8 May, 2022; originally announced May 2022.

    Comments: V2: updated with some details from 24-language Google Translate launch in May 2022; V3: spelling corrections, additional acknowledgements

  7. arXiv:2201.03110  [pdf, other]

    cs.CL cs.LG

    Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning

    Authors: Aditya Siddhant, Ankur Bapna, Orhan Firat, Yuan Cao, Mia Xu Chen, Isaac Caswell, Xavier Garcia

    Abstract: Achieving universal translation between all human language pairs is the holy-grail of machine translation (MT) research. While recent progress in massively multilingual MT is one step closer to reaching this goal, it is becoming evident that extending a multilingual MT system simply by training on more parallel data is unscalable, since the availability of labeled data for low-resource and non-Eng…

    Submitted 13 January, 2022; v1 submitted 9 January, 2022; originally announced January 2022.

  8. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

    Authors: Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, et al. (27 additional authors not shown)

    Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have system…

    Submitted 21 February, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: Accepted at TACL; pre-MIT Press publication version

    Journal ref: Transactions of the Association for Computational Linguistics (2022) 10: 50-72

  9. arXiv:2010.14571  [pdf, other]

    cs.CL cs.LG

    Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

    Authors: Isaac Caswell, Theresa Breiner, Daan van Esch, Ankur Bapna

    Abstract: Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models o…

    Submitted 29 October, 2020; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: Accepted to COLING 2020. 9 pages with 8 page abstract

  10. arXiv:2004.06063  [pdf, other]

    cs.CL cs.AI cs.LG

    BLEU might be Guilty but References are not Innocent

    Authors: Markus Freitag, David Grangier, Isaac Caswell

    Abstract: The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while choice of metric is important, the nature of the references is also critical. We study different methods to collect references and compare their value in automated evaluation by reporting correlation with human evaluation for…

    Submitted 20 October, 2020; v1 submitted 13 April, 2020; originally announced April 2020.

    Comments: Accepted at EMNLP 2020

  11. arXiv:1911.03823  [pdf, other]

    cs.CL

    Translationese as a Language in "Multilingual" NMT

    Authors: Parker Riley, Isaac Caswell, Markus Freitag, David Grangier

    Abstract: Machine translation has an undesirable propensity to produce "translationese" artifacts, which can lead to higher BLEU scores while being liked less by human raters. Motivated by this, we model translationese and original (i.e. natural) text as separate languages in a multilingual model, and pose the question: can we perform zero-shot translation between original source text and original target te…

    Submitted 9 July, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

    Journal ref: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020) 7737-7746

  12. arXiv:1909.02197  [pdf, other]

    cs.CL cs.LG

    Investigating Multilingual NMT Representations at Scale

    Authors: Sneha Reddy Kudugunta, Ankur Bapna, Isaac Caswell, Naveen Arivazhagan, Orhan Firat

    Abstract: Multilingual Neural Machine Translation (NMT) models have yielded large empirical success in transfer learning settings. However, these black-box representations are poorly understood, and their mode of transfer remains elusive. In this work, we attempt to understand massively multilingual NMT representations (with 103 languages) using Singular Value Canonical Correlation Analysis (SVCCA), a repre…

    Submitted 11 September, 2019; v1 submitted 4 September, 2019; originally announced September 2019.

    Comments: Paper at EMNLP 2019

  13. arXiv:1908.10940  [pdf, other]

    cs.CL cs.LG

    Learning a Multi-Domain Curriculum for Neural Machine Translation

    Authors: Wei Wang, Ye Tian, Jiquan Ngiam, Yinfei Yang, Isaac Caswell, Zarana Parekh

    Abstract: Most data selection research in machine translation focuses on improving a single domain. We perform data selection for multiple domains at once. This is achieved by carefully introducing instance-level domain-relevance features and automatically constructing a training curriculum to gradually concentrate on multi-domain relevant and noise-reduced data batches. Both the choice of features and the…

    Submitted 1 May, 2020; v1 submitted 28 August, 2019; originally announced August 2019.

    Comments: Accepted at ACL2020

  14. arXiv:1906.06442  [pdf, other]

    cs.CL cs.LG

    Tagged Back-Translation

    Authors: Isaac Caswell, Ciprian Chelba, David Grangier

    Abstract: Recent work in Neural Machine Translation (NMT) has shown significant quality gains from noised-beam decoding during back-translation, a method to generate synthetic parallel data. We show that the main role of such synthetic noise is not to diversify the source side, as previously suggested, but simply to indicate to the model that the given source is synthetic. We propose a simpler alternative t…

    Submitted 14 June, 2019; originally announced June 2019.

    Comments: Accepted as oral presentation in WMT 2019; 9 pages; 9 tables; 1 figure

  15. arXiv:1906.01130  [pdf, other]

    cs.CL cs.LG

    Dynamically Composing Domain-Data Selection with Clean-Data Selection by "Co-Curricular Learning" for Neural Machine Translation

    Authors: Wei Wang, Isaac Caswell, Ciprian Chelba

    Abstract: Noise and domain are important aspects of data quality for neural machine translation. Existing research focuses separately on domain-data selection, clean-data selection, or their static combination, leaving the dynamic interaction across them not explicitly examined. This paper introduces a "co-curricular learning" method to compose dynamic domain-data selection with dynamic clean-data selection,…

    Submitted 3 June, 2019; originally announced June 2019.

    Comments: 11 pages

    Journal ref: The 57th Annual Meeting of the Association for Computational Linguistics (ACL2019)

  16. arXiv:1904.04790  [pdf, other]

    cs.CL

    APE at Scale and its Implications on MT Evaluation Biases

    Authors: Markus Freitag, Isaac Caswell, Scott Roy

    Abstract: In this work, we train an Automatic Post-Editing (APE) model and use it to reveal biases in standard Machine Translation (MT) evaluation procedures. The goal of our APE model is to correct typical errors introduced by the translation process, and convert the "translationese" output into natural text. Our APE model is trained entirely on monolingual data that has been round-trip translated through…

    Submitted 14 June, 2019; v1 submitted 9 April, 2019; originally announced April 2019.

    Comments: Accepted at WMT 2019

  17. arXiv:1902.08295  [pdf, other]

    cs.LG stat.ML

    Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

    Authors: Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, et al. (66 additional authors not shown)

    Abstract: Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly w…

    Submitted 21 February, 2019; originally announced February 2019.