Showing 1–17 of 17 results for author: Bawden, R

  1. arXiv:2406.08707  [pdf, other]

    cs.CL cs.CV

    mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

    Authors: Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden, Benoît Sagot

    Abstract: Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Preprint. Under review

  2. arXiv:2403.17220  [pdf, other]

    cs.CL

    Making Sentence Embeddings Robust to User-Generated Content

    Authors: Lydia Nishimwe, Benoît Sagot, Rachel Bawden

    Abstract: NLP models have been known to perform poorly on user-generated content (UGC), mainly because it presents a lot of lexical variations and deviates from the standard texts on which most of these models were trained. In this work, we focus on the robustness of LASER, a sentence embedding model, to UGC data. We evaluate this robustness by LASER's ability to represent non-standard sentences and their s…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Accepted at LREC-COLING 2024

  3. arXiv:2305.14012  [pdf, other]

    cs.CL

    When your Cousin has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages

    Authors: Niyati Bafna, Cristina España-Bonet, Josef van Genabith, Benoît Sagot, Rachel Bawden

    Abstract: Most existing approaches for unsupervised bilingual lexicon induction (BLI) depend on good quality static or contextual embeddings requiring large monolingual corpora for both languages. However, unsupervised BLI is most likely to be useful for low-resource languages (LRLs), where large datasets are not available. Often we are interested in building bilingual resources for LRLs against related hig…

    Submitted 25 March, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 9 pages, Accepted at LREC-COLING 2024

  4. arXiv:2305.03207  [pdf, other]

    cs.CL cs.AI

    Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages

    Authors: Sonal Sannigrahi, Rachel Bawden

    Abstract: Multilingual language models have shown impressive cross-lingual transfer ability across a diverse set of languages and tasks. To improve the cross-lingual ability of these models, some strategies include transliteration and finer-grained segmentation into characters as opposed to subwords. In this work, we investigate lexical sharing in multilingual machine translation (MT) from Hindi, Gujarati,…

    Submitted 4 May, 2023; originally announced May 2023.

    Comments: EAMT main conference

  5. arXiv:2303.01911  [pdf, ps, other]

    cs.CL

    Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM

    Authors: Rachel Bawden, François Yvon

    Abstract: The NLP community recently saw the release of a new large open-access multilingual language model, BLOOM (BigScience et al., 2022) covering 46 languages. We focus on BLOOM's multilingual ability by evaluating its machine translation performance across several datasets (WMT, Flores-101 and DiaBLa) and language pairs (high- and low-resourced). Our results show that 0-shot performance suffers from ov…

    Submitted 9 May, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

    Comments: Accepted at EAMT 2023

  6. Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation

    Authors: Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, Rachel Bawden

    Abstract: One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images. However, recent work in multimodal MT (MMT) has shown that obtaining improvements from images is challenging, limited not only by the difficulty of building effective cross-modal representations, but also by the lack of specific evaluation and training d…

    Submitted 26 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Accepted to ACL 2023

  7. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  8. arXiv:2205.12394  [pdf, other]

    cs.CL

    MaskEval: Weighted MLM-Based Evaluation for Text Summarization and Simplification

    Authors: Yu Lu Liu, Rachel Bawden, Thomas Scialom, Benoît Sagot, Jackie Chi Kit Cheung

    Abstract: In text summarization and simplification, system outputs must be evaluated along multiple dimensions such as relevance, factual consistency, fluency, and grammaticality, and a wide range of possible outputs could be of high quality. These properties make the development of an adaptable, reference-less evaluation metric both necessary and challenging. We introduce MaskEval, a reference-less metric…

    Submitted 13 October, 2022; v1 submitted 24 May, 2022; originally announced May 2022.

  9. arXiv:2202.09452  [pdf, other]

    cs.CL

    From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

    Authors: Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué, Rachel Bawden, Philippe Gambette, Benoît Sagot

    Abstract: Language models for historical states of language are becoming increasingly important to allow the optimal digitisation and analysis of old textual sources. Because these historical states are at the same time more complex to process and more scarce in the corpora available, specific efforts are necessary to train natural language processing (NLP) tools adapted to the data. In this paper, we prese…

    Submitted 18 February, 2022; originally announced February 2022.

    Comments: 8 pages, 2 figures, 4 tables

  10. arXiv:2110.08207  [pdf, other]

    cs.LG cs.CL

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Authors: Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, et al. (16 additional authors not shown)

    Abstract: Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale,…

    Submitted 17 March, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: ICLR 2022 Spotlight (with extended discussion)

  11. arXiv:2109.00486  [pdf, other]

    cs.CL

    Survey of Low-Resource Machine Translation

    Authors: Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl, Alexandra Birch

    Abstract: We present a survey covering the state of the art in low-resource machine translation research. There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated train…

    Submitted 7 February, 2022; v1 submitted 1 September, 2021; originally announced September 2021.

  12. arXiv:2103.16911  [pdf, other]

    cs.CL

    Few-shot learning through contextual data augmentation

    Authors: Farid Arthaud, Rachel Bawden, Alexandra Birch

    Abstract: Machine translation (MT) models used in industries with constantly changing topics, such as translation or news agencies, need to adapt to new data to maintain their performance over time. Our aim is to teach a pre-trained MT model to translate previously unseen words accurately, based on very few examples. We propose (i) an experimental setup allowing us to simulate novel vocabulary appearing in…

    Submitted 31 March, 2021; originally announced March 2021.

    Comments: 14 pages including 3 of appendices

  13. arXiv:2004.14989  [pdf, ps, other]

    cs.CL

    A Study in Improving BLEU Reference Coverage with Diverse Automatic Paraphrasing

    Authors: Rachel Bawden, Biao Zhang, Lisa Yankovskaya, Andre Tättar, Matt Post

    Abstract: We investigate a long-perceived shortcoming in the typical use of BLEU: its reliance on a single reference. Using modern neural paraphrasing techniques, we study whether automatically generating additional diverse references can provide better coverage of the space of valid translations and thereby improve its correlation with human judgments. Our experiments on the into-English language direction…

    Submitted 8 October, 2020; v1 submitted 30 April, 2020; originally announced April 2020.

    Comments: Accepted in the Findings of EMNLP 2020

  14. arXiv:1912.06598  [pdf, other]

    cs.CL

    Document Sub-structure in Neural Machine Translation

    Authors: Radina Dobreva, Jie Zhou, Rachel Bawden

    Abstract: Current approaches to machine translation (MT) either translate sentences in isolation, disregarding the context they appear in, or model context at the level of the full document, without a notion of any internal structure the document may have. In this work we consider the fact that documents are rarely homogeneous blocks of text, but rather consist of parts covering different topics. Some docum…

    Submitted 10 March, 2020; v1 submitted 13 December, 2019; originally announced December 2019.

    Comments: Accepted at LREC 2020

  15. arXiv:1907.05854  [pdf, other]

    cs.CL

    The University of Edinburgh's Submissions to the WMT19 News Translation Task

    Authors: Rachel Bawden, Nikolay Bogoychev, Ulrich Germann, Roman Grundkiewicz, Faheem Kirefu, Antonio Valerio Miceli Barone, Alexandra Birch

    Abstract: The University of Edinburgh participated in the WMT19 Shared Task on News Translation in six language directions: English-to-Gujarati, Gujarati-to-English, English-to-Chinese, Chinese-to-English, German-to-English, and English-to-Czech. For all translation directions, we created or used back-translations of monolingual data in the target language as additional synthetic training data. For English-…

    Submitted 12 July, 2019; originally announced July 2019.

    Comments: To appear in the Proceedings of WMT19: Shared Task Papers

  16. arXiv:1905.13354  [pdf, other]

    cs.CL

    DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation

    Authors: Rachel Bawden, Sophie Rosset, Thomas Lavergne, Eric Bilinski

    Abstract: We present a new English-French test set for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5,700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality…

    Submitted 30 May, 2019; originally announced May 2019.

  17. arXiv:1711.00513  [pdf, other]

    cs.CL

    Evaluating Discourse Phenomena in Neural Machine Translation

    Authors: Rachel Bawden, Rico Sennrich, Alexandra Birch, Barry Haddow

    Abstract: For machine translation to tackle discourse phenomena, models must have access to extra-sentential linguistic context. There has been recent interest in modelling context in neural machine translation (NMT), but models have been principally evaluated with standard automatic metrics, poorly adapted to evaluating discourse phenomena. In this article, we present hand-crafted, discourse test sets, des…

    Submitted 20 April, 2018; v1 submitted 1 November, 2017; originally announced November 2017.

    Comments: Final version of paper to appear in Proceedings of NAACL 2018