Showing 1–47 of 47 results for author: Artetxe, M

  1. arXiv:2406.07302  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    BertaQA: How Much Do Language Models Know About Local Culture?

    Authors: Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, Mikel Artetxe

    Abstract: Large Language Models (LLMs) exhibit extensive knowledge about the world, but most evaluations have been limited to global or anglocentric subjects. This raises the question of how well these models perform on topics relevant to other cultures, whose presence on the web is not that prominent. To address this gap, we introduce BertaQA, a multiple-choice trivia dataset that is parallel in English an…

    Submitted 11 June, 2024; originally announced June 2024.

  2. arXiv:2405.02287  [pdf, other]

    cs.CL cs.AI cs.CV

    Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

    Authors: Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Mikel Artetxe, Yi Tay

    Abstract: We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging with dual objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) rigorously testing a…

    Submitted 3 May, 2024; originally announced May 2024.

  3. arXiv:2404.12387  [pdf, other]

    cs.CL cs.CV

    Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

    Authors: Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, et al. (1 additional author not shown)

    Abstract: We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, images, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but al…

    Submitted 18 April, 2024; originally announced April 2024.

  4. arXiv:2403.20266  [pdf, other]

    cs.CL cs.AI cs.LG

    Latxa: An Open Language Model and Evaluation Suite for Basque

    Authors: Julen Etxaniz, Oscar Sainz, Naiara Perez, Itziar Aldabe, German Rigau, Eneko Agirre, Aitor Ormazabal, Mikel Artetxe, Aitor Soroa

    Abstract: We introduce Latxa, a family of large language models for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce 4 multiple choice evaluation datasets: EusProficiency, comprising 5,169 questions from…

    Submitted 29 March, 2024; originally announced March 2024.

  5. arXiv:2309.03175  [pdf, other]

    cs.CL

    Gender-specific Machine Translation with Large Language Models

    Authors: Eduardo Sánchez, Pierre Andrews, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà

    Abstract: While machine translation (MT) systems have seen significant improvements, it is still common for translations to reflect societal biases, such as gender bias. Decoder-only Large Language Models (LLMs) have demonstrated potential in MT, albeit with performance slightly lagging behind traditional encoder-decoder Neural Machine Translation (NMT) systems. However, LLMs offer a unique advantage: the a…

    Submitted 16 April, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

  6. arXiv:2308.16884  [pdf, other]

    cs.CL cs.AI cs.LG

    The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

    Authors: Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa

    Abstract: We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multip…

    Submitted 31 August, 2023; originally announced August 2023.

    Comments: 27 pages, 13 figures

    ACM Class: I.2.7

  7. arXiv:2308.12157  [pdf, other]

    cs.CL cs.AI

    Evaluation of Faithfulness Using the Longest Supported Subsequence

    Authors: Anirudh Mittal, Timo Schick, Mikel Artetxe, Jane Dwivedi-Yu

    Abstract: As increasingly sophisticated language models emerge, their trustworthiness becomes a pivotal issue, especially in tasks such as summarization and question-answering. Ensuring their responses are contextually grounded and faithful is challenging due to the linguistic diversity and the myriad of possible answers. In this paper, we introduce a novel approach to evaluate faithfulness of machine-gener…

    Submitted 23 August, 2023; originally announced August 2023.

  8. arXiv:2308.01223  [pdf, other]

    cs.CL cs.AI cs.LG

    Do Multilingual Language Models Think Better in English?

    Authors: Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, Mikel Artetxe

    Abstract: Translate-test is a popular technique to improve the performance of multilingual language models. This approach works by translating the input into English using an external machine translation system, and running inference over the translated input. However, these improvements can be attributed to the use of a separate translation system, which is typically trained on large amounts of parallel da…

    Submitted 2 August, 2023; originally announced August 2023.

  9. arXiv:2307.01163  [pdf, other]

    cs.CL cs.LG cs.NE

    Improving Language Plasticity via Pretraining with Active Forgetting

    Authors: Yihong Chen, Kelly Marchisio, Roberta Raileanu, David Ifeoluwa Adelani, Pontus Stenetorp, Sebastian Riedel, Mikel Artetxe

    Abstract: Pretrained language models (PLMs) are today the primary model for natural language processing. Despite their impressive downstream performance, it can be difficult to apply PLMs to new languages, a barrier to making their capabilities universally accessible. While prior work has shown it possible to address this issue by learning a new embedding layer for the new language, doing so is both data an…

    Submitted 12 January, 2024; v1 submitted 3 July, 2023; originally announced July 2023.

    Comments: NeurIPS 2023 Final Version

  10. arXiv:2305.16876  [pdf, other]

    cs.CL

    CombLM: Adapting Black-Box Language Models through Small Fine-Tuned Models

    Authors: Aitor Ormazabal, Mikel Artetxe, Eneko Agirre

    Abstract: Methods for adapting language models (LMs) to new tasks and domains have traditionally assumed white-box access to the model, and work by modifying its parameters. However, this is incompatible with a recent trend in the field, where the highest quality models are only available as black-boxes through inference APIs. Even when the model weights are available, the computational cost of fine-tuning…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: This previously appeared as arXiv:2205.12213v2, which was submitted as new by mistake

  11. arXiv:2305.14240  [pdf, other]

    cs.CL cs.AI cs.LG

    Revisiting Machine Translation for Cross-lingual Classification

    Authors: Mikel Artetxe, Vedanuj Goswami, Shruti Bhosale, Angela Fan, Luke Zettlemoyer

    Abstract: Machine Translation (MT) has been widely used for cross-lingual classification, either by translating the test set into English and running inference with a monolingual model (translate-test), or translating the training set into the target languages and finetuning a multilingual model (translate-train). However, most research in the area focuses on the multilingual models rather than the MT compo…

    Submitted 23 May, 2023; originally announced May 2023.

  12. arXiv:2212.10503  [pdf, other]

    cs.CL cs.LG

    Mini-Model Adaptation: Efficiently Extending Pretrained Models to New Languages via Aligned Shallow Training

    Authors: Kelly Marchisio, Patrick Lewis, Yihong Chen, Mikel Artetxe

    Abstract: Prior work shows that it is possible to expand pretrained Masked Language Models (MLMs) to new languages by learning a new set of embeddings, while keeping the transformer body frozen. Despite learning a small subset of parameters, this approach is not compute-efficient, as training the new embeddings requires a full forward and backward pass over the entire model. We propose mini-model adaptation…

    Submitted 4 July, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Findings of ACL 2023 Camera Ready

  13. arXiv:2212.10173  [pdf, other]

    cs.CL

    On the Role of Parallel Data in Cross-lingual Transfer Learning

    Authors: Machel Reid, Mikel Artetxe

    Abstract: While prior work has established that the use of parallel data is conducive for cross-lingual learning, it is unclear if the improvements come from the data itself, or if it is the modeling of parallel interactions that matters. Exploring this, we examine the usage of unsupervised machine translation to generate synthetic parallel data, and compare it to supervised machine translation and gold par…

    Submitted 20 December, 2022; originally announced December 2022.

    Comments: Preprint

  14. arXiv:2212.09803  [pdf, other]

    cs.CL cs.AI cs.LG

    Training Trajectories of Language Models Across Scales

    Authors: Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, Ves Stoyanov

    Abstract: Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et…

    Submitted 29 May, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: Accepted to ACL 2023; The code and analysis results are available at https://github.com/xiamengzhou/training_trajectory_analysis

  15. arXiv:2210.14803  [pdf, other]

    cs.CL cs.AI cs.LG

    Don't Prompt, Search! Mining-based Zero-Shot Learning with Language Models

    Authors: Mozes van de Kar, Mengzhou Xia, Danqi Chen, Mikel Artetxe

    Abstract: Masked language models like BERT can perform text classification in a zero-shot fashion by reformulating downstream tasks as text infilling. However, this approach is highly sensitive to the template used to prompt the model, yet practitioners are blind when designing them in strict zero-shot settings. In this paper, we propose an alternative mining-based approach for zero-shot learning. Instead o…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022

  16. State-of-the-art generalisation research in NLP: A taxonomy and review

    Authors: Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel, Christos Christodoulopoulos, Karim Lasri, Naomi Saphra, Arabella Sinclair, Dennis Ulmer, Florian Schottmann, Khuyagbaatar Batsuren, Kaiser Sun, Koustuv Sinha, Leila Khalatbari, Maria Ryskina, Rita Frieske, Ryan Cotterell, Zhijing Jin

    Abstract: The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is not well understood, nor are there any evaluation standards for generalisation. In this paper, we lay the groundwork to address both of these issues. We present a taxonomy for characterising and understanding generalisation…

    Submitted 12 January, 2024; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: This preprint was published as an Analysis article in Nature Machine Intelligence. Please refer to the published version when citing this work. 28 pages of content + 6 pages of appendix + 52 pages of references

    Journal ref: Nat Mach Intell 5, 1161-1174 (2023)

  17. arXiv:2205.15223  [pdf, other]

    cs.CL cs.LG

    Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models

    Authors: Mengzhou Xia, Mikel Artetxe, Jingfei Du, Danqi Chen, Ves Stoyanov

    Abstract: Pre-trained masked language models successfully perform few-shot learning by formulating downstream tasks as text infilling. However, as a strong alternative in full-shot settings, discriminative pre-trained models like ELECTRA do not fit into the paradigm. In this work, we adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks.…

    Submitted 26 October, 2022; v1 submitted 30 May, 2022; originally announced May 2022.

    Comments: Accepted to EMNLP 2022; The code is available at https://github.com/facebookresearch/ELECTRA-Fewshot-Learning

  18. arXiv:2205.12213  [pdf, other]

    cs.CL

    Principled Paraphrase Generation with Parallel Corpora

    Authors: Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko Agirre

    Abstract: Round-trip Machine Translation (MT) is a popular choice for paraphrase generation, which leverages readily available parallel corpora for supervision. In this paper, we formalize the implicit similarity function induced by this approach, and show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation. Based on these insights, we design an alternative similarity metri…

    Submitted 23 May, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

  19. arXiv:2205.12206  [pdf, other]

    cs.CL cs.AI

    PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation

    Authors: Aitor Ormazabal, Mikel Artetxe, Manex Agirrezabal, Aitor Soroa, Eneko Agirre

    Abstract: Formal verse poetry imposes strict constraints on the meter and rhyme scheme of poems. Most prior work on generating this type of poetry uses existing poems for supervision, which are difficult to obtain for most languages and poetic forms. In this work, we propose an unsupervised approach to generate poems following any given meter and rhyme scheme, without requiring any poetic text for training.…

    Submitted 28 October, 2022; v1 submitted 24 May, 2022; originally announced May 2022.

    Comments: EMNLP Findings 2022

  20. arXiv:2205.11726  [pdf, other]

    cs.CL cs.AI cs.LG

    On the Role of Bidirectionality in Language Model Pre-Training

    Authors: Mikel Artetxe, Jingfei Du, Naman Goyal, Luke Zettlemoyer, Ves Stoyanov

    Abstract: Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot pr…

    Submitted 26 October, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

    Comments: Findings of EMNLP 2022

  21. arXiv:2205.10835  [pdf, other]

    cs.CL

    Multilingual Machine Translation with Hyper-Adapters

    Authors: Christos Baziotis, Mikel Artetxe, James Cross, Shruti Bhosale

    Abstract: Multilingual machine translation suffers from negative interference across languages. A common solution is to relax parameter sharing with language-specific modules like adapters. However, adapters of related languages are unable to transfer information, and their total number of parameters becomes prohibitively expensive as the number of languages grows. In this work, we overcome these drawbacks…

    Submitted 5 December, 2022; v1 submitted 22 May, 2022; originally announced May 2022.

    Comments: EMNLP 2022 camera-ready version. Code at github.com/cbaziotis/fairseq under the "hyperadapters" branch (see instructions at https://github.com/cbaziotis/fairseq/tree/hyperadapters/examples/adapters)

  22. arXiv:2205.06266  [pdf, other]

    cs.CL

    Lifting the Curse of Multilinguality by Pre-training Modular Transformers

    Authors: Jonas Pfeiffer, Naman Goyal, Xi Victoria Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe

    Abstract: Multilingual pre-trained models are known to suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages. We address this issue by introducing language-specific modules, which allows us to grow the total capacity of the model, while keeping the total number of trainable parameters per language constant. In contrast with prior work that learn…

    Submitted 12 May, 2022; originally announced May 2022.

    Comments: NAACL 2022

  23. arXiv:2205.01068  [pdf, other]

    cs.CL cs.LG

    OPT: Open Pre-trained Transformer Language Models

    Authors: Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer

    Abstract: Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open…

    Submitted 21 June, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

  24. arXiv:2203.08111  [pdf, other]

    cs.CL cs.AI cs.LG

    Does Corpus Quality Really Matter for Low-Resource Languages?

    Authors: Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz Perez-de-Viñaspre, Aitor Soroa

    Abstract: The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues on the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking representation learning in Basque as a case study, we explore tailored crawling (manually identifying and scraping websites wit…

    Submitted 26 October, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

    Comments: EMNLP 2022

  25. arXiv:2203.06850  [pdf, other]

    cs.CL cs.AI

    Efficient Language Modeling with Sparse all-MLP

    Authors: Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li

    Abstract: All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature and input…

    Submitted 31 May, 2022; v1 submitted 14 March, 2022; originally announced March 2022.

  26. arXiv:2202.12837  [pdf, other]

    cs.CL cs.AI

    Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

    Authors: Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, Luke Zettlemoyer

    Abstract: Large language models (LMs) are able to in-context learn -- perform a new task via inference alone by conditioning on a few input-label pairs (demonstrations) and making predictions for new inputs. However, there has been little understanding of how the model learns and which aspects of the demonstrations contribute to end task performance. In this paper, we show that ground truth demonstrations a…

    Submitted 20 October, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

    Comments: 17 pages; 12 figures. Published as a conference paper at EMNLP 2022 (long). Code available at https://github.com/Alrope123/rethinking-demonstrations

  27. arXiv:2112.10684  [pdf, other]

    cs.CL cs.AI cs.LG

    Efficient Large Scale Language Modeling with Mixtures of Experts

    Authors: Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, Ves Stoyanov

    Abstract: Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot fine-tuning. With the exception of fine-tuning, we…

    Submitted 26 October, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

    Comments: EMNLP 2022

  28. arXiv:2112.10668  [pdf, other]

    cs.CL cs.AI

    Few-shot Learning with Multilingual Language Models

    Authors: Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li

    Abstract: Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual generative language models on a corpus covering a diverse set of languages, and study t…

    Submitted 10 November, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

    Comments: Accepted to EMNLP 2022; 34 pages

  29. arXiv:2108.01887  [pdf, other]

    cs.CL cs.LG

    PARADISE: Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining

    Authors: Machel Reid, Mikel Artetxe

    Abstract: Despite the success of multilingual sequence-to-sequence pretraining, most existing approaches rely on monolingual corpora, and do not make use of the strong cross-lingual signal contained in parallel data. In this paper, we present PARADISE (PARAllel & Denoising Integration in SEquence-to-sequence models), which extends the conventional denoising objective used to train these models by (i) replac…

    Submitted 4 August, 2021; originally announced August 2021.

    Comments: Preprint

  30. Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

    Authors: Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar

    Abstract: Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked…

    Submitted 21 May, 2021; originally announced May 2021.

    Comments: ACL SRW 2020

    Journal ref: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics - Student Research Workshop, pages 255-262, Association for Computational Linguistics, 2020

  31. arXiv:2103.12528  [pdf, other]

    cs.CL cs.AI stat.ML

    Multilingual Autoregressive Entity Linking

    Authors: Nicola De Cao, Ledell Wu, Kashyap Popat, Mikel Artetxe, Naman Goyal, Mikhail Plekhanov, Luke Zettlemoyer, Nicola Cancedda, Sebastian Riedel, Fabio Petroni

    Abstract: We present mGENRE, a sequence-to-sequence system for the Multilingual Entity Linking (MEL) problem -- the task of resolving language-specific mentions to a multilingual Knowledge Base (KB). For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token in an autoregressive fashion. The autoregressive formulation allows us to effectively cross-encode…

    Submitted 23 March, 2021; originally announced March 2021.

    Comments: 20 pages, 8 figures, and 11 tables

  32. Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through Context Anchoring

    Authors: Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko Agirre

    Abstract: Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have…

    Submitted 3 August, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

    Comments: ACL 2021

  33. arXiv:2006.01594  [pdf, other]

    cs.CL

    Training Multilingual Machine Translation by Alternately Freezing Language-Specific Encoders-Decoders

    Authors: Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa, Mikel Artetxe

    Abstract: We propose a modular architecture of language-specific encoder-decoders that constitutes a multilingual machine translation system that can be incrementally extended to new languages without the need for retraining the existing system when adding new languages. Differently from previous works, we simultaneously train $N$ languages in all translation directions by alternately freezing encoder or de…

    Submitted 29 May, 2020; originally announced June 2020.

    Comments: arXiv admin note: text overlap with arXiv:2004.06575

    ACM Class: I.2.7

  34. A Call for More Rigor in Unsupervised Cross-lingual Learning

    Authors: Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, Eneko Agirre

    Abstract: We review motivations, definition, approaches, and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them. An existing rationale for such research is based on the lack of parallel data for many of the world's languages. However, we argue that a scenario without any parallel data and abundant monolingual data is unrealistic in practice. We also dis…

    Submitted 30 April, 2020; originally announced April 2020.

    Comments: ACL 2020

  35. arXiv:2004.06575  [pdf, ps, other]

    cs.CL

    Multilingual Machine Translation: Closing the Gap between Shared and Language-specific Encoder-Decoders

    Authors: Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa, Mikel Artetxe

    Abstract: State-of-the-art multilingual machine translation relies on a universal encoder-decoder, which requires retraining the entire system to add new languages. In this paper, we propose an alternative approach that is based on language-specific encoder-decoders, and can thus be more easily extended to new languages by learning their corresponding modules. So as to encourage a common interlingua represe…

    Submitted 14 April, 2020; originally announced April 2020.

    ACM Class: I.2.7

  36. Translation Artifacts in Cross-lingual Transfer Learning

    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre

    Abstract: Both human and machine translation play a central role in cross-lingual transfer learning: many multilingual datasets have been created through professional translation services, and using machine translation to translate either the test set or the training set is a widely used transfer technique. In this paper, we show that such translation process can introduce subtle artifacts that have a notab…

    Submitted 14 December, 2020; v1 submitted 9 April, 2020; originally announced April 2020.

    Comments: EMNLP 2020

  37. Do all Roads Lead to Rome? Understanding the Role of Initialization in Iterative Back-Translation

    Authors: Mikel Artetxe, Gorka Labaka, Noe Casas, Eneko Agirre

    Abstract: Back-translation provides a simple yet effective approach to exploit monolingual corpora in Neural Machine Translation (NMT). Its iterative variant, where two opposite NMT models are jointly trained by alternately using a synthetic parallel corpus generated by the reverse model, plays a central role in unsupervised machine translation. In order to start producing sound translations and provide a m…

    Submitted 28 February, 2020; originally announced February 2020.

  38. On the Cross-lingual Transferability of Monolingual Representations

    Authors: Mikel Artetxe, Sebastian Ruder, Dani Yogatama

    Abstract: State-of-the-art unsupervised multilingual models (e.g., multilingual BERT) have been shown to generalize in a zero-shot cross-lingual setting. This generalization ability has been attributed to the use of a shared subword vocabulary and joint training across multiple languages giving rise to deep multilingual abstractions. We evaluate this hypothesis by designing an alternative approach that tran…

    Submitted 26 May, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

    Comments: ACL 2020

  39. arXiv:1907.10761  [pdf, other]

    cs.CL cs.AI cs.LG

    Bilingual Lexicon Induction through Unsupervised Machine Translation

    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre

    Abstract: A recent research line has obtained strong results on bilingual lexicon induction by aligning independently trained word embeddings in two languages and using the resulting cross-lingual embeddings to induce word translation pairs through nearest neighbor or related retrieval methods. In this paper, we propose an alternative approach to this problem that builds on the recent work on unsupervised m…

    Submitted 24 July, 2019; originally announced July 2019.

    Comments: ACL 2019

  40. Analyzing the Limitations of Cross-lingual Word Embedding Mappings

    Authors: Aitor Ormazabal, Mikel Artetxe, Gorka Labaka, Aitor Soroa, Eneko Agirre

    Abstract: Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure,…

    Submitted 12 June, 2019; originally announced June 2019.

    Comments: ACL 2019

  41. arXiv:1902.01313  [pdf, other]

    cs.CL cs.AI cs.LG

    An Effective Approach to Unsupervised Machine Translation

    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre

    Abstract: While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual corpora only. In this paper, we identify and address several deficiencies of existing unsupervised SMT approaches by exploiting subword information, developing a…

    Submitted 24 July, 2019; v1 submitted 4 February, 2019; originally announced February 2019.

    Comments: ACL 2019

  42. arXiv:1812.10464  [pdf, other]

    cs.CL cs.AI cs.LG

    Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

    Authors: Mikel Artetxe, Holger Schwenk

    Abstract: We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifi…

    Submitted 25 September, 2019; v1 submitted 26 December, 2018; originally announced December 2018.

    Comments: TACL

  43. arXiv:1811.01136  [pdf, other]

    cs.CL cs.AI cs.LG

    Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

    Authors: Mikel Artetxe, Holger Schwenk

    Abstract: Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we propose a new method for this task based on multilingual sentence embeddings. In contrast to previous approaches, which rely on nearest neighbor retrieval with a hard threshold over cosine similarity, our…

    Submitted 7 August, 2019; v1 submitted 2 November, 2018; originally announced November 2018.

    Comments: ACL 2019

  44. arXiv:1809.02094  [pdf, other]

    cs.CL cs.AI cs.LG

    Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation

    Authors: Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, Eneko Agirre

    Abstract: Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. In this paper, we show that each embedding model captures more information than directly apparent. A linear transformation that ad…

    Submitted 6 September, 2018; originally announced September 2018.

    Comments: CoNLL 2018

  45. arXiv:1809.01272  [pdf, other]

    cs.CL cs.AI cs.LG

    Unsupervised Statistical Machine Translation

    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre

    Abstract: While modern machine translation has relied on large parallel corpora, a recent line of work has managed to train Neural Machine Translation (NMT) systems from monolingual corpora only (Artetxe et al., 2018c; Lample et al., 2018). Despite the potential of this approach for low-resource settings, existing systems are far behind their supervised counterparts, limiting their practical interest. In th…

    Submitted 4 September, 2018; originally announced September 2018.

    Comments: EMNLP 2018

  46. arXiv:1805.06297  [pdf, other]

    cs.CL cs.AI cs.LG

    A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings

    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre

    Abstract: Recent work has managed to learn cross-lingual word embeddings without parallel data by mapping monolingual embeddings to a shared space through adversarial training. However, their evaluation has focused on favorable conditions, using comparable corpora or closely-related languages, and we show that they often fail in more realistic scenarios. This work proposes an alternative approach based on a…

    Submitted 17 May, 2018; v1 submitted 16 May, 2018; originally announced May 2018.

    Comments: ACL 2018

  47. arXiv:1710.11041  [pdf, other]

    cs.CL cs.AI cs.LG

    Unsupervised Neural Machine Translation

    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre, Kyunghyun Cho

    Abstract: In spite of the recent success of neural machine translation (NMT) in standard benchmarks, the lack of large parallel corpora poses a major practical problem for many language pairs. There have been several proposals to alleviate this issue with, for instance, triangulation and semi-supervised learning techniques, but they still require a strong cross-lingual signal. In this work, we completely re…

    Submitted 26 February, 2018; v1 submitted 30 October, 2017; originally announced October 2017.

    Comments: Published as a conference paper at ICLR 2018