Showing 1–14 of 14 results for author: Labaka, G
  1. arXiv:2205.12213  [pdf, other]

    cs.CL

    Principled Paraphrase Generation with Parallel Corpora

    Authors: Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko Agirre

    Abstract: Round-trip Machine Translation (MT) is a popular choice for paraphrase generation, which leverages readily available parallel corpora for supervision. In this paper, we formalize the implicit similarity function induced by this approach, and show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation. Based on these insights, we design an alternative similarity metri…

    Submitted 23 May, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

  2. arXiv:2109.03659  [pdf, other]

    cs.CL

    Label Verbalization and Entailment for Effective Zero- and Few-Shot Relation Extraction

    Authors: Oscar Sainz, Oier Lopez de Lacalle, Gorka Labaka, Ander Barrena, Eneko Agirre

    Abstract: Relation extraction systems require large amounts of labeled examples which are costly to annotate. In this work we reformulate relation extraction as an entailment task, with simple, hand-made, verbalizations of relations produced in less than 15 min per relation. The system relies on a pretrained textual entailment engine which is run as-is (no training examples, zero-shot) or further fine-tuned…

    Submitted 8 September, 2021; originally announced September 2021.

    Comments: Accepted at EMNLP2021

  3. Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

    Authors: Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar

    Abstract: Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked…

    Submitted 21 May, 2021; originally announced May 2021.

    Comments: ACL SRW 2020

    Journal ref: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics - Student Research Workshop, pages 255-262, Association for Computational Linguistics, 2020

  4. Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through Context Anchoring

    Authors: Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko Agirre

    Abstract: Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have…

    Submitted 3 August, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

    Comments: ACL 2021

  5. A Call for More Rigor in Unsupervised Cross-lingual Learning

    Authors: Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, Eneko Agirre

    Abstract: We review motivations, definition, approaches, and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them. An existing rationale for such research is based on the lack of parallel data for many of the world's languages. However, we argue that a scenario without any parallel data and abundant monolingual data is unrealistic in practice. We also dis…

    Submitted 30 April, 2020; originally announced April 2020.

    Comments: ACL 2020

  6. Translation Artifacts in Cross-lingual Transfer Learning

    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre

    Abstract: Both human and machine translation play a central role in cross-lingual transfer learning: many multilingual datasets have been created through professional translation services, and using machine translation to translate either the test set or the training set is a widely used transfer technique. In this paper, we show that such translation process can introduce subtle artifacts that have a notab…

    Submitted 14 December, 2020; v1 submitted 9 April, 2020; originally announced April 2020.

    Comments: EMNLP 2020

  7. Do all Roads Lead to Rome? Understanding the Role of Initialization in Iterative Back-Translation

    Authors: Mikel Artetxe, Gorka Labaka, Noe Casas, Eneko Agirre

    Abstract: Back-translation provides a simple yet effective approach to exploit monolingual corpora in Neural Machine Translation (NMT). Its iterative variant, where two opposite NMT models are jointly trained by alternately using a synthetic parallel corpus generated by the reverse model, plays a central role in unsupervised machine translation. In order to start producing sound translations and provide a m…

    Submitted 28 February, 2020; originally announced February 2020.

  8. arXiv:1907.10761  [pdf, other]

    cs.CL cs.AI cs.LG

    Bilingual Lexicon Induction through Unsupervised Machine Translation

    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre

    Abstract: A recent research line has obtained strong results on bilingual lexicon induction by aligning independently trained word embeddings in two languages and using the resulting cross-lingual embeddings to induce word translation pairs through nearest neighbor or related retrieval methods. In this paper, we propose an alternative approach to this problem that builds on the recent work on unsupervised m…

    Submitted 24 July, 2019; originally announced July 2019.

    Comments: ACL 2019

  9. Analyzing the Limitations of Cross-lingual Word Embedding Mappings

    Authors: Aitor Ormazabal, Mikel Artetxe, Gorka Labaka, Aitor Soroa, Eneko Agirre

    Abstract: Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure,…

    Submitted 12 June, 2019; originally announced June 2019.

    Comments: ACL 2019

  10. arXiv:1902.01313  [pdf, other]

    cs.CL cs.AI cs.LG

    An Effective Approach to Unsupervised Machine Translation

    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre

    Abstract: While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual corpora only. In this paper, we identify and address several deficiencies of existing unsupervised SMT approaches by exploiting subword information, developing a…

    Submitted 24 July, 2019; v1 submitted 4 February, 2019; originally announced February 2019.

    Comments: ACL 2019

  11. arXiv:1809.02094  [pdf, other]

    cs.CL cs.AI cs.LG

    Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation

    Authors: Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, Eneko Agirre

    Abstract: Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. In this paper, we show that each embedding model captures more information than directly apparent. A linear transformation that ad…

    Submitted 6 September, 2018; originally announced September 2018.

    Comments: CoNLL 2018

  12. arXiv:1809.01272  [pdf, other]

    cs.CL cs.AI cs.LG

    Unsupervised Statistical Machine Translation

    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre

    Abstract: While modern machine translation has relied on large parallel corpora, a recent line of work has managed to train Neural Machine Translation (NMT) systems from monolingual corpora only (Artetxe et al., 2018c; Lample et al., 2018). Despite the potential of this approach for low-resource settings, existing systems are far behind their supervised counterparts, limiting their practical interest. In th…

    Submitted 4 September, 2018; originally announced September 2018.

    Comments: EMNLP 2018

  13. arXiv:1805.06297  [pdf, other]

    cs.CL cs.AI cs.LG

    A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings

    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre

    Abstract: Recent work has managed to learn cross-lingual word embeddings without parallel data by mapping monolingual embeddings to a shared space through adversarial training. However, their evaluation has focused on favorable conditions, using comparable corpora or closely-related languages, and we show that they often fail in more realistic scenarios. This work proposes an alternative approach based on a…

    Submitted 17 May, 2018; v1 submitted 16 May, 2018; originally announced May 2018.

    Comments: ACL 2018

  14. arXiv:1710.11041  [pdf, other]

    cs.CL cs.AI cs.LG

    Unsupervised Neural Machine Translation

    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre, Kyunghyun Cho

    Abstract: In spite of the recent success of neural machine translation (NMT) in standard benchmarks, the lack of large parallel corpora poses a major practical problem for many language pairs. There have been several proposals to alleviate this issue with, for instance, triangulation and semi-supervised learning techniques, but they still require a strong cross-lingual signal. In this work, we completely re…

    Submitted 26 February, 2018; v1 submitted 30 October, 2017; originally announced October 2017.

    Comments: Published as a conference paper at ICLR 2018