Showing 1–18 of 18 results for author: España-Bonet, C

  1. arXiv:2310.18830  [pdf, other]

    cs.CL

    Translating away Translationese without Parallel Data

    Authors: Rricha Jalota, Koel Dutta Chowdhury, Cristina España-Bonet, Josef van Genabith

    Abstract: Translated texts exhibit systematic linguistic differences compared to original texts in the same language, and these differences are referred to as translationese. Translationese has effects on various cross-lingual natural language processing tasks, potentially leading to biased results. In this paper, we explore a novel approach to reduce translationese in translated texts: translation-based st…

    Submitted 28 October, 2023; originally announced October 2023.

    Comments: Accepted at EMNLP 2023, Main Conference

  2. arXiv:2310.16269  [pdf, other]

    cs.CL cs.AI cs.CY

    Multilingual Coarse Political Stance Classification of Media. The Editorial Line of a ChatGPT and Bard Newspaper

    Authors: Cristina España-Bonet

    Abstract: Neutrality is difficult to achieve and, in politics, subjective. Traditional media typically adopt an editorial line that can be used by their potential readers as an indicator of the media bias. Several platforms currently rate news outlets according to their political bias. The editorial line and the ratings help readers in gathering a balanced view of news. But in the advent of instruction-foll…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: To be published at EMNLP 2023 (Findings)

  3. arXiv:2308.13170  [pdf, other]

    cs.CL

    Measuring Spurious Correlation in Classification: 'Clever Hans' in Translationese

    Authors: Angana Borah, Daria Pylypenko, Cristina España-Bonet, Josef van Genabith

    Abstract: Recent work has shown evidence of 'Clever Hans' behavior in high-performance neural translationese classifiers, where BERT-based classifiers capitalize on spurious correlations, in particular topic information, between data and target classification labels, rather than genuine translationese signals. Translationese signals are subtle (especially for professional translation) and compete with many…

    Submitted 11 June, 2024; v1 submitted 25 August, 2023; originally announced August 2023.

    Comments: Accepted to RANLP 2023 (oral)

  4. arXiv:2305.14012  [pdf, other]

    cs.CL

    When your Cousin has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages

    Authors: Niyati Bafna, Cristina España-Bonet, Josef van Genabith, Benoît Sagot, Rachel Bawden

    Abstract: Most existing approaches for unsupervised bilingual lexicon induction (BLI) depend on good quality static or contextual embeddings requiring large monolingual corpora for both languages. However, unsupervised BLI is most likely to be useful for low-resource languages (LRLs), where large datasets are not available. Often we are interested in building bilingual resources for LRLs against related hig…

    Submitted 25 March, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 9 pages, Accepted at LREC-COLING 2024

  5. arXiv:2304.14796  [pdf, other]

    cs.CL cs.IR

    Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?

    Authors: Sonal Sannigrahi, Josef van Genabith, Cristina España-Bonet

    Abstract: Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key in achieving state-of-the-art results in various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and lack of appropriate data. Instead, most approaches fall back…

    Submitted 28 April, 2023; originally announced April 2023.

    Comments: EACL 2023 Findings paper, to present at LoResMT

  6. arXiv:2210.13391  [pdf, other]

    cs.CL

    Explaining Translationese: why are Neural Classifiers Better and what do they Learn?

    Authors: Kwabena Amponsah-Kaakyire, Daria Pylypenko, Josef van Genabith, Cristina España-Bonet

    Abstract: Recent work has shown that neural feature- and representation-learning, e.g. BERT, achieves superior performance over traditional manual feature engineering based approaches, with e.g. SVMs, in translationese classification tasks. Previous research did not show $(i)$ whether the difference is because of the features, the classifiers or both, and $(ii)$ what the neural classifiers actually learn. T…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: 16 pages, 7 figures, 4 tables. The first 2 authors contributed equally. Accepted to BlackboxNLP 2022 (at EMNLP 2022)

  7. arXiv:2205.08814  [pdf, other]

    cs.CL

    Exploiting Social Media Content for Self-Supervised Style Transfer

    Authors: Dana Ruiter, Thomas Kleinbauer, Cristina España-Bonet, Josef van Genabith, Dietrich Klakow

    Abstract: Recent research on style transfer takes inspiration from unsupervised neural machine translation (UNMT), learning from large amounts of non-parallel data by exploiting cycle consistency loss, back-translation, and denoising autoencoders. By contrast, the use of self-supervised NMT (SSNMT), which leverages (near) parallel instances hidden in non-parallel data more efficiently than UNMT, has not yet…

    Submitted 18 May, 2022; originally announced May 2022.

    Comments: 13 pages, 2 figures, accepted as a long paper at SocialNLP 2022 (@NAACL)

  8. arXiv:2205.08001  [pdf, other]

    cs.CL

    Towards Debiasing Translation Artifacts

    Authors: Koel Dutta Chowdhury, Rricha Jalota, Cristina España-Bonet, Josef van Genabith

    Abstract: Cross-lingual natural language processing relies on translation, either by humans or machines, at different levels, from translating training data to translating test sets. However, compared to original texts in the same language, translations possess distinct qualities referred to as translationese. Previous research has shown that these translation artifacts influence the performance of a variet…

    Submitted 16 May, 2022; originally announced May 2022.

    Comments: Accepted to NAACL 2022, Main Conference

  9. arXiv:2109.07604  [pdf, other]

    cs.CL

    Comparing Feature-Engineering and Feature-Learning Approaches for Multilingual Translationese Classification

    Authors: Daria Pylypenko, Kwabena Amponsah-Kaakyire, Koel Dutta Chowdhury, Josef van Genabith, Cristina España-Bonet

    Abstract: Traditional hand-crafted linguistically-informed features have often been used for distinguishing between translated and original non-translated texts. By contrast, to date, neural architectures without manual feature engineering have been less explored for this task. In this work, we (i) compare the traditional feature-engineering-based approach to the feature-learning-based one and (ii) analyse…

    Submitted 15 September, 2021; originally announced September 2021.

    Comments: 9 pages, 5 pages appendix, 2 figures, 7 tables. The first 3 authors contributed equally. Accepted to EMNLP 2021, Main Conference

  10. arXiv:2107.08772  [pdf, other]

    cs.CL

    Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages

    Authors: Dana Ruiter, Dietrich Klakow, Josef van Genabith, Cristina España-Bonet

    Abstract: For most language combinations, parallel data is either scarce or simply unavailable. To address this, unsupervised machine translation (UMT) exploits large amounts of monolingual data by using synthetic data generation techniques such as back-translation and noising, while self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them. To date, the inclusi…

    Submitted 19 July, 2021; originally announced July 2021.

    Comments: 11 pages, 8 figures, accepted at MT-Summit 2021 (Research Track)

  11. arXiv:2103.08647  [pdf, other]

    cs.CL

    The Effect of Domain and Diacritics in Yorùbá-English Neural Machine Translation

    Authors: David I. Adelani, Dana Ruiter, Jesujoba O. Alabi, Damilola Adebonojo, Adesina Ayeni, Mofe Adeyemi, Ayodele Awokoya, Cristina España-Bonet

    Abstract: Massively multilingual machine translation (MT) has shown impressive capabilities, including zero and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to lack of standardized evaluation…

    Submitted 14 August, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

    Comments: Accepted to MT Summit 2021 (Research Track)

  12. arXiv:2005.01177  [pdf, other]

    cs.CL cs.IR

    Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction

    Authors: Cristina España-Bonet, Alberto Barrón-Cedeño, Lluís Màrquez

    Abstract: We propose an automatic language-independent graph-based method to build à-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopaedia's category graph and can produce both monolingual and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743…

    Submitted 3 May, 2020; originally announced May 2020.

    Comments: 26 pages, 8 figures, 6 tables

  13. arXiv:2004.03151  [pdf, other]

    cs.CL

    Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation

    Authors: Dana Ruiter, Josef van Genabith, Cristina España-Bonet

    Abstract: Self-supervised neural machine translation (SSNMT) jointly learns to identify and select suitable training data from comparable (rather than parallel) corpora and to translate, in a way that the two tasks support each other in a virtuous circle. In this study, we provide an in-depth analysis of the sampling choices the SSNMT model makes during training. We show how, without it having been told to…

    Submitted 6 October, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

    Comments: 12 pages, 5 images, to be published at EMNLP2020

  14. arXiv:1912.04778  [pdf, other]

    cs.CL

    GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies

    Authors: Marta R. Costa-jussà, Pau Li Lin, Cristina España-Bonet

    Abstract: We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract corpus balanced in gender. While our toolkit is customizable to any number of languages (and different domains), in this work we present a corpu…

    Submitted 10 December, 2019; originally announced December 2019.

  15. arXiv:1912.02481  [pdf, ps, other]

    cs.CL

    Massive vs. Curated Word Embeddings for Low-Resourced Languages. The Case of Yorùbá and Twi

    Authors: Jesujoba O. Alabi, Kwabena Amponsah-Kaakyire, David I. Adelani, Cristina España-Bonet

    Abstract: The success of several architectures to learn semantic representations from unannotated text and the availability of these kind of texts in online multilingual resources such as Wikipedia has facilitated the massive and automatic creation of resources for multiple languages. The evaluation of such resources is usually done for the high-resourced languages, where one has a smorgasbord of tasks and…

    Submitted 28 March, 2020; v1 submitted 5 December, 2019; originally announced December 2019.

    Comments: 9 pages, 4 tables. Accepted at LREC 2020

  16. arXiv:1911.01188  [pdf, other]

    cs.CL

    Analysing Coreference in Transformer Outputs

    Authors: Ekaterina Lapshinova-Koltunski, Cristina España-Bonet, Josef van Genabith

    Abstract: We analyse coreference phenomena in three neural machine translation systems trained with different data settings with or without access to explicit intra- and cross-sentential anaphoric information. We compare system performance on two different genres: news and TED talks. To do this, we manually annotate (the possibly incorrect) coreference chains in the MT outputs and evaluate the coreference c…

    Submitted 4 November, 2019; originally announced November 2019.

    Comments: 12 pages, 1 figure

    Journal ref: Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)

  17. An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification

    Authors: Cristina España-Bonet, Ádám Csaba Varga, Alberto Barrón-Cedeño, Josef van Genabith

    Abstract: End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, specially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. F…

    Submitted 15 November, 2017; v1 submitted 18 April, 2017; originally announced April 2017.

    Comments: 11 pages, 4 figures

    Journal ref: IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1340-1350, December 2017

  18. arXiv:1608.01910  [pdf, ps, other]

    cs.CL

    Resolving Out-of-Vocabulary Words with Bilingual Embeddings in Machine Translation

    Authors: Pranava Swaroop Madhyastha, Cristina España-Bonet

    Abstract: Out-of-vocabulary words account for a large proportion of errors in machine translation systems, especially when the system is used on a different domain than the one where it was trained. In order to alleviate the problem, we propose to use a log-bilinear softmax-based model for vocabulary expansion, such that given an out-of-vocabulary source word, the model generates a probabilistic list of pos…

    Submitted 5 August, 2016; originally announced August 2016.

    Comments: 6 pages, 3 tables