Showing 1–23 of 23 results for author: España-Bonet, C

  1. arXiv:2310.18830  [pdf, other]

    cs.CL

    Translating away Translationese without Parallel Data

    Authors: Rricha Jalota, Koel Dutta Chowdhury, Cristina España-Bonet, Josef van Genabith

    Abstract: Translated texts exhibit systematic linguistic differences compared to original texts in the same language, and these differences are referred to as translationese. Translationese has effects on various cross-lingual natural language processing tasks, potentially leading to biased results. In this paper, we explore a novel approach to reduce translationese in translated texts: translation-based st…

    Submitted 28 October, 2023; originally announced October 2023.

    Comments: Accepted at EMNLP 2023, Main Conference

  2. arXiv:2310.16269  [pdf, other]

    cs.CL cs.AI cs.CY

    Multilingual Coarse Political Stance Classification of Media. The Editorial Line of a ChatGPT and Bard Newspaper

    Authors: Cristina España-Bonet

    Abstract: Neutrality is difficult to achieve and, in politics, subjective. Traditional media typically adopt an editorial line that can be used by their potential readers as an indicator of the media bias. Several platforms currently rate news outlets according to their political bias. The editorial line and the ratings help readers in gathering a balanced view of news. But in the advent of instruction-foll…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: To be published at EMNLP 2023 (Findings)

  3. arXiv:2308.13170  [pdf, other]

    cs.CL

    Measuring Spurious Correlation in Classification: 'Clever Hans' in Translationese

    Authors: Angana Borah, Daria Pylypenko, Cristina Espana-Bonet, Josef van Genabith

    Abstract: Recent work has shown evidence of 'Clever Hans' behavior in high-performance neural translationese classifiers, where BERT-based classifiers capitalize on spurious correlations, in particular topic information, between data and target classification labels, rather than genuine translationese signals. Translationese signals are subtle (especially for professional translation) and compete with many…

    Submitted 11 June, 2024; v1 submitted 25 August, 2023; originally announced August 2023.

    Comments: Accepted to RANLP 2023 (oral)

  4. arXiv:2305.14012  [pdf, other]

    cs.CL

    When your Cousin has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages

    Authors: Niyati Bafna, Cristina España-Bonet, Josef van Genabith, Benoît Sagot, Rachel Bawden

    Abstract: Most existing approaches for unsupervised bilingual lexicon induction (BLI) depend on good quality static or contextual embeddings requiring large monolingual corpora for both languages. However, unsupervised BLI is most likely to be useful for low-resource languages (LRLs), where large datasets are not available. Often we are interested in building bilingual resources for LRLs against related hig…

    Submitted 25 March, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 9 pages, Accepted at LREC-COLING 2024

  5. arXiv:2304.14796  [pdf, other]

    cs.CL cs.IR

    Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?

    Authors: Sonal Sannigrahi, Josef van Genabith, Cristina Espana-Bonet

    Abstract: Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key in achieving state-of-the-art results in various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and lack of appropriate data. Instead, most approaches fall back…

    Submitted 28 April, 2023; originally announced April 2023.

    Comments: EACL 2023 Findings paper, to present at LoResMT

  6. arXiv:2210.13391  [pdf, other]

    cs.CL

    Explaining Translationese: why are Neural Classifiers Better and what do they Learn?

    Authors: Kwabena Amponsah-Kaakyire, Daria Pylypenko, Josef van Genabith, Cristina España-Bonet

    Abstract: Recent work has shown that neural feature- and representation-learning, e.g. BERT, achieves superior performance over traditional manual feature engineering based approaches, with e.g. SVMs, in translationese classification tasks. Previous research did not show (i) whether the difference is because of the features, the classifiers or both, and (ii) what the neural classifiers actually learn. T…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: 16 pages, 7 figures, 4 tables. The first 2 authors contributed equally. Accepted to BlackboxNLP 2022 (at EMNLP 2022)

  7. arXiv:2205.08814  [pdf, other]

    cs.CL

    Exploiting Social Media Content for Self-Supervised Style Transfer

    Authors: Dana Ruiter, Thomas Kleinbauer, Cristina España-Bonet, Josef van Genabith, Dietrich Klakow

    Abstract: Recent research on style transfer takes inspiration from unsupervised neural machine translation (UNMT), learning from large amounts of non-parallel data by exploiting cycle consistency loss, back-translation, and denoising autoencoders. By contrast, the use of self-supervised NMT (SSNMT), which leverages (near) parallel instances hidden in non-parallel data more efficiently than UNMT, has not yet…

    Submitted 18 May, 2022; originally announced May 2022.

    Comments: 13 pages, 2 figures, accepted as a long paper at SocialNLP 2022 (@NAACL)

  8. arXiv:2205.08001  [pdf, other]

    cs.CL

    Towards Debiasing Translation Artifacts

    Authors: Koel Dutta Chowdhury, Rricha Jalota, Cristina España-Bonet, Josef van Genabith

    Abstract: Cross-lingual natural language processing relies on translation, either by humans or machines, at different levels, from translating training data to translating test sets. However, compared to original texts in the same language, translations possess distinct qualities referred to as translationese. Previous research has shown that these translation artifacts influence the performance of a variet…

    Submitted 16 May, 2022; originally announced May 2022.

    Comments: Accepted to NAACL 2022, Main Conference

  9. arXiv:2109.07604  [pdf, other]

    cs.CL

    Comparing Feature-Engineering and Feature-Learning Approaches for Multilingual Translationese Classification

    Authors: Daria Pylypenko, Kwabena Amponsah-Kaakyire, Koel Dutta Chowdhury, Josef van Genabith, Cristina España-Bonet

    Abstract: Traditional hand-crafted linguistically-informed features have often been used for distinguishing between translated and original non-translated texts. By contrast, to date, neural architectures without manual feature engineering have been less explored for this task. In this work, we (i) compare the traditional feature-engineering-based approach to the feature-learning-based one and (ii) analyse…

    Submitted 15 September, 2021; originally announced September 2021.

    Comments: 9 pages, 5 pages appendix, 2 figures, 7 tables. The first 3 authors contributed equally. Accepted to EMNLP 2021, Main Conference

  10. arXiv:2107.08772  [pdf, other]

    cs.CL

    Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages

    Authors: Dana Ruiter, Dietrich Klakow, Josef van Genabith, Cristina España-Bonet

    Abstract: For most language combinations, parallel data is either scarce or simply unavailable. To address this, unsupervised machine translation (UMT) exploits large amounts of monolingual data by using synthetic data generation techniques such as back-translation and noising, while self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them. To date, the inclusi…

    Submitted 19 July, 2021; originally announced July 2021.

    Comments: 11 pages, 8 figures, accepted at MT-Summit 2021 (Research Track)

  11. arXiv:2103.08647  [pdf, other]

    cs.CL

    The Effect of Domain and Diacritics in Yorùbá-English Neural Machine Translation

    Authors: David I. Adelani, Dana Ruiter, Jesujoba O. Alabi, Damilola Adebonojo, Adesina Ayeni, Mofe Adeyemi, Ayodele Awokoya, Cristina España-Bonet

    Abstract: Massively multilingual machine translation (MT) has shown impressive capabilities, including zero and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to lack of standardized evaluation…

    Submitted 14 August, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

    Comments: Accepted to MT Summit 2021 (Research Track)

  12. arXiv:2005.01177  [pdf, other]

    cs.CL cs.IR

    Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction

    Authors: Cristina España-Bonet, Alberto Barrón-Cedeño, Lluís Màrquez

    Abstract: We propose an automatic language-independent graph-based method to build à-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopaedia's category graph and can produce both monolingual and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743…

    Submitted 3 May, 2020; originally announced May 2020.

    Comments: 26 pages, 8 figures, 6 tables

  13. arXiv:2004.03151  [pdf, other]

    cs.CL

    Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation

    Authors: Dana Ruiter, Josef van Genabith, Cristina España-Bonet

    Abstract: Self-supervised neural machine translation (SSNMT) jointly learns to identify and select suitable training data from comparable (rather than parallel) corpora and to translate, in a way that the two tasks support each other in a virtuous circle. In this study, we provide an in-depth analysis of the sampling choices the SSNMT model makes during training. We show how, without it having been told to…

    Submitted 6 October, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

    Comments: 12 pages, 5 images, to be published at EMNLP2020

  14. arXiv:1912.04778  [pdf, other]

    cs.CL

    GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies

    Authors: Marta R. Costa-jussà, Pau Li Lin, Cristina España-Bonet

    Abstract: We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract corpus balanced in gender. While our toolkit is customizable to any number of languages (and different domains), in this work we present a corpu…

    Submitted 10 December, 2019; originally announced December 2019.

  15. arXiv:1912.02481  [pdf, ps, other]

    cs.CL

    Massive vs. Curated Word Embeddings for Low-Resourced Languages. The Case of Yorùbá and Twi

    Authors: Jesujoba O. Alabi, Kwabena Amponsah-Kaakyire, David I. Adelani, Cristina España-Bonet

    Abstract: The success of several architectures to learn semantic representations from unannotated text and the availability of these kinds of texts in online multilingual resources such as Wikipedia has facilitated the massive and automatic creation of resources for multiple languages. The evaluation of such resources is usually done for the high-resourced languages, where one has a smorgasbord of tasks and…

    Submitted 28 March, 2020; v1 submitted 5 December, 2019; originally announced December 2019.

    Comments: 9 pages, 4 tables. Accepted at LREC 2020

  16. arXiv:1911.01188  [pdf, other]

    cs.CL

    Analysing Coreference in Transformer Outputs

    Authors: Ekaterina Lapshinova-Koltunski, Cristina España-Bonet, Josef van Genabith

    Abstract: We analyse coreference phenomena in three neural machine translation systems trained with different data settings with or without access to explicit intra- and cross-sentential anaphoric information. We compare system performance on two different genres: news and TED talks. To do this, we manually annotate (the possibly incorrect) coreference chains in the MT outputs and evaluate the coreference c…

    Submitted 4 November, 2019; originally announced November 2019.

    Comments: 12 pages, 1 figure

    Journal ref: Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)

  17. An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification

    Authors: Cristina España-Bonet, Ádám Csaba Varga, Alberto Barrón-Cedeño, Josef van Genabith

    Abstract: End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. F…

    Submitted 15 November, 2017; v1 submitted 18 April, 2017; originally announced April 2017.

    Comments: 11 pages, 4 figures

    Journal ref: IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1340-1350, December 2017

  18. arXiv:1608.01910  [pdf, ps, other]

    cs.CL

    Resolving Out-of-Vocabulary Words with Bilingual Embeddings in Machine Translation

    Authors: Pranava Swaroop Madhyastha, Cristina España-Bonet

    Abstract: Out-of-vocabulary words account for a large proportion of errors in machine translation systems, especially when the system is used on a different domain than the one where it was trained. In order to alleviate the problem, we propose to use a log-bilinear softmax-based model for vocabulary expansion, such that given an out-of-vocabulary source word, the model generates a probabilistic list of pos…

    Submitted 5 August, 2016; originally announced August 2016.

    Comments: 6 pages, 3 tables

  19. Tracing the equation of state and the density of cosmological constant along z

    Authors: Cristina Espana-Bonet, Pilar Ruiz-Lapuente

    Abstract: We investigate the equation of state w(z) in a non-parametric form using the latest compilations of distance luminosity from SNe Ia at high z. We combine the inverse problem approach with a Monte Carlo to scan the space of priors. In light of these high redshift supernova data sets, we reconstruct w(z). A comparison between a sample including the latest results at z>1 and a sample without th…

    Submitted 13 May, 2008; originally announced May 2008.

    Comments: 13 pages, 6 figures

    Journal ref: JCAP 0802:018,2008

  20. Type Ia SNe along redshift: the R(Si II) ratio and the expansion velocities in intermediate z supernovae

    Authors: G. Altavilla, P. Ruiz-Lapuente, A. Balastegui, J. Mendez, M. Irwin, C. Espana-Bonet, K. Schamaneche, C. Balland, R. S. Ellis, S. Fabbro, G. Folatelli, A. Goobar, W. Hillebrandt, R. M. McMahon, M. Mouchet, A. Mourao, S. Nobili, R. Pain, V. Stanishev, N. A. Walton

    Abstract: We study intermediate-z SNe Ia using the empirical physical diagrams which enable to learn about those SNe explosions. This information can be very useful to reduce systematic uncertainties of the Hubble diagram of SNe Ia up to high z. The study of the expansion velocities and the measurement of the ratio $\mathcal{R}$(Si II) allow to subtype those SNe Ia as done for nearby samples. The evoluti…

    Submitted 5 October, 2006; originally announced October 2006.

    Comments: 55 pages, 22 figures, submitted to The Astrophysical Journal (figures reduced for astro-ph)

    Journal ref: Astrophys.J.695:135-148,2009

  21. arXiv:hep-ph/0503210  [pdf, ps, other]

    hep-ph astro-ph

    Dark Energy as an Inverse Problem

    Authors: Cristina Espana-Bonet, Pilar Ruiz-Lapuente

    Abstract: A model-independent approach to dark energy is here developed by considering the determination of its equation of state as an inverse problem. The reconstruction of w(z) as a non-parametric function using the current SNe Ia data is explored. It is investigated as well how results would improve when considering other samples of cosmic distance indicators at higher redshift. This approach reveal…

    Submitted 23 June, 2005; v1 submitted 22 March, 2005; originally announced March 2005.

    Comments: 5 figures and 3 tables, submitted to PRD

  22. arXiv:hep-ph/0311171  [pdf, ps, other]

    hep-ph astro-ph gr-qc hep-th

    Testing the running of the cosmological constant with Type Ia Supernovae at high z

    Authors: Cristina Espana-Bonet, Pilar Ruiz-Lapuente, Ilya L. Shapiro, Joan Sola

    Abstract: Within the Quantum Field Theory context the idea of a "cosmological constant" (CC) evolving with time looks quite natural as it just reflects the change of the vacuum energy with the typical energy of the universe. In the particular frame of Ref.[30], a "running CC" at low energies may arise from generic quantum effects near the Planck scale, M_P, provided there is a smooth decoupling of all mas…

    Submitted 10 February, 2004; v1 submitted 13 November, 2003; originally announced November 2003.

    Comments: LaTeX, 51 pages, 13 figures, 1 table, references added, typos corrected, version accepted in JCAP

    Report number: UB-ECM-PF-03/06

    Journal ref: JCAP0402:006,2004

  23. arXiv:astro-ph/0303306  [pdf, ps, other]

    astro-ph gr-qc hep-ph hep-th

    Variable Cosmological Constant as a Planck Scale Effect

    Authors: Ilya L. Shapiro, Joan Sola, Cristina Espana-Bonet, Pilar Ruiz-Lapuente

    Abstract: We construct a semiclassical FLRW cosmological model assuming a running cosmological constant (CC). It turns out that the CC becomes variable at arbitrarily low energies due to the remnant quantum effects of the heaviest particles, e.g. the Planck scale physics. These effects are universal in the sense that they lead to a low-energy structure common to a large class of high-energy theories. Rema…

    Submitted 4 September, 2003; v1 submitted 13 March, 2003; originally announced March 2003.

    Comments: Added one more figure, comments and references. Version accepted in Phys. Lett. B

    Report number: DF/UFJF-03/01, UB-ECM-PF 03/05

    Journal ref: Phys.Lett.B574:149-155,2003