Showing 1–35 of 35 results for author: Agerri, R

  1. arXiv:2406.15227

    cs.CL

    A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation

    Authors: Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

    Abstract: The proliferation of misinformation and harmful narratives in online discourse has underscored the critical need for effective Counter Narrative (CN) generation techniques. However, existing automatic evaluation methods often lack interpretability and fail to capture the nuanced relationship between generated CNs and human perception. Aiming to achieve a higher correlation with human judgments, th…

    Submitted 21 June, 2024; originally announced June 2024.

  2. arXiv:2406.08201

    cs.SI

    HTIM: Hybrid Text-Interaction Modeling for Broadening Political Leaning Inference in Social Media

    Authors: Joseba Fernandez de Landa, Arkaitz Zubiaga, Rodrigo Agerri

    Abstract: Political leaning can be defined as the inclination of an individual towards certain political orientations that align with their personal beliefs. Political leaning inference has traditionally been framed as a binary classification problem, namely, to distinguish between left vs. right or conservative vs. liberal. Furthermore, although some recent work considers political leaning inference in a mu…

    Submitted 12 June, 2024; originally announced June 2024.

  3. arXiv:2406.07964

    cs.SI cs.CL cs.CY

    Political Leaning Inference through Plurinational Scenarios

    Authors: Joseba Fernandez de Landa, Rodrigo Agerri

    Abstract: Social media users express their political preferences via interaction with other users, by spontaneous declarations or by participation in communities within the network. This makes a social network such as Twitter a valuable data source to study computational science approaches to political leaning inference. In this work we focus on three diverse regions in Spain (Basque Country, Catalonia and…

    Submitted 12 June, 2024; originally announced June 2024.

  4. arXiv:2404.07613

    cs.CL cs.AI cs.LG

    Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain

    Authors: Iker García-Ferrero, Rodrigo Agerri, Aitziber Atutxa Salazar, Elena Cabrio, Iker de la Iglesia, Alberto Lavelli, Bernardo Magnini, Benjamin Molinet, Johana Ramirez-Romero, German Rigau, Jose Maria Villa-Gonzalez, Serena Villata, Andrea Zaninello

    Abstract: Research on language technology for the development of medical applications is currently a hot topic in Natural Language Understanding and Generation. Thus, a number of large language models (LLMs) have recently been adapted to the medical domain, so that they can be used as a tool for mediating in human-AI interaction. While these LLMs display competitive performance on automated medical texts be…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: LREC-COLING 2024

  5. arXiv:2404.07053

    cs.CL cs.AI cs.LG

    Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation

    Authors: Elisa Sanchez-Bayona, Rodrigo Agerri

    Abstract: Metaphors, although occasionally unperceived, are ubiquitous in our everyday language. Thus, it is crucial for Language Models to be able to grasp the underlying meaning of this kind of figurative language. In this work, we present Meta4XNLI, a novel parallel dataset for the tasks of metaphor detection and interpretation that contains metaphor annotations in both Spanish and English. We investigat…

    Submitted 10 April, 2024; originally announced April 2024.

  6. arXiv:2404.05590

    cs.CL

    MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

    Authors: Iñigo Alonso, Maite Oronoz, Rodrigo Agerri

    Abstract: Large Language Models (LLMs) have the potential of facilitating the development of Artificial Intelligence technology to assist medical experts for interactive decision support, which has been demonstrated by their competitive performances in Medical QA. However, while impressive, the required quality bar for medical applications remains far from being achieved. Currently, LLMs remain challenged b…

    Submitted 8 April, 2024; originally announced April 2024.

  7. arXiv:2403.16968

    cs.CL

    Evaluating Shortest Edit Script Methods for Contextual Lemmatization

    Authors: Olia Toporkov, Rodrigo Agerri

    Abstract: Modern contextual lemmatizers often rely on automatically induced Shortest Edit Scripts (SES), namely, the number of edit operations to transform a word form into its lemma. In fact, different methods of computing SES have been proposed as an integral component in the architecture of several state-of-the-art contextual lemmatizers currently available. However, previous work has not investigated th…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: 13 pages, 7 tables. Accepted to LREC-COLING 2024

  8. arXiv:2403.09159

    cs.CL

    Basque and Spanish Counter Narrative Generation: Data Creation and Evaluation

    Authors: Jaione Bengoetxea, Yi-Ling Chung, Marco Guerini, Rodrigo Agerri

    Abstract: Counter Narratives (CNs) are non-negative textual responses to Hate Speech (HS) aiming at defusing online hatred and mitigating its spreading across media. Despite the recent increase in HS content posted online, research on automatic CN generation has been relatively scarce and predominantly focused on English. In this paper, we present CONAN-EUS, a new Basque and Spanish dataset for CN generatio…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: Accepted for the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING) 2024

  9. arXiv:2312.01738

    cs.SI cs.CY

    Generalizing Political Leaning Inference to Multi-Party Systems: Insights from the UK Political Landscape

    Authors: Joseba Fernandez de Landa, Arkaitz Zubiaga, Rodrigo Agerri

    Abstract: An ability to infer the political leaning of social media users can help in gathering opinion polls thereby leading to a better understanding of public opinion. While there has been a body of research attempting to infer the political leaning of social media users, this has been typically simplified as a binary classification problem (e.g. left vs right) and has been limited to a single location,…

    Submitted 4 December, 2023; originally announced December 2023.

  10. arXiv:2312.00567

    cs.CL

    Explanatory Argument Extraction of Correct Answers in Resident Medical Exams

    Authors: Iakes Goenaga, Aitziber Atutxa, Koldo Gojenola, Maite Oronoz, Rodrigo Agerri

    Abstract: Developing the required technology to assist medical experts in their everyday activities is currently a hot topic in the Artificial Intelligence research field. Thus, a number of large language models (LLMs) and automated benchmarks have recently been proposed with the aim of facilitating information extraction in Evidence-Based Medicine (EBM) using natural language as a tool for mediating in hum…

    Submitted 1 December, 2023; originally announced December 2023.

  11. arXiv:2311.14727

    cs.CL cs.LG

    Optimal Strategies to Perform Multilingual Analysis of Social Content for a Novel Dataset in the Tourism Domain

    Authors: Maxime Masson, Rodrigo Agerri, Christian Sallaberry, Marie-Noelle Bessagnet, Annig Le Parc Lacayrelle, Philippe Roose

    Abstract: The rising influence of social media platforms in various domains, including tourism, has highlighted the growing need for efficient and automated natural language processing (NLP) approaches to take advantage of this valuable resource. However, the transformation of multilingual, unstructured, and informal texts into structured knowledge often poses significant challenges. In this work, we eval…

    Submitted 20 November, 2023; originally announced November 2023.

  12. arXiv:2310.03668

    cs.CL

    GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction

    Authors: Oscar Sainz, Iker García-Ferrero, Rodrigo Agerri, Oier Lopez de Lacalle, German Rigau, Eneko Agirre

    Abstract: Large Language Models (LLMs) combined with instruction tuning have made significant progress when generalizing to unseen tasks. However, they have been less successful in Information Extraction (IE), lagging behind task-specific models. Typically, IE tasks are characterized by complex annotation guidelines that describe the task and give examples to humans. Previous attempts to leverage such infor…

    Submitted 6 March, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

    Comments: The Twelfth International Conference on Learning Representations - ICLR 2024

  13. arXiv:2306.06029

    cs.CL cs.AI

    HiTZ@Antidote: Argumentation-driven Explainable Artificial Intelligence for Digital Medicine

    Authors: Rodrigo Agerri, Iñigo Alonso, Aitziber Atutxa, Ander Berrondo, Ainara Estarrona, Iker Garcia-Ferrero, Iakes Goenaga, Koldo Gojenola, Maite Oronoz, Igor Perez-Tejedor, German Rigau, Anar Yeginbergenova

    Abstract: Providing high quality explanations for AI predictions based on machine learning is a challenging and complex task. To work well it requires, among other factors: selecting a proper level of generality/specificity of the explanation; considering assumptions about the familiarity of the explanation beneficiary with the AI task under consideration; referring to specific elements that have contribute…

    Submitted 9 June, 2023; originally announced June 2023.

    Comments: To appear: In SEPLN 2023: 39th International Conference of the Spanish Society for Natural Language Processing

  14. arXiv:2304.14221

    cs.CL

    A Modular Approach for Multilingual Timex Detection and Normalization using Deep Learning and Grammar-based methods

    Authors: Nayla Escribano, German Rigau, Rodrigo Agerri

    Abstract: Detecting and normalizing temporal expressions is an essential step for many NLP tasks. While a variety of methods have been proposed for detection, best normalization approaches rely on hand-crafted rules. Furthermore, most of them have been designed only for English. In this paper we present a modular multilingual temporal processing system combining a fine-tuned Masked Language Model for detect…

    Submitted 27 April, 2023; originally announced April 2023.

  15. arXiv:2302.00407

    cs.CL cs.AI

    On the Role of Morphological Information for Contextual Lemmatization

    Authors: Olia Toporkov, Rodrigo Agerri

    Abstract: Lemmatization is a natural language processing (NLP) task which consists of producing, from a given inflected word, its canonical form or lemma. Lemmatization is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for highly inflected languages. Given that the process to obtain a lemma from an inflected word can be explained by looking at its morphosyn…

    Submitted 20 October, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

    Comments: 30 pages, 5 figures, 11 tables; Accepted for publication in Computational Linguistics journal (to appear)

  16. arXiv:2301.10527

    cs.CL

    Cross-lingual Argument Mining in the Medical Domain

    Authors: Anar Yeginbergen, Rodrigo Agerri

    Abstract: Nowadays the medical domain is receiving more and more attention in applications involving Artificial Intelligence as clinicians' decision-making is increasingly dependent on dealing with enormous amounts of unstructured textual data. In this context, Argument Mining (AM) helps to meaningfully structure textual data by identifying the argumentative components in the text and classifying the relatio…

    Submitted 5 April, 2024; v1 submitted 25 January, 2023; originally announced January 2023.

  17. arXiv:2212.10548

    cs.CL

    T-Projection: High Quality Annotation Projection for Sequence Labeling Tasks

    Authors: Iker García-Ferrero, Rodrigo Agerri, German Rigau

    Abstract: In the absence of readily available labeled data for a given sequence labeling task and language, annotation projection has been proposed as one of the possible strategies to automatically generate annotated data. Annotation projection has often been formulated as the task of transporting, on parallel corpora, the labels pertaining to a given span in the source language into its corresponding span…

    Submitted 24 October, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Findings of the EMNLP 2023

  18. arXiv:2212.08390

    cs.CL cs.AI

    Lessons learned from the evaluation of Spanish Language Models

    Authors: Rodrigo Agerri, Eneko Agirre

    Abstract: Given the impact of language models on the field of Natural Language Processing, a number of Spanish encoder-only masked language models (aka BERTs) have been trained and released. These models were developed either within large projects using very large private corpora or by means of smaller scale academic efforts leveraging freely available data. In this paper we present a comprehensive head-to-…

    Submitted 22 September, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

    Comments: 11 pages, three tables

    Journal ref: Procesamiento del Lenguaje Natural (70), pp 157-170, 2023

  19. arXiv:2210.12623

    cs.CL

    Model and Data Transfer for Cross-Lingual Sequence Labelling in Zero-Resource Settings

    Authors: Iker García-Ferrero, Rodrigo Agerri, German Rigau

    Abstract: Zero-resource cross-lingual transfer approaches aim to apply supervised models from a source language to unlabelled target languages. In this paper we perform an in-depth study of the two main techniques employed so far for cross-lingual zero-resource sequence labelling, based either on data or model transfer. Although previous research has proposed translation and annotation projection (data-base…

    Submitted 27 April, 2023; v1 submitted 23 October, 2022; originally announced October 2022.

    Comments: Findings of the Association for Computational Linguistics: EMNLP 2022

    Journal ref: Findings of the Association for Computational Linguistics EMNLP 2022, 6403-6416

  20. arXiv:2210.10358

    cs.CL

    Leveraging a New Spanish Corpus for Multilingual and Crosslingual Metaphor Detection

    Authors: Elisa Sanchez-Bayona, Rodrigo Agerri

    Abstract: The lack of wide coverage datasets annotated with everyday metaphorical expressions for languages other than English is striking. This means that most research on supervised metaphor detection has been published only for that language. In order to address this issue, this work presents the first corpus annotated with naturally occurring metaphors in Spanish large enough to develop systems to perfo…

    Submitted 24 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: To be published in CoNLL 2022

  21. arXiv:2210.05715

    cs.CL cs.AI

    Relational Embeddings for Language Independent Stance Detection

    Authors: Joseba Fernandez de Landa, Rodrigo Agerri

    Abstract: The large majority of the research performed on stance detection has been focused on developing more or less sophisticated text classification systems, even when many benchmarks are based on social network data such as Twitter. This paper aims to take on the stance detection task by placing the emphasis not so much on the text itself but on the interaction data available on social networks. More s…

    Submitted 11 October, 2022; originally announced October 2022.

  22. arXiv:2205.01506

    cs.CL cs.AI

    BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions

    Authors: Nayla Escribano, Jon Ander González, Julen Orbegozo-Terradillos, Ainara Larrondo-Ureta, Simón Peña-Fernández, Olatz Perez-de-Viñaspre, Rodrigo Agerri

    Abstract: Parliamentary transcripts provide a valuable resource to understand the reality and know about the most important facts that occur over time in our societies. Furthermore, the political debates captured in these transcripts facilitate research on political discourse from a computational social science perspective. In this paper we release the first version of a newly compiled corpus from Basque pa…

    Submitted 3 May, 2022; originally announced May 2022.

    Comments: 9 pages, 14 figures, 4 tables. To be published in LREC 2022

  23. arXiv:2203.08111

    cs.CL cs.AI cs.LG

    Does Corpus Quality Really Matter for Low-Resource Languages?

    Authors: Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz Perez-de-Viñaspre, Aitor Soroa

    Abstract: The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues on the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking representation learning in Basque as a case study, we explore tailored crawling (manually identifying and scraping websites wit…

    Submitted 26 October, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

    Comments: EMNLP 2022

  24. arXiv:2109.13664

    cs.CL cs.CY

    Multilingual Counter Narrative Type Classification

    Authors: Yi-Ling Chung, Marco Guerini, Rodrigo Agerri

    Abstract: The growing interest in employing counter narratives for hatred intervention brings with it a focus on dataset creation and automation strategies. In this scenario, learning to recognize counter narrative types from natural text is expected to be useful for applications such as hate speech countering, where operators from non-governmental organizations are supposed to answer to hate with several a…

    Submitted 28 September, 2021; originally announced September 2021.

    Comments: To appear at the Workshop on Argument Mining 2021

  25. Social Analysis of Young Basque Speaking Communities in Twitter

    Authors: J. Fernandez de Landa, R. Agerri

    Abstract: In this paper we take into account both social and linguistic aspects to perform demographic analysis by processing a large amount of tweets in Basque language. The study of demographic characteristics and social relationships are approached by applying machine learning and modern deep-learning Natural Language Processing (NLP) techniques, combining social sciences with automatic text processing.…

    Submitted 8 September, 2021; originally announced September 2021.

    Journal ref: Journal of Multilingual and Multicultural Development (2021)

  26. Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter

    Authors: Elena Zotova, Rodrigo Agerri, German Rigau

    Abstract: Popular social media networks provide the perfect environment to study the opinions and attitudes expressed by users. While interactions in social media such as Twitter occur in many natural languages, research on stance detection (the position or attitude expressed with respect to a specific topic) within the Natural Language Processing field has largely been done for English. Although some effor…

    Submitted 28 January, 2021; originally announced January 2021.

    Comments: Stance detection, multilingualism, text categorization, fake news, deep learning

    Journal ref: Expert Systems with Applications, 170 (2021), Elsevier

  27. arXiv:2004.00050

    cs.CL

    Multilingual Stance Detection: The Catalonia Independence Corpus

    Authors: Elena Zotova, Rodrigo Agerri, Manuel Nuñez, German Rigau

    Abstract: Stance detection aims to determine the attitude of a given text with respect to a specific topic or claim. While stance detection has been fairly well researched in the last years, most of the work has been focused on English. This is mainly due to the relative lack of annotated data in other languages. The TW-10 Referendum Dataset released at IberEval 2018 is a previous effort to provide multilingua…

    Submitted 31 March, 2020; originally announced April 2020.

    Comments: Accepted at LREC 2020; 8 pages 10 tables

  28. arXiv:2004.00033

    cs.CL

    Give your Text Representation Models some Love: the Case for Basque

    Authors: Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, Eneko Agirre

    Abstract: Word embeddings and pre-trained language models allow to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal as, for many languages, the…

    Submitted 2 April, 2020; v1 submitted 31 March, 2020; originally announced April 2020.

    Comments: Accepted at LREC 2020; 8 pages, 7 tables

  29. arXiv:2001.06381

    cs.CL

    A Common Semantic Space for Monolingual and Cross-Lingual Meta-Embeddings

    Authors: Iker García-Ferrero, Rodrigo Agerri, German Rigau

    Abstract: This paper presents a new technique for creating monolingual and cross-lingual meta-embeddings. Our method integrates multiple word embeddings created from complementary techniques, textual sources, knowledge bases and languages. Existing word vectors are projected to a common semantic space using linear transformations and averaging. With our method the resulting meta-embeddings maintain the dime…

    Submitted 8 September, 2021; v1 submitted 17 January, 2020; originally announced January 2020.

  30. Language Independent Sequence Labelling for Opinion Target Extraction

    Authors: Rodrigo Agerri, German Rigau

    Abstract: In this research note we present a language independent system to model Opinion Target Extraction (OTE) as a sequence labelling task. The system consists of a combination of clustering features implemented on top of a simple set of shallow local features. Experiments on the well known Aspect Based Sentiment Analysis (ABSA) benchmarks show that our approach is very competitive across languages, obt…

    Submitted 28 January, 2019; originally announced January 2019.

    Comments: 17 pages

    Journal ref: Artificial Intelligence (2018), 268: 65-85

  31. arXiv:1810.00647

    cs.CL cs.IR

    Real Time Monitoring of Social Media and Digital Press

    Authors: Iñaki San Vicente, Xabier Saralegi, Rodrigo Agerri

    Abstract: Talaia is a platform for monitoring social media and digital press. A configurable crawler gathers content with respect to user defined domains or topics. Crawled data is processed by means of the EliXa Sentiment Analysis system. A Django powered interface provides data visualization for a user-based analysis of the data. This paper presents the architecture of the system and describes in detail i…

    Submitted 15 January, 2019; v1 submitted 28 September, 2018; originally announced October 2018.

    Comments: Preprint submission, 35 pages (22 + references and Appendices)

    MSC Class: 68T35

  32. arXiv:1702.01944

    cs.CL

    EliXa: A Modular and Flexible ABSA Platform

    Authors: Iñaki San Vicente, Xabier Saralegi, Rodrigo Agerri

    Abstract: This paper presents a supervised Aspect Based Sentiment Analysis (ABSA) system. Our aim is to develop a modular platform which allows to easily conduct experiments by replacing the modules or adding new features. We obtain the best result in the Opinion Target Extraction (OTE) task (slot 2) using an off-the-shelf sequence labeler. The target polarity classification (slot 3) is addressed by means o…

    Submitted 7 February, 2017; originally announced February 2017.

    Comments: 5 pages, conference

    ACM Class: I.2.7; H.3.1

    Journal ref: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Association for Computational Linguistics, June 2015, Denver, Colorado, pp.748-752

  33. arXiv:1702.01711

    cs.CL

    Q-WordNet PPV: Simple, Robust and (almost) Unsupervised Generation of Polarity Lexicons for Multiple Languages

    Authors: Iñaki San Vicente, Rodrigo Agerri, German Rigau

    Abstract: This paper presents a simple, robust and (almost) unsupervised dictionary-based method, qwn-ppv (Q-WordNet as Personalized PageRanking Vector) to automatically generate polarity lexicons. We show that qwn-ppv outperforms other automatically generated lexicons for the four extrinsic evaluations presented here. It also shows very competitive and robust results with respect to manually annotated ones…

    Submitted 6 February, 2017; originally announced February 2017.

    Comments: 8 pages plus 2 pages of references

    Journal ref: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), pages 88-97, Gothenburg, Sweden, April 26-30 2014

  34. arXiv:1702.00700

    cs.CL cs.AI

    Multilingual and Cross-lingual Timeline Extraction

    Authors: Egoitz Laparra, Rodrigo Agerri, Itziar Aldabe, German Rigau

    Abstract: In this paper we present an approach to extract ordered timelines of events, their participants, locations and times from a set of multilingual and cross-lingual data sources. Based on the assumption that event-related information can be recovered from different documents written in different languages, we extend the Cross-document Event Ordering task presented at SemEval 2015 by specifying two ne…

    Submitted 2 February, 2017; originally announced February 2017.

    Comments: 20 pages, 7 tables, 7 figures; submitted to Knowledge Based Systems (Elsevier), January, 2017

  35. Robust Multilingual Named Entity Recognition with Shallow Semi-Supervised Features

    Authors: Rodrigo Agerri, German Rigau

    Abstract: We present a multilingual Named Entity Recognition approach based on a robust and general set of features across languages and datasets. Our system combines shallow local information with clustering semi-supervised features induced on large amounts of unlabeled text. Understanding via empirical experimentation how to effectively combine various types of clustering features allows us to seamlessly…

    Submitted 31 January, 2017; originally announced January 2017.

    Comments: 26 pages, 19 tables (submitted for publication on September 2015), Artificial Intelligence (2016)

    Journal ref: Artificial Intelligence, 238, 63-82 (2016)