Showing 1–8 of 8 results for author: Zosa, E

  1. arXiv:2404.01856

    cs.CL

    Poro 34B and the Blessing of Multilinguality

    Authors: Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Aarne Talman, Ville Komulainen, Väinö Hatanpää, Peter Sarlin, Sampo Pyysalo

    Abstract: The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on indiv…

    Submitted 24 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

  2. arXiv:2403.07726

    cs.CL

    SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

    Authors: Timothee Mickus, Elaine Zosa, Raúl Vázquez, Teemu Vahtola, Jörg Tiedemann, Vincent Segonne, Alessandro Raganato, Marianna Apidianaki

    Abstract: This paper presents the results of the SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent, yet inaccurate. Such cases of overgeneration put in jeopardy many NLG applications, where correctness is often mission-critical. The shared task was conducted with a newly constructed dataset of 4000 model outputs labeled by 5 ann…

    Submitted 29 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

    Comments: SemEval 2024 shared task. Pre-review version

  3. arXiv:2310.11938

    cs.CL

    Grounded and Well-rounded: A Methodological Approach to the Study of Cross-modal and Cross-lingual Grounding

    Authors: Timothee Mickus, Elaine Zosa, Denis Paperno

    Abstract: Grounding has been argued to be a crucial component towards the development of more complete and truly semantically competent artificial intelligence systems. Literature has divided into two camps: While some argue that grounding allows for qualitatively different generalizations, others believe it can be compensated by mono-modal data quantity. Limited empirical evidence has emerged for or agains…

    Submitted 18 October, 2023; originally announced October 2023.

    Comments: accepted to Findings of EMNLP 2023

  4. arXiv:2211.08057

    cs.CL cs.AI

    Multilingual and Multimodal Topic Modelling with Pretrained Embeddings

    Authors: Elaine Zosa, Lidia Pivovarova

    Abstract: This paper presents M3L-Contrast -- a novel multimodal multilingual (M3L) neural topic model for comparable data that maps texts from multiple languages and images into a shared topic space. Our model is trained jointly on texts and images and takes advantage of pretrained document and image embeddings to abstract the complexities between different languages and modalities. As a multilingual topic…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: Published in COLING 2022 Proceedings

    ACM Class: I.2.7

  5. arXiv:2109.10033

    cs.CL

    Not All Comments are Equal: Insights into Comment Moderation from a Topic-Aware Model

    Authors: Elaine Zosa, Ravi Shekhar, Mladen Karan, Matthew Purver

    Abstract: Moderation of reader comments is a significant problem for online news platforms. Here, we experiment with models for automatic moderation, using a dataset of comments from a popular Croatian newspaper. Our analysis shows that while comments that violate the moderation rules mostly share common linguistic and thematic features, their content varies across the different sections of the newspaper. W…

    Submitted 21 September, 2021; originally announced September 2021.

    Comments: Accepted to RANLP 2021

  6. arXiv:2103.14969

    eess.IV cs.AI cs.CV cs.LG

    Catalyzing Clinical Diagnostic Pipelines Through Volumetric Medical Image Segmentation Using Deep Neural Networks: Past, Present, & Future

    Authors: Teofilo E. Zosa

    Abstract: Deep learning has made a remarkable impact in the field of natural image processing over the past decade. Consequently, there is a great deal of interest in replicating this success across unsolved tasks in related domains, such as medical image analysis. Core to medical image analysis is the task of semantic segmentation which enables various clinical workflows. Due to the challenges inherent in…

    Submitted 12 May, 2021; v1 submitted 27 March, 2021; originally announced March 2021.

    Comments: Review paper written for the UCSD PhD Research Mastery Exam; June 7, 2019

  7. arXiv:2011.10428

    cs.CL

    Topic modelling discourse dynamics in historical newspapers

    Authors: Jani Marjanen, Elaine Zosa, Simon Hengchen, Lidia Pivovarova, Mikko Tolonen

    Abstract: This paper addresses methodological issues in diachronic data analysis for historical research. We apply two families of topic models (LDA and DTM) on a relatively large set of historical newspapers, with the aim of capturing and understanding discourse dynamics. Our case study focuses on newspapers and periodicals published in Finland between 1854 and 1917, but our method can easily be transposed…

    Submitted 20 November, 2020; originally announced November 2020.

  8. Capturing Evolution in Word Usage: Just Add More Clusters?

    Authors: Matej Martinc, Syrielle Montariol, Elaine Zosa, Lidia Pivovarova

    Abstract: The way the words are used evolves through time, mirroring cultural or technological evolution of society. Semantic change detection is the task of detecting and analysing word evolution in textual data, even in short periods of time. In this paper we focus on a new set of methods relying on contextualised embeddings, a type of semantic modelling that revolutionised the NLP field recently. We leve…

    Submitted 23 January, 2020; v1 submitted 18 January, 2020; originally announced January 2020.

    Journal ref: WWW 20 Companion Proceedings of the Web Conference 2020 (April 2020) p. 343-349