Showing 1–4 of 4 results for author: Fransen, T

  1. arXiv:2403.04639  [pdf]

    cs.CL

    MaCmS: Magahi Code-mixed Dataset for Sentiment Analysis

    Authors: Priya Rani, Gaurav Negi, Theodorus Fransen, John P. McCrae

    Abstract: The present paper introduces new sentiment data, MaCmS, for Magahi-Hindi-English (MHE) code-mixed language, where Magahi is a less-resourced minority language. This dataset is the first Magahi-Hindi-English code-mixed dataset for sentiment analysis tasks. Further, we also provide a linguistic analysis of the dataset to understand the structure of code-mixing and a statistical study to understand…

    Submitted 22 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  2. arXiv:2311.05155  [pdf, other]

    cs.CL cs.AI

    Weakly-supervised Deep Cognate Detection Framework for Low-Resourced Languages Using Morphological Knowledge of Closely-Related Languages

    Authors: Koustava Goswami, Priya Rani, Theodorus Fransen, John P. McCrae

    Abstract: Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks, including unsupervised machine translation, named entity recognition and information retrieval. Previous approaches mainly focused on supervised cognate detection tasks based on orthographic, phonetic or state-of-the-art contextual language models, which under-perform…

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: Accepted at EMNLP 2023 conference

  3. arXiv:2108.06598  [pdf, other]

    cs.CL

    Findings of the LoResMT 2021 Shared Task on COVID and Sign Language for Low-resource Languages

    Authors: Atul Kr. Ojha, Chao-Hong Liu, Katharina Kann, John Ortega, Sheetal Shatam, Theodorus Fransen

    Abstract: We present the findings of the LoResMT 2021 shared task, which focuses on machine translation (MT) of COVID-19 data for both low-resource spoken and sign languages. The task was organized as part of the fourth workshop on technologies for machine translation of low-resource languages (LoResMT). Parallel corpora are presented and made publicly available, which include the following di…

    Submitted 18 August, 2021; v1 submitted 14 August, 2021; originally announced August 2021.

    Comments: 10 pages

  4. arXiv:2008.01545  [pdf, other]

    cs.CL

    ULD@NUIG at SemEval-2020 Task 9: Generative Morphemes with an Attention Model for Sentiment Analysis in Code-Mixed Text

    Authors: Koustava Goswami, Priya Rani, Bharathi Raja Chakravarthi, Theodorus Fransen, John P. McCrae

    Abstract: Code mixing is a common phenomenon in multilingual societies where people switch from one language to another for various reasons. Recent advances in public communication over different social media sites have led to an increase in the frequency of code-mixed usage in written language. In this paper, we present the Generative Morphemes with Attention (GenMA) Model sentiment analysis system contribu…

    Submitted 27 July, 2020; originally announced August 2020.

    Comments: To be published in the 14th International Workshop on Semantic Evaluation (SemEval-2020)