
Showing 1–10 of 10 results for author: Tadić, M

Searching in archive cs.
  1. arXiv:2404.01753  [pdf, other]

    cs.CL

    M2SA: Multimodal and Multilingual Model for Sentiment Analysis of Tweets

    Authors: Gaurish Thakkar, Sherzod Hakimov, Marko Tadić

    Abstract: In recent years, multimodal natural language processing, aimed at learning from diverse data types, has garnered significant attention. However, there needs to be more clarity when it comes to analysing multimodal tasks in multi-lingual contexts. While prior studies on sentiment analysis of tweets have predominantly focused on the English language, this paper addresses this gap by transforming an…

    Submitted 12 June, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

    Journal ref: LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation

  2. arXiv:2305.08187  [pdf, other]

    cs.CL

    CroSentiNews 2.0: A Sentence-Level News Sentiment Corpus

    Authors: Gaurish Thakkar, Nives Mikelic Preradović, Marko Tadić

    Abstract: This article presents a sentence-level sentiment dataset for the Croatian news domain. In addition to the 3K annotated texts already present, our dataset contains 14.5K annotated sentence occurrences that have been tagged with 5 classes. We provide baseline scores in addition to the annotation process and inter-annotator agreement.

    Submitted 14 May, 2023; originally announced May 2023.

    Journal ref: Slavic NLP 2023

  3. arXiv:2305.08173  [pdf, ps, other]

    cs.CL

    Croatian Film Review Dataset (Cro-FiReDa): A Sentiment Annotated Dataset of Film Reviews

    Authors: Gaurish Thakkar, Nives Mikelic Preradovic, Marko Tadić

    Abstract: This paper introduces Cro-FiReDa, a sentiment-annotated dataset for Croatian in the domain of movie reviews. The dataset, which contains over 10,000 sentences, has been annotated at the sentence level. In addition to presenting the overall annotation process, we also present benchmark results based on the transformer-based fine-tuning approach.

    Submitted 14 May, 2023; originally announced May 2023.

    Journal ref: LTC 2023

  4. arXiv:2212.07429  [pdf, other]

    cs.CL

    Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

    Authors: Diego Alves, Gaurish Thakkar, Gabriel Amaral, Tin Kuculo, Marko Tadić

    Abstract: With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper, we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named-entities. We describe in detail the developed procedure necessary to create this type of dataset in any language available…

    Submitted 14 December, 2022; originally announced December 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2212.07162

  5. arXiv:2212.07162  [pdf, other]

    cs.CL

    Building and Evaluating Universal Named-Entity Recognition English corpus

    Authors: Diego Alves, Gaurish Thakkar, Marko Tadić

    Abstract: This article presents the application of the Universal Named Entity framework to generate automatically annotated corpora. By using a workflow that extracts Wikipedia data and meta-data and DBpedia information, we generated an English dataset which is described and evaluated. Furthermore, we conducted a set of experiments to improve the annotations in terms of precision, recall, and F1-measure. Th…

    Submitted 14 December, 2022; originally announced December 2022.

  6. arXiv:2212.07160  [pdf, other]

    cs.CL

    Multi-task Learning for Cross-Lingual Sentiment Analysis

    Authors: Gaurish Thakkar, Nives Mikelic Preradovic, Marko Tadic

    Abstract: This paper presents a cross-lingual sentiment analysis of news articles using zero-shot and few-shot learning. The study aims to classify the Croatian news articles with positive, negative, and neutral sentiments using the Slovene dataset. The system is based on a trilingual BERT-based model trained in three languages: English, Slovene, Croatian. The paper analyses different setups using datasets…

    Submitted 14 December, 2022; originally announced December 2022.

    Journal ref: vol. 2829, 2021, 76-84

  7. arXiv:2010.12433  [pdf]

    cs.CL

    Natural Language Processing Chains Inside a Cross-lingual Event-Centric Knowledge Pipeline for European Union Under-resourced Languages

    Authors: Diego Alves, Gaurish Thakkar, Marko Tadić

    Abstract: This article presents the strategy for developing a platform containing Language Processing Chains for European Union languages, consisting of Tokenization to Parsing, also including Named Entity recognition and with the addition of Sentiment Analysis. These chains are part of the first step of an event-centric knowledge processing pipeline whose aim is to process multilingual media information about ma…

    Submitted 23 October, 2020; originally announced October 2020.

  8. arXiv:2010.12428  [pdf]

    cs.CL

    Evaluating Language Tools for Fifteen EU-official Under-resourced Languages

    Authors: Diego Alves, Gaurish Thakkar, Marko Tadić

    Abstract: This article presents the results of the evaluation campaign of language tools available for fifteen EU-official under-resourced languages. The evaluation was conducted within the MSC ITN CLEOPATRA action that aims at building the cross-lingual event-centric knowledge processing on top of the application of linguistic processing chains (LPCs) for at least 24 EU-official languages. In this campaign…

    Submitted 23 October, 2020; originally announced October 2020.

  9. arXiv:2010.12406  [pdf, other]

    cs.CL

    UNER: Universal Named-Entity Recognition Framework

    Authors: Diego Alves, Tin Kuculo, Gabriel Amaral, Gaurish Thakkar, Marko Tadic

    Abstract: We introduce the Universal Named-Entity Recognition (UNER) framework, a 4-level classification hierarchy, and the methodology that is being adopted to create the first multilingual UNER corpus: the SETimes parallel corpus annotated for named-entities. First, the English SETimes corpus will be annotated using existing tools and knowledge bases. After evaluating the resulting annotations through crowdsou…

    Submitted 23 October, 2020; originally announced October 2020.

  10. arXiv:2003.13833  [pdf]

    cs.CL cs.AI cs.DL

    The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe

    Authors: Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajič, Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, et al. (22 additional authors not shown)

    Abstract: Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitu…

    Submitted 30 March, 2020; originally announced March 2020.

    Comments: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). To appear